Scaling a data model using bloom filters

Hi, I am trying to achieve same functionality in my app - to avoid fetching any document twice - and unfortunately in my case I expect much more user activity.

Let’s say the user will be served with 50 profiles each day and will also go through all of them, needing to fetch another batch at least on the next day. Simple calculation can estimate that “suggestedUsers” array would grow to 503012 = 18.000 after a year.

As I was doing some research, Bloom filter with false positive would be acceptable and I could also hold separate dictionary with dictionary [ filterPosition(Int) : recycleDate(Date) ] for eventual recycling. But here I am not sure how the query would really look like and if there is even a way how to make it perform at scale.

Any idea how to address the same topic on a larger scale where suggested profiles count grows to thousands?

Thank you!

Hi @Lukas_Smilek,

Will be happy to brainstorm and help.

Can you share more details on the schema and query velocity and patterns?

This will help me get up to speed.

Thanks
Pavel

Hello Pavel,

many thanks for your prompt response. Currently I am trying to implement a back-end solution for this app idea: Felse

There are various separate query types where user search for partners:

  • Language tandem
  • Travel buddy
  • Date
  • Compatriots (people speaking the same language in foreign country)
  • there will be more sections coming (sports, fresh mothers,…)

As the concept should scale up to millions users, otherwise there would not be much results of queries anyway and the concept idea would get lost, I have to implement a skip list that would be robust enough till let’s say 1 million users.

I ended up going for Parse Server with MongoDB and idea of bloom filters and am currently investigating feasibility of following:

Each user user would have two following documents:

  1. User document that can be read and write only by user himself and there is nothing indexed apart of default _id

struct PrsUser: ParseUser {

//: These are required for ParseObject.

var objectId: String?
var createdAt: Date?
var updatedAt: Date?
var ACL: ParseACL?
//: These are required for ParseUser.

var username: String?
var email: String?
var password: String?
var authData: [String: [String: String]?]?

//TBD custom fields to save on the server
var minimumAge: Int? = 20 //min search age
var maximumAge: Int? = 30 //max search age
var maximumDistance: Int? = 20 //distance in km

//Inverted bloom filters for the containedIn(field, Int) indexed query option
//for each 0 in a bloom filter there is a Int marking a position
//example [0,1,2,3,6,7,8,9] <- here are the bits 4 and 5 used
//this will have to be a bit array for fresh new user [0, 1, 2, …, 32000] or even 64000 if possible
var datingBloomFilter: [Int]?
var tandemBloomFilter: [Int]?
var travelBloomFilter: [Int]?
var compatriotsBloomFilter: [Int]?

}

  1. Profile document with

struct PrsProfile: ParseObject {

//Those are required for Parse Object

var objectId: String?
var createdAt: Date?
var updatedAt: Date?
var ACL: ParseACL?

//indexed field that are in queries>
var age: Int?
var gender: Int
var bloomFilterHashIndex: Int <— this Indeger is used for a indexed query comparison
…about 10 various fields not indexed separately, only as combinations in a compound index of each query type

//additional information not indexed fields
var name: String
…other profile information

}

I have build compound indexes for each search type (date, language tandem, travel buddy,…) only in the profile class and these seems than the index size is reasonably small (5MB per 100k profile documents)

the query works also well as I investigated and described here on Parse Platform forum.

→ totalKeysExamined 302
→ totalDocsExamined 0
→ nReturned 0
→ executionTimeMillis 13

so there are 0 documents scanned if there is no hit in indexed results. The issue I am struggling with is the possible array size for each bloomFilter in the User document:

  1. The bloom filter needs to be reasonably large to be effective
  2. For the query to use indexes there has to be this inverted bloom filter values stored
  3. Each profile has to have an indexed id into one Integer.
  4. only one hash/seed is possible. Performing comparison on two bloom filter hashes cause the query to examine documents.

As the user himself would be responsible for his bloom filters I can do following:

  1. while creating account, user saves his User and Profile document
  2. in the Profile document could be multiple fields with hash Integers for different bloom filter sizes (the same user id hatched with same function, just for different bloom filter size. for example fields:
  • b1: Int = 1673
  • b2: Int = 14343
  • b3: Int = 24644
  1. The fresh new User document would have a bloom filter arrays (or better say “available bits arrays”) of a smaller size (4000) and only after a time, when user fills lets say 40% of it, he would recalculate all his served ids (these would need to be held in additional array that could also grow unbound) to a new, larger bloom filter (16000)… and he would use “b2” field in the query.

Although the flexible bloom filter size could save some space and query response time, it would need additional arrays of already served ids. These could be stored as a file somewhere, not needed in the document itself (accessed only few times).

Unfortunately, when I pass such a large array of “available hash integers” to the query, the query response time gets quite higher -> 4sec for 64000 array size when I trigger the query at 10QPS lets say. This does not feel as a scalable solution. Keeping array bellow 32000 helps to reduce response down to 1sec or less, but then the bloom filter gets higher false positive.

Advantage of splitting bloom filter for each query type is there… user will have much less “dating” results near him compare to “language tandem” results for english all over the world. So the bigger the searched profile pool is, the less problem it is to have false positive. Example… I would not mind that 50% of profiles are wrongly skipped when there are still 50% of other language tandem partners out there and on the other hand, for the dating partners I will not be able to fill the bloom filter so fast… So there is a high chance there would be a good user experience with bloom filters even of size of 32000bits. Although having it larger is only better.

The issue is still maintaining and updating the bloom filter. this has to be updated every time an user is served by new profiles. He would go through the hashes/seeds and remove them from “array of available Integers.” This seems to be heavy operation when I save such array in to an document (4+sec response time)

So cut the long story short I am currently investigating:

  1. is there any benefit of storing a bloom filter as raw bit data embedded in the document? Is this even possible so that the size of such bloom filter would be much smaller than size of “array of available Integers”? Current observation through db.stats() shows that 5 such arrays with 32000 Integers in one array takes 1,6MB… this would be multiplied by the user count growing to 1,6TB for million users.
  2. is loading and writing of such raw bit data significantly faster than large array of integers? I believe both would have to be coded into/from BSON format?

If that would be a case I could think of scenario where the bloom filter would be saved as binary data and the cloud function of Parse Server would first read it, translate it into a “array of available integers” and then run the query.

  1. as I have no experience in build such complex system (mechanical engineer doing his first free time project) I can imagine that this could lead to less space consumption in the database, but would ultimately not speed up the query (still comparing a huge array with the bloomFilterHashIndex of the profile) and also would lead to high cpu load while deserialise a bloom filter of 32000+ bits (generating huge array each time the function is called). Out of curiosity I generated a dummy array inside cloud function const bloomFilter = dummyArray(128000); with following function, it seems still twice so fast than loading an saved array of 128000 from the User document…

     function dummyArray(N) {
         var a=Array(N),b=0;
         while(b<N) a[b++]=b;
         return a;
     }
    

Deserialising the raw bit data into a “available integer array” therefore could gain both speed and disc space savings. Even it seem that big portion cannot be speeded up, recomputing of that big array…

So one of the biggest doubts I have is: **is this even reasonable to try with MongoDB or should I rather go and learn ElasticSearch from scratch? Because there you can specify “must_not” and the I could feed in the bloom filter values without need of inverting them.

Any comments, ideas, critics are welcomed! On the interned I found many people trying to achieve such functionality, but no real solution where it can be done with MongoDB without blowing the server after reaching million of users.

Thank you!

Hello,

for anyone facing the same issue as me, I believe I achieved pretty solid solution:

  1. as user is responsible for his own bloom filter, he prepares prepares the array of bools, where he hash the UID of the served users into one Integer and based on that “seeds” he updates his bloom filter. That array of bools is then encoded to base64. For testing and dummy bloom filter generation in Swift client I used this snippet:

    enum Bit { case zero, one
         func asInt() -> Int {
             return (self == .one) ? 1 : 0
         }
     }
     
     func generateBloomFilterData(){
         let startDate = Date()
         var randomlyFilledBloomFilter: [Bit] = []
         var bool: Bit = .zero
         for _ in 0...64000 {
             let random = arc4random_uniform(2)
             if random == 0 {
                 bool = .zero
             } else {
                 bool = .one
             }
             randomlyFilledBloomFilter.append(bool)
         }
         
         let numBytes = 1 + (randomlyFilledBloomFilter.count - 1) / 8
         var bytes = [UInt8](repeating: 0, count: numBytes)
    
         for (index, bit) in randomlyFilledBloomFilter.enumerated() {
             if bit == .one {
                 bytes[index / 8] += UInt8(1 << (7 - index % 8))
             }
         }
         
         let data = Data(bytes)
         let base64String = data.base64EncodedString()
    
    --> this is saved directly in to the User document in database
    } 
    

That way takes the bloom filter only 5kB instead of previous approx. 300kB. As I am using 4 separate filters on each user, this is huge improvement in the data transfer speed. Also upon firing the query cloud function the user’s document gets loaded much faster.

  1. the second step is to decode the bloom filter from base64 to “array of available bits.” As Parse Server uses javaScript in the cloud code I used following part of the query code:

     const buf = new Buffer.from(request.user.get("daBlm"), 'base64');
     
     var blm = Array();
    
     for (var i = 0; i < buf.length; i++) {
         //iterating through 8-bit chunks
         const byte = buf[i].toString(2);
         for (var l = 0; l < byte.length; l++) {
             //push int to array for each bloom 0 value - not used bit
             if (byte[l] == "0") {
                 blm.push(i*8 + l);
             }
         }
     }
     
     dateQuery.containedIn("bf", blm);
    

To my huge surprise this does not take much computing effort even for a dummy bloom filter with no seeds - empty one - where the loop above has to generate array of 64000 integers. With this is the user able to query only documents that has hashed UID of a value that is not yet present in his bloom filter. After he gets served on his client device with new profile documents, he recalculates his bloom filter and saves it into his User document in MongoDB database. The next query will then take this updated bloom filter, so even he would trigger next query from other device, he would not get served the same profiles again.

I believe this is pretty robust and scalable solution. Any comments or ideas?

Thank you!