Hello Pavel,
many thanks for your prompt response. I am currently trying to implement a back-end solution for this app idea: Felse
There are several separate query types through which users search for partners:
- Language tandem
- Travel buddy
- Date
- Compatriots (people speaking the same language in a foreign country)
- more sections to come (sports, new mothers, …)
The concept has to scale up to millions of users; otherwise queries would return too few results and the idea would be lost. So I have to implement a skip mechanism (excluding already-served profiles) that stays robust up to, let's say, 1 million users.
I ended up going with Parse Server on MongoDB and the idea of bloom filters, and am currently investigating the feasibility of the following:
Each user would have the following two documents:
- A User document that can be read and written only by the user himself, with nothing indexed apart from the default _id:
```swift
struct PrsUser: ParseUser {
    //: These are required for ParseObject
    var objectId: String?
    var createdAt: Date?
    var updatedAt: Date?
    var ACL: ParseACL?

    //: These are required for ParseUser
    var username: String?
    var email: String?
    var password: String?
    var authData: [String: [String: String]?]?

    // TBD: custom fields to save on the server
    var minimumAge: Int? = 20      // min search age
    var maximumAge: Int? = 30      // max search age
    var maximumDistance: Int? = 20 // distance in km

    // Inverted bloom filters for the indexed containedIn(field, Int) query option:
    // for each 0 bit in the bloom filter there is an Int marking its position.
    // Example: [0, 1, 2, 3, 6, 7, 8, 9] <- here bits 4 and 5 are used.
    // For a fresh new user this has to be the full array [0, 1, 2, ..., 32000],
    // or even 64000 if possible.
    var datingBloomFilter: [Int]?
    var tandemBloomFilter: [Int]?
    var travelBloomFilter: [Int]?
    var compatriotsBloomFilter: [Int]?
}
```
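To illustrate the inversion described in the comments: a small sketch, in JavaScript as it would run in Parse Cloud Code, of how a filter stored as raw bits could be turned into that "available positions" array. The name `availablePositions` and the Uint8Array storage are my assumptions, not an existing API:

```javascript
// Hypothetical helper: given a bloom filter stored as raw bits
// (e.g. a Uint8Array/Buffer), return the positions of all 0 bits --
// exactly the array that the containedIn() query needs.
function availablePositions(bits, sizeInBits) {
  const available = [];
  for (let i = 0; i < sizeInBits; i++) {
    const byte = bits[i >> 3];            // byte holding bit i
    if ((byte & (1 << (i & 7))) === 0) {  // bit i is still 0 -> available
      available.push(i);
    }
  }
  return available;
}
```

For the example above (bits 4 and 5 used in a 10-bit filter), `availablePositions(new Uint8Array([0b00110000, 0]), 10)` returns [0, 1, 2, 3, 6, 7, 8, 9].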
- A Profile document with:
```swift
struct PrsProfile: ParseObject {
    // These are required for ParseObject
    var objectId: String?
    var createdAt: Date?
    var updatedAt: Date?
    var ACL: ParseACL?

    // Indexed fields that appear in queries
    var age: Int?
    var gender: Int
    var bloomFilterHashIndex: Int // this integer is used for an indexed query comparison
    // ...about 10 more fields, not indexed separately, only as combinations
    // in a compound index for each query type

    // Additional information, not indexed
    var name: String
    // ...other profile information
}
```
I have built compound indexes for each search type (date, language tandem, travel buddy, …) only in the Profile class, and the index size seems reasonably small (5 MB per 100k profile documents).
The query also performs well, as I investigated and described on the Parse Platform forum:
→ totalKeysExamined 302
→ totalDocsExamined 0
→ nReturned 0
→ executionTimeMillis 13
So 0 documents are scanned when there is no hit in the indexed results. The issue I am struggling with is the possible array size of each bloom filter in the User document:
- The bloom filter needs to be reasonably large to be effective
- For the query to use indexes, the inverted bloom filter values have to be stored.
- Each profile has to hash its id into a single indexed integer.
- Only one hash/seed is possible: comparing against two bloom filter hashes causes the query to examine documents.
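As a concrete illustration of the single-hash constraint, here is one way a profile id could be reduced to a single indexed integer. The FNV-1a hash and the name `bloomIndex` are my assumptions; any deterministic single-seed hash would do:

```javascript
// Hypothetical single-seed hash: FNV-1a over the objectId string,
// reduced modulo the bloom filter size. One profile -> one integer.
function bloomIndex(objectId, filterSizeInBits) {
  let h = 0x811c9dc5;                    // FNV offset basis
  for (let i = 0; i < objectId.length; i++) {
    h ^= objectId.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;  // FNV prime, kept in uint32 range
  }
  return h % filterSizeInBits;
}
```

The same id reduced to different filter sizes (4000, 16000, …) is what would populate per-size hash fields in the Profile document.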
As the user himself would be responsible for his bloom filters, I can do the following:
- while creating the account, the user saves his User and Profile documents
- the Profile document could contain multiple fields with hash integers for different bloom filter sizes (the same user id hashed with the same function, just reduced to different bloom filter sizes), for example:
- b1: Int = 1673
- b2: Int = 14343
- b3: Int = 24644
- The fresh new User document would start with bloom filter arrays (or better said, "available bits arrays") of a smaller size (4000). Only after some time, when the user has filled, let's say, 40% of it, he would recalculate all his served ids (these would need to be held in an additional array that could also grow unbounded) into a new, larger bloom filter (16000), and from then on use the "b2" field in the query.
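The resize step above could look roughly like this. This is a sketch under my own naming; `hashFn` stands for whatever single-seed hash produces the per-size hash values:

```javascript
// Hypothetical growth step: re-hash all already-served ids (kept in a
// side array or file) into a fresh, larger raw-bit filter.
function rebuildFilter(servedIds, newSizeInBits, hashFn) {
  const bits = new Uint8Array(Math.ceil(newSizeInBits / 8));
  for (const id of servedIds) {
    const pos = hashFn(id, newSizeInBits);
    bits[pos >> 3] |= 1 << (pos & 7); // set bit -> "this slot is used"
  }
  return bits;
}
```

The cost is proportional to the number of served ids, which is why that side array has to be kept somewhere, even if it is rarely read.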
Although the flexible bloom filter size could save some space and query response time, it needs an additional array of already-served ids. This could be stored as a file somewhere rather than in the document itself, since it is accessed only a few times.
Unfortunately, when I pass such a large array of "available hash integers" to the query, the response time grows considerably: about 4 s for a 64000-element array when I drive the query at, say, 10 QPS. This does not feel like a scalable solution. Keeping the array below 32000 elements brings the response down to 1 s or less, but then the bloom filter has a higher false-positive rate.
There is an advantage to splitting the bloom filter per query type: a user will have far fewer "dating" results near him compared to "language tandem" results for English all over the world. So the bigger the searched profile pool, the less of a problem false positives are. For example, I would not mind 50% of profiles being wrongly skipped when there are still 50% of other language tandem partners out there; on the other hand, for dating partners I would not be able to fill the bloom filter that fast. So there is a good chance of a decent user experience even with bloom filters of 32000 bits, although larger is only better.
The remaining issue is maintaining and updating the bloom filter. It has to be updated every time the user is served new profiles: he would go through the hashes/seeds and remove them from the "array of available integers". This seems to be a heavy operation when I save such an array into a document (4+ s response time).
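The update itself is conceptually simple (the expensive part is persisting the result); a sketch with names of my own choosing:

```javascript
// Hypothetical update: drop the hashes of newly served profiles from
// the "array of available integers" that feeds the containedIn() query.
function markServed(available, servedHashes) {
  const served = new Set(servedHashes);
  return available.filter((pos) => !served.has(pos));
}
```

For example, `markServed([0, 1, 2, 3, 6, 7, 8, 9], [2, 7])` returns [0, 1, 3, 6, 8, 9]; the filtering is fast in memory, so the cost is dominated by writing the big array back.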
So, to cut a long story short, I am currently investigating:
- Is there any benefit to storing a bloom filter as raw bit data embedded in the document? Is this even possible, so that such a bloom filter would be much smaller than the "array of available integers"? My current observation through db.stats() shows that 5 arrays of 32000 integers each take 1.6 MB; multiplied by the user count this grows to 1.6 TB for a million users.
- Is loading and writing such raw bit data significantly faster than a large array of integers? I believe both would have to be encoded to/from BSON?
If that were the case, I could imagine a scenario where the bloom filter is saved as binary data and a Parse Server cloud function first reads it, translates it into an "array of available integers", and then runs the query.
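On the size question alone, the raw-bit arithmetic is encouraging (using my observed 1.6 MB figure for 5 arrays of 32000 integers from db.stats()):

```javascript
// Back-of-the-envelope: raw bits vs. BSON integer array, per filter.
const bitsPerFilter = 32000;
const rawBytesPerFilter = bitsPerFilter / 8;        // 4000 B, ~4 KB per filter
const observedArrayBytes = 1_600_000 / 5;           // ~320 KB per 32000-int array (observed)
const savingsFactor = observedArrayBytes / rawBytesPerFilter; // ~80x smaller as raw bits
console.log(rawBytesPerFilter, observedArrayBytes, savingsFactor);
```

So 5 raw-bit filters would be about 20 KB per user, roughly 20 GB instead of 1.6 TB at a million users; whether the read/translate/query path is fast enough is the open question.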
As I have no experience building such a complex system (I am a mechanical engineer doing his first free-time project), I can imagine that this would reduce space consumption in the database but would ultimately not speed up the query (still comparing a huge array against the bloomFilterHashIndex of the profile), and it would also cause high CPU load while deserialising a bloom filter of 32000+ bits (generating a huge array each time the function is called). Out of curiosity I generated a dummy array inside a cloud function, const bloomFilter = dummyArray(128000);
with the following function; it still seems about twice as fast as loading a saved array of 128000 integers from the User document:
```javascript
// Fills a[i] = i for i = 0 .. N-1 (the original a[b++] = b was off by
// one, producing [1, 2, ..., N] instead of positions starting at 0).
function dummyArray(N) {
  const a = new Array(N);
  for (let i = 0; i < N; i++) a[i] = i;
  return a;
}
```
Deserialising the raw bit data into an "available integer array" could therefore gain both speed and disk-space savings, even though a big portion of the work, recomputing that large array on every call, apparently cannot be sped up.
So, to cut a long story short, one of my biggest doubts is: **is this even reasonable to attempt with MongoDB, or should I rather go and learn Elasticsearch from scratch?** There you can specify `must_not`, so I could feed in the bloom filter values without needing to invert them.
Any comments, ideas, or criticism are welcome! On the internet I found many people trying to achieve this kind of functionality, but no real solution showing it can be done with MongoDB without blowing up the server after reaching a million users.
Thank you!