I want to add the kdBush Algorithm

Aytac_Abay · February 23, 2021, 6:43pm

My question is: I have a query with the mongodb find () method. I’m fetching all my users that match the features I specified in the query. As part of my work, I used the k-mean algorithm on my server side to show users returning from mongodb in a clustered fashion. As the number of Acnak users increases, there will be delays in mongodb. To overcome this delay, I want to apply the K-mean algorithm to the last part of the Mongo answer. So users come to me clustered.

I am currently solving this problem temporarily in nodejs. I want to integrate it directly into mongodb for hanging. I have a k-mean algorithm written in JavaScript. I’ll translate this directly to c ++ and integrate it into mongo. But where is the last answer in mongodb?

I was wondering if anyone knows the exact source code. Where should I add the K-mean algorithm to the final answer section in Mongod?

Bora_Beran · February 25, 2021, 6:00am

Merhaba Aytaç,
I created a ticket to track this work here. I’d expect us to take it on when we do a general push for in-database distributed machine learning along with other popular algorithms.

I am guessing you’d like to do this in the database either because you would like to reduce the data before returning to the client (e.g. return per cluster summaries instead of each individual customer) or expect performance gains from running in the database, even though you still plan on returning a list of all the customers. Also do you intend to run clustering across all users (making it possible to run it once and use it multiple times) or on different subsets (e.g. viewer A could be looking at customers in Region X, while viewer B looking at Region Y and you want to rerun clustering just for X or Y)? Would you mind clarifying?

In the meantime have you considered precomputing centroids using k-means outside MongoDB (e.g. R, Python) then computing squared Euclidian distances between the points and centroids (values from R/Python hardcoded into the query) in MQL then assigning each point to the nearest centroid? That would be a non-trivial MQL query but might possibly help. If you’re rendering all customers it might be easier to do this in Javascript after retrieving the data from MongoDB. Since neighborhood assignment could be done in a single pass vs k-means which takes numerous iterations, that would cut your run time by few orders of magnitude at least. It would also make the results deterministic which you wouldn’t get from k-means with random seed and unless there is a significant change in user count, centroids wouldn’t move much so you would still get meaningful results.

Selamlar,

Bora

Bora_Beran · March 1, 2021, 7:12pm

Not sure what happened to the other thread about KDBush but if I remember your last message correctly is it safe to say what you’d like to get is the ID for the grid cell that KDBush index created because you treat those as clusters?

I think in many scenarios an algorithm like DBSCAN would be more suitable for spatial clustering but if the goal is to get a cell id, what are your thoughts on S2CellId?

Aytac_Abay · March 1, 2021, 8:36pm

Kdbush is rapidly processing 1m data within milliseconds at all zoom values. It is very fast, but the only problem is it works well on static data. I will try to do it dynamically. Then I tried a method with Mongodb aggregate on over 200,000 people, but the room works slowly. I’ve seen a few methods of postgis and I’ll try them. S2CellId I will investigate soon.

Aytac_Abay · March 1, 2021, 8:37pm

Bora, If you have any other ideas, don’t hesitate to tell them