Hi,
I have built a service similar to Shodan that catalogues and indexes internet hosts. I'm currently struggling to update documents at scale and would appreciate some advice.
Current architecture (all distributed, all running the latest MongoDB, 4.4):
- 3 x mongos
- 3 x configsvr in a replica set (configReplSet)
- 10 x shardsvr in a replica set (shardReplSet)
Note: all nodes communicate over at least a 1 Gbps link and have > 99.9% uptime.
The cluster has a unique index on the “ip” field. I use Motor as my Python-Mongo adapter, which itself sits on top of PyMongo.
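For reference, the index is created along these lines (the collection name hosts is illustrative, not my real one):

from pymongo import ASCENDING

# Illustrative: unique index on "ip" — duplicate inserts fail with
# E11000, which is what the update path below keys off.
await db.hosts.create_index([("ip", ASCENDING)], unique=True)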
The service performs a scan, aggregates details about each host, and prepares the results for insertion into the Mongo cluster. I use a simple try/except block for the initial inserts:
from pymongo.errors import BulkWriteError

try:
    results = await self.mongo.insert_many(prepared_hosts, ordered=False)
except BulkWriteError as e:
    details: dict = e.details
    self.logger.info(f"There were some host insertion errors:"
                     f"\nwriteErrors: {len(details.get('writeErrors', []))}"
                     f"\nnInserted: {details.get('nInserted')}"
                     f"\nwriteConcernErrors: {details.get('writeConcernErrors')}"
                     f"\nnUpserted: {details.get('nUpserted')}"
                     f"\nnMatched: {details.get('nMatched')}"
                     f"\nnRemoved: {details.get('nRemoved')}")
Note: self.mongo.insert_many is a wrapper around PyMongo’s insert_many.
Inside the except block I aggregate the writeErrors; any that failed with code 11000 (E11000 duplicate key error) are prepared for update, roughly as sketched below. And this is where the problems begin…
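A simplified sketch of that filtering step (each writeError entry carries the original document in its op field):

# Collect the documents that failed with a duplicate-key error (11000)
# so they can go through the update path instead.
duplicates = [
    err["op"]  # "op" holds the original document that failed to insert
    for err in details.get("writeErrors", [])
    if err.get("code") == 11000
]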
Attempt 1
- Pull the existing document from the collection
- Merge the new and old data
- Perform an update_one
Because each job produced anywhere from 100,000 to 10,000,000 documents, this per-document read-merge-write round trip became a bottleneck that frequently caused a job to run for days.
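In code, the per-document loop looked roughly like this (merge_host_data is an illustrative stand-in for my actual merging logic):

# Attempt 1, simplified: one read and one write per duplicate document.
for doc in duplicates:
    existing = await self.mongo.find_one({"ip": doc["ip"]})
    merged = merge_host_data(existing, doc)  # illustrative merge helper
    await self.mongo.update_one({"ip": doc["ip"]}, {"$set": merged})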
Attempt 2
- Push any duplicate documents to a “staging” queue
- A large number of workers consumes the queue, performing updates in more or less the same way as Attempt 1
This meant the jobs themselves finished in a reasonable time, but even with 256 workers I could only manage roughly 1,000 updates/second, and the queue ballooned out of control, spiralling past 200,000,000 documents within a few days.
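Each worker runs essentially the same loop as Attempt 1, just fed from the queue. A sketch (staging_queue and merge_host_data are illustrative names; my real queue is an external service rather than in-process):

import asyncio

async def update_worker(staging_queue: asyncio.Queue, mongo):
    # Attempt 2, simplified: consume duplicates from the staging queue
    # and apply the same read-merge-write cycle as Attempt 1.
    while True:
        doc = await staging_queue.get()
        existing = await mongo.find_one({"ip": doc["ip"]})
        merged = merge_host_data(existing, doc)  # illustrative merge helper
        await mongo.update_one({"ip": doc["ip"]}, {"$set": merged})
        staging_queue.task_done()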
And that is where I am today. I would appreciate any thoughts on the matter, whether about the process, the architecture, or both.