Best way to update a field for all documents in (large) DB

Tuan_Dinh · April 28, 2021, 12:49am

Hi All,

I’m facing the following task:

I have 300 millions+ document in collection, and I’d like to update a field for ALL document in that collection. So the task is to retrieve the document, get the field, then do some string processing with it (parse, shorten, and do some hashing - not hard but not exactly trivial), and put that document back to the collection.
My domain language is Java.

My question is: what is the best way to accomplish this task ? Obviously, we can’t get 300M+ document at once, is there a streaming capability in mongo-java that allows my application process the documents returned from the server as a stream.

Another question is what is the most effective way to update the document after it’s retrieved and amended ? Even better, is there a way that the above task is done at the server side (so we don’t have to get, amend, and put back) ? This is similar to a simple update query (where we can “set” the value we want, and send it at once, the update is done at the server).

Comments, Suggestions, Questions are welcome.

Thanks,
Tuan

Prasad_Saya · April 28, 2021, 5:03am

Hello @Tuan_Dinh,

To update large set of documents, use Bulk Write operations. The Java client (uses MongoDB Java Driver) code you write and execute submits the update query to the server, the update happens on the server and you get the result (status, number of documents updated, etc.) back at the Java client.

Bulk updates send large set of updates calls as a single request; i.e., one request to the server and get one result of the update from the server. It is efficient as there is minimum network usage to get to the server. All the work happens at the database server.

You can use Updates with Aggregation Pipeline for complex string manipulation and update operation.

Tuan_Dinh · May 3, 2021, 5:49am

Hi @Prasad_Saya,

Thank for the reply, but can you elaborate this a bit ? I’m familiar with BulkWrite operation but still can’t get the details right. Consider the following example:

Let’s say we have current collection as below:

Each document has a token field that is comprised of 3 parts.

	{
		name: "John",
		token : "part1-part2-part3" 
	},
		{
		name: "Peter",
		token : "part1-part2-part3" 
	},
		{
		name: "Jack",
		token : "part1-part2-part3" 
	}
	....
]

The task is to update ALL document so that for each document we update the token by the logic:

Remove part2 from the token (as a string)
Perforn a sha256 hash, then
Update the token with that value.

Desired:

	{
		name: "John",
		token : "sha256Hash(part1-part3)" 
	},
		{
		name: "Peter",
		token : "sha256Hash(part1-part3)" 
	},
		{
		name: "Jack",
		token : "sha256Hash(part1-part3)" 
	}
	....
]

How exactly would the BulkWrite look like ? Let’s say we choose UpdateOne as the write operation in the BulkWrite. UpdateOne requires a filter (to find the document to update, typically a Query() object but there’s no criteria here as we want to update every single document), and then the “update” part, how to set the value that we are after (with hasing)? Also, there’s another important constraint as it’s a large data set with 300M+ documents. Even if we use BulkWrite, it can’t finish the operation in ONE command (one round trip) can’t it ?

Appreciate the follow-up.

Regards,
Tuan

Prasad_Saya · May 3, 2021, 1:40pm

Hello @Tuan_Dinh,

I guess, you have to plan your own operations. Please see this note on how bulk writes are batched: Bulk Write - Execution of Operations.

The number of operations in each group cannot exceed the value of the maxWriteBatchSize of the database. As of MongoDB 3.6, this value is 100,000. This value is shown in the isMaster.maxWriteBatchSize field.

This limit prevents issues with oversized error messages. If a group exceeds this limit, the client driver divides the group into smaller groups with counts less than or equal to the value of the limit. For example, with the maxWriteBatchSize value of 100,000, if the queue consists of 200,000 operations, the driver creates 2 groups, each with 100,000 operations.

The update method has features which allow splitting the three part string token into two required tokens (see Update with Aggreagtion Pipelein); which in turn can be used to get the sha256 hash. Creating this sha256 hash is external to MongoDB functionality.

MongoDB v4.4 and higher has the $function operator to create a JavaScript function to calculate the sha256 hash within the update operation. This means, with MongoDB v4.4 and JavaScript, you can perform the update operation for each document on the server side.

If your MongoDB version is prior to 4.4, then you have to figure some way within your application code to perform the first two steps to generate a value, then perform the update operation to update the token with the generated value.

Please try some code, and post for any improvements, corrections and suggestions. To start with, you can assume there are a small number (for example 3) of documents to update, and try.

Since you had mentioned about using MongoDB Java Driver in the initial post, see this Java Driver Tutorials - Write Operations - Bulk Writes for guidance.