How to loop through all records in MongoDB (with resume from a particular record)

Nithin_Bandaru · May 26, 2020, 7:54pm

Hi,

I have a MongoDB collection that stores time-series data.

_id is the only unique field in the database.
I have an index on a time field with -1, so the index is created in descending order.

I am trying to do a MongoDB schema transformation, and I have written code that uses the AutoMapper library to transform each document.

I read the documents one by one, transform each, and insert it. For speed, there are configurable options for bulk read (using Skip and Limit) and bulk write.

I just wanted to know: if some error occurs and the migration fails, how can I resume?

Can I use this _id value to resume the migration?

Wondering if I can use _id as a filter condition because there is an index on this column.
I can’t use the time field as a filter because it is not unique: we could miss a few records if I use greater than, and would insert duplicates if I use less than.
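To make the problem concrete, here is a minimal sketch using an in-memory list in place of the collection. The field names, the values, and the ascending read order are assumptions for illustration only. It shows that a checkpoint on a non-unique time value either skips or re-reads documents that share that value.

```python
# In-memory stand-in for a collection; field names and values are illustrative.
docs = [
    {"_id": 1, "time": 100},
    {"_id": 2, "time": 200},
    {"_id": 3, "time": 200},  # same timestamp as _id 2
    {"_id": 4, "time": 300},
]

# Suppose the migration failed right after processing _id 2,
# and the only checkpoint we kept is that document's timestamp.
checkpoint_time = 200
already_migrated = {1, 2}

# A strict comparison skips _id 3, which shares the checkpoint timestamp.
strict = [d["_id"] for d in docs if d["time"] > checkpoint_time]

# An inclusive comparison picks up _id 3 but also re-reads _id 2.
inclusive = [d["_id"] for d in docs if d["time"] >= checkpoint_time]

missed = {d["_id"] for d in docs} - already_migrated - set(strict)
duplicated = already_migrated & set(inclusive)

print("strict resume reads:", strict)
print("inclusive resume reads:", inclusive)
print("missed:", missed, "duplicated:", duplicated)
```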

More importantly the _id column is unique. If I am able to apply $gt on this field then I will be able to jump to the last processed location without any overhead because of the index.

Basically, what I am saying is that I need one unique column that can be compared with greater than or less than. The _id column is a candidate, but I am not sure whether it is comparable. Is it?

Update: Maybe _id is not unique: https://docs.mongodb.com/manual/reference/bson-types/#objectid I am not sure what this link means. My head hurts from reading too much information. Man, this quarantine has given me a big problem, and that’s time.

With regards,
Nithin B

kevinadi · June 5, 2020, 7:17am

Hi Nithin,

I believe you can use the _id field as you proposed. However there are some caveats:

"If I am able to apply $gt on this field then I will be able to jump to the last processed location without any overhead because of the index."

This is correct, as long as your _id field is monotonically increasing and can be used for sorting in this manner. This will not work if your _id field contains a semi-random string such as a UUID.

If you’re using the auto-generated ObjectId, this should work, unless your deployment is a sharded cluster. The _id field is required to be unique within each collection. It is guarded by a unique index, so attempting to insert a document with a duplicate _id will result in an error.
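As a rough illustration of the resume idea, here is a minimal sketch that uses in-memory lists instead of real collections. Integer _id values stand in for ObjectIds, and the names fetch_batch, migrate, flaky_write and the checkpoint dictionary are all hypothetical. It assumes _id values sort in a stable order, as discussed above.

```python
from typing import Callable, List, Optional

# In-memory stand-ins for the source and target collections (illustrative only).
source = [{"_id": i, "value": i * 10} for i in range(1, 8)]
target: List[dict] = []
checkpoint = {"last_id": None}  # in practice this would be persisted somewhere


def fetch_batch(last_id: Optional[int], limit: int) -> List[dict]:
    """Return up to `limit` documents with _id greater than `last_id`, in _id order.

    Against a real collection, a driver such as pymongo would express this
    roughly as find({"_id": {"$gt": last_id}}).sort("_id", 1).limit(limit).
    """
    ordered = sorted(source, key=lambda d: d["_id"])
    remaining = [d for d in ordered if last_id is None or d["_id"] > last_id]
    return remaining[:limit]


def migrate(write_batch: Callable[[List[dict]], None], batch_size: int = 2) -> None:
    """Copy documents in _id order, recording the last _id after each batch."""
    while True:
        batch = fetch_batch(checkpoint["last_id"], batch_size)
        if not batch:
            return
        write_batch(batch)
        # Only advance the checkpoint once the batch has been written.
        checkpoint["last_id"] = batch[-1]["_id"]


calls = {"n": 0}


def flaky_write(batch: List[dict]) -> None:
    """Hypothetical writer that fails once, on its third call."""
    calls["n"] += 1
    if calls["n"] == 3:
        raise RuntimeError("simulated failure")
    target.extend({"_id": d["_id"], "new_value": d["value"]} for d in batch)


try:
    migrate(flaky_write)
except RuntimeError:
    pass  # the run died; checkpoint["last_id"] still holds the last good _id

failed_at = checkpoint["last_id"]
migrate(flaky_write)  # resume: continues after the checkpoint, not from the start
migrated_ids = [d["_id"] for d in target]
print(failed_at, migrated_ids)
```

Unlike Skip and Limit, the filter on _id lets each batch start directly at the checkpoint. In this sketch a crash between the write and the checkpoint update would replay one batch, so keeping the original _id in the target is one way to make such a replay detectable through the duplicate-key error mentioned above.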

If you need further help with this, could you post:

Your MongoDB deployment topology (standalone/replica set/sharded cluster)
An example document, showing the _id field in particular

Best regards,
Kevin

Nithin_Bandaru · April 1, 2021, 9:01pm

So, if it’s a sharded cluster, we can use the shard key + _id to make it unique across shards.
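One way to read "shard key + _id" is as a compound checkpoint, sketched below. The field name region, the sample documents, and the helper names resume_filter and remaining are hypothetical. The sketch assumes documents are read in (shard key, _id) order, and that the same _id can appear under different shard key values.

```python
from typing import Any, Dict, List

# Illustrative documents: "region" is a hypothetical shard key, and the same
# _id value appears under two different shard key values.
docs = [
    {"region": "eu", "_id": 1},
    {"region": "eu", "_id": 5},
    {"region": "us", "_id": 1},
    {"region": "us", "_id": 3},
]


def resume_filter(key_field: str, last_key: Any, last_id: Any) -> Dict[str, Any]:
    """Query filter for documents after (last_key, last_id) in (key_field, _id) order."""
    return {
        "$or": [
            {key_field: {"$gt": last_key}},
            {key_field: last_key, "_id": {"$gt": last_id}},
        ]
    }


def remaining(key_field: str, last_key: Any, last_id: Any) -> List[tuple]:
    """In-memory equivalent of the filter above, using tuple comparison."""
    pairs = sorted((d[key_field], d["_id"]) for d in docs)
    return [p for p in pairs if p > (last_key, last_id)]


# Checkpoint taken right after processing {"region": "eu", "_id": 1}.
compound = remaining("region", "eu", 1)
id_only = sorted((d["region"], d["_id"]) for d in docs if d["_id"] > 1)
query = resume_filter("region", "eu", 1)

print("compound checkpoint reads:", compound)
print("_id-only checkpoint reads:", id_only)  # misses ("us", 1)
print("filter:", query)
```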

Thanks.