How to determine whether an upsert resulted in an insert or update WITH BULK OPERATION

Tuan_Dinh1 · January 8, 2021, 1:31am

Hi All,

My use case is that I want to upsert many documents (>1000, large enough that bulk operation seems the only suitable way). As “upsert” suggests, it is either an insert or an update.

So I use BulkWrite, with ReplaceOne model, and with ReplaceOptions.upsert(true).
So far, the bulk operation only returns a BulkWriteResult with overall inserted, modified, matched counts.

My question is: how do I know whether update or insert actually carried out for each operation in the bulk ? (To know which document got updated, and which one got inserted)
OR is there another alternative to achieve this ? I reckon one can perform update(upsert=true) for individual document but it’s too costly in my use case.

Wellcome all comments, feedback and suggestions
Tuan

Please help!

MaBeuLux88 · January 8, 2021, 2:11am

Hi @Tuan_Dinh1 and welcome in the MongoDB Community !

I wrote a little test in Python to see what I could do:

from pprint import pprint

from pymongo import MongoClient, ReplaceOne, ASCENDING


def init_mongodb():
    global coll
    client = MongoClient()
    db = client.get_database('test')
    coll = db.get_collection('coll')


if __name__ == '__main__':
    init_mongodb()

    # init content of collection for the example
    coll.delete_many({})
    coll.insert_many([
        {'_id': 2, 'name': 'Not-Lauren'},
        {'_id': 3, 'name': 'Not-Mark'}
    ])

    print('Collection content BEFORE')
    for doc in coll.find().sort('_id', ASCENDING):
        print(doc)

    bulk_ops = [
        ReplaceOne({'_id': 1}, {'name': 'Max'}, upsert=True),
        ReplaceOne({'_id': 2}, {'name': 'Lauren'}, upsert=True),
        ReplaceOne({'_id': 3}, {'name': 'Mark'}, upsert=True)
    ]
    bulk_result = coll.bulk_write(bulk_ops, ordered=True)

    print('\nBulk Result')
    pprint(bulk_result.bulk_api_result)

    print('\nCollection content AFTER')
    for doc in coll.find().sort('_id', ASCENDING):
        print(doc)

Which prints:

Collection content BEFORE
{'_id': 2, 'name': 'Not-Lauren'}
{'_id': 3, 'name': 'Not-Mark'}

Bulk Result
{'nInserted': 0,
 'nMatched': 2,
 'nModified': 2,
 'nRemoved': 0,
 'nUpserted': 1,
 'upserted': [{'_id': 1, 'index': 0}],
 'writeConcernErrors': [],
 'writeErrors': []}

Collection content AFTER
{'_id': 1, 'name': 'Max'}
{'_id': 2, 'name': 'Lauren'}
{'_id': 3, 'name': 'Mark'}

As you can see from the output I get from the bulk operations, I know that {_id: 1} was upserted and the 2 others found a matching document so they performed a replace operation.

I hope this helps.

Cheers,
Maxime.

Tuan_Dinh1 · January 8, 2021, 4:20am

Thanks @MaBeuLux88 for a very detailed explanation (Must say my experience in this community forum is pretty positive so far)

My domain language is Java, but I realised the similar thing, there is a BulkWriteUpsert list returned in the BulkWriteResult with only the document that are INSERTED and their indexes on the original list. From there, I was able to figure out which ones have been updated.

Another question though: Is there a good way to obtain document “before” it is replaced ?

So far, I have gone the long way, before running the upsert, I take the snapshot of the DB, then I run the upsert operation, from the result I work out the documents have been updated (there’s an extra field acting like unique ref). After that, I restore the DB using the previous snapshot and query for the “old” documents using those ref. I can tell it a long way, but can’t think of a faster way.

Suggestions ?
Tuan

MaBeuLux88 · January 8, 2021, 4:37am

I was about to suggest to use the Change Streams with a filter on the updates only + updateLookup but you will only get the new version of the documents, not the old one.

You could fall back on findOneAndUpdate or findOneAndReplace which both support the options upsert and returnNewDocument.

Also, as you already know which filters you will use to replaceOne your documents in your bulk operations, maybe you could also run some find operations before running your bulk. Document that don’t exist yet won’t be matched and you could retrieve the old versions of the one that do exist.

I bet these 2 solutions are faster than the restore snapshot one .

Cheers,
Maxime.