PyMongo Datatype Conversion

Hello, any suggestions on the best/recommended approach for doing explicit datatype conversion from a JSON string response (a list) to MongoDB datatypes? Similar question here: https://stackoverflow.com/questions/63623084/pymongo-python-list-to-mongodb-datatype-conversion. The API response returns JSON string objects as a list.

I have followed this approach of converting only the specific fields that need a type cast, working on a copy of the list:

from datetime import datetime

results = response.json()['data']
final = []
new_data = results.copy()
for doc in results:
    try:
        new_data[0]['date'] = datetime.strptime(doc['date'], '%m/%d/%Y')
        new_data[0]['amount'] = float(doc['amount'])
        new_data[0]['quantity'] = int(float(doc['quantity']))
        new_data[0]['price'] = float(doc['price'])
        final.append(new_data[0])
    except ValueError as e:
        print(e)
        raise

# do **bulk** insert
db.insert_many(final)

However, I am getting the duplicate key error:

{'writeErrors': [{'index': 1, 'code': 11000, 'errmsg': "E11000 duplicate key error collection: testDB.testCollection index: _id_ dup key: { : ObjectId('5f4f638ace84c97b4afe81cf') }",

The length (number of records) is the same in both lists, results (original) and final (post type conversion), but the lists themselves are not identical, presumably due to the datatype conversion.

>>> results == final
False
>>> print(results)  # before datatype conversion
[{'name': 'xyz', 'desc': 'abcdefghijk', 'date': '9/1/2020', 'amount': '100', 'quantity': '2', 'price': '50'}]

>>> print(final)  # after datatype conversion
[{'name': 'xyz', 'desc': 'abcdefghijk', 'date': datetime.datetime(2020, 9, 1, 0, 0), 'amount': 100.0, 'quantity': 2, 'price': 50.0}]

I am evaluating PyMODM, a Mongoose-like ODM wrapper for PyMongo, for handling such type conversions; however, doing it the Pythonic way sounds more efficient, if it's doable.

Just my two cents regarding the use of any ORM/ODM: since MongoDB is schema-on-read, enforcing a schema/model even before loading into a database collection defeats the purpose of it being schema-less. Datatype conversion for some fields should still be OK when done explicitly during pre-processing, before loading into the Mongo database.

Please recommend / suggest a solution.

Thanks!

Hi @mani_k - thanks for your question!

If I understand what you're doing correctly, you're reading a bunch of documents from an API, fixing up the types of some of the fields, and then inserting them into your MongoDB collection.

Your problem here isn't anything to do with the datatype conversions; it's the fact that you're attempting to insert new records into your collection with the same _id as existing documents in your collection. I'm not sure where the _id is coming from, though, as I'm assuming your API doesn't provide ObjectId values!

Here’s a simplified example, showing that I get the same error:

import bson

# Insert a dummy document:
result = temp_collection.insert_one({'name': "Old value"})

# Get the _id for the document just inserted:
id_just_inserted = result.inserted_id
print(id_just_inserted)    # 5f50cbe859b718350b75bda2

try:
    # Re-use the same `_id` value when inserting a new document:
    temp_collection.insert_one({'_id': bson.ObjectId(str(id_just_inserted)), 'name': "New value"})
except Exception as e:
    print(e)

I get the following error, which is very similar to yours:

    E11000 duplicate key error collection: schema_test.temp index: _id_ dup key: { _id: ObjectId('5f50cbe859b718350b75bda2') }, full error: {'index': 0, 'code': 11000, 'keyPattern': {'_id': 1}, 'keyValue': {'_id': ObjectId('5f50cbe859b718350b75bda2')}, 'errmsg': "E11000 duplicate key error collection: schema_test.temp index: _id_ dup key: { _id: ObjectId('5f50cbe859b718350b75bda2') }"}

Your problem definitely isn’t a schema validation issue - if it was, you’d see an error like this:

WriteError: Document failed validation, full error: {'index': 0, 'code': 121, 'errmsg': 'Document failed validation'}

If you want to insert new documents, rather than update existing documents in your collection, you should remove the _id from each of your documents before running the insert_many command. If you would rather update the existing documents in your collection, I'd recommend using replace_one in a loop.
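A minimal sketch of stripping `_id` before the bulk insert (the document values here are illustrative, not taken from your API):

```python
# Hypothetical list of documents, some carrying an `_id` from a previous insert:
docs = [
    {'_id': 'abc123', 'name': 'xyz', 'price': 50.0},
    {'name': 'pqr', 'price': 75.0},
]

# Drop any `_id` so MongoDB generates a fresh ObjectId for each document:
cleaned = [{k: v for k, v in doc.items() if k != '_id'} for doc in docs]

# collection.insert_many(cleaned)  # then do the bulk insert as before
```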

If you want to update existing documents and insert a new document when the _id doesn't already exist in the collection, then you can use replace_one with the upsert parameter set to True.
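To illustrate the upsert semantics without needing a server, here is a toy in-memory model of what happens per document; with a real PyMongo collection, the loop body would be `collection.replace_one({'_id': doc['_id']}, doc, upsert=True)`:

```python
def upsert_all(store, docs):
    """Toy in-memory model of replace_one(..., upsert=True) per document:
    replace the stored document with a matching _id, or insert it if absent."""
    for doc in docs:
        store[doc['_id']] = doc  # replace-or-insert, keyed on _id
    return store

# An existing "collection" holding one document:
store = {1: {'_id': 1, 'name': 'old'}}
# _id 1 gets replaced; _id 2 doesn't exist yet, so it gets inserted:
upsert_all(store, [{'_id': 1, 'name': 'new'}, {'_id': 2, 'name': 'fresh'}])
```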

I hope this helps! Let me know if you have any more questions.


Thanks Mark - after your post, I revisited the code and realized that somehow my Python list itself was generating a duplicate row/tuple/list, due to which Mongo was seeing the same _id's. After fixing the duplicates, it now works fine. Thank you for the great explanation.
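For reference, one way such duplicate rows could be filtered out before the insert (a sketch; it assumes the documents are flat dicts with hashable values):

```python
def dedupe(docs):
    """Drop exact-duplicate flat dicts, preserving order of first occurrence."""
    seen = set()
    unique = []
    for doc in docs:
        key = tuple(sorted(doc.items()))  # hashable fingerprint of a flat dict
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

rows = [
    {'name': 'xyz', 'price': 50.0},
    {'name': 'xyz', 'price': 50.0},  # accidental duplicate
    {'name': 'pqr', 'price': 75.0},
]
deduped = dedupe(rows)
```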
