Pymongo ignoring allowDiskUse = True?

I’m trying to delete duplicates from my database. It’s just gotten past the 500k documents mark, so it complains that the aggregate return is too big and I need to allow disk use. So I do. But… nothing happens. Same error.

def deleteDups(datab):
    col = db[datab]
    pipeline = [
        {'$group': {
            '_id': {
                'CASE NUMBER': '$CASE NUMBER',
                'JURISDICTION': '$JURISDICTION',  # needs to be case insensitive
            },
            'count': {'$sum': 1},
            'ids': {'$push': '$_id'},
        }},
        {'$match': {'count': {'$gt': 1}}},
    ]
    results = col.aggregate(pipeline, allowDiskUse = True)
    count = 0
    for result in results:
        doc_count = 0
        print(result)
        it = iter(result['ids'])
        next(it)
        for id in it:
            deleted = col.delete_one({'_id': id})
            count += 1
            doc_count += 1
            #print("API call received:", deleted.acknowledged)  # debug: is the database receiving requests?

    print("Total documents deleted:", count)

Errors out on results = col.aggregate(pipeline, allowDiskUse = True):

File "C:\Users\Laura\Documents\GitHub\Ant\controller.py", line 202, in deleteDups
    results = col.aggregate(pipeline, allowDiskUse = True)
  File "C:\Python38\lib\site-packages\pymongo\collection.py", line 2375, in aggregate
    return self._aggregate(_CollectionAggregationCommand,
  File "C:\Python38\lib\site-packages\pymongo\collection.py", line 2297, in _aggregate
    return self.__database.client._retryable_read(
  File "C:\Python38\lib\site-packages\pymongo\mongo_client.py", line 1464, in _retryable_read
    return func(session, server, sock_info, slave_ok)
  File "C:\Python38\lib\site-packages\pymongo\aggregation.py", line 136, in get_cursor
    result = sock_info.command(
  File "C:\Python38\lib\site-packages\pymongo\pool.py", line 603, in command
    return command(self.sock, dbname, spec, slave_ok,
  File "C:\Python38\lib\site-packages\pymongo\network.py", line 165, in command
    helpers._check_command_response(
  File "C:\Python38\lib\site-packages\pymongo\helpers.py", line 159, in _check_command_response
    raise OperationFailure(msg % errmsg, code, response)
pymongo.errors.OperationFailure: Exceeded memory limit for $group, but didn't allow external sort. Pass allowDiskUse:true to opt in.

I can’t even right now. I’m expecting this to be something dumb, but for the life of me I can’t see it. Thank you.

Hi @ladylaurel18_N_A, thanks for reproing this issue. Can you please provide:

  • The pymongo version and the server version where you’re seeing this issue:
>>> pymongo.__version__
'3.11.1'
>>> client.server_info()['version']
'4.4.1'
  • What kind of server are you using? Is it MongoDB Atlas, and if so, is it a free tier M0 or a paid cluster?

PyMongo 3.11.1 with MongoDB 4.4.1 works correctly as evidenced by this test:

    def test_allowDiskUse(self):
        coll = self.client.test.test_allowDiskUse
        if coll.count_documents({}) < 100:
            str_1mb = 's' * 1024 * 1024
            coll.insert_many([{'s': str_1mb, 'i': i} for i in range(101)])
        large_pipeline = [{'$group': {'_id': '$i', 's': {'$addToSet': '$s'}}}]
        with self.assertRaisesRegex(OperationFailure, 'Exceeded memory limit'):
            list(coll.aggregate(large_pipeline))
        # Passes with allowDiskUse
        list(coll.aggregate(large_pipeline, allowDiskUse=True))

PyMongo 3.10.1 and the free Atlas M0, which reports server version 4.2.10. I updated everything, which did indeed break Python (apparently NumPy is having a bad couple of months), rolled back NumPy, and successfully ran it on PyMongo 3.11.1, which gave this slightly more informative error:

 File "C:\Users\Laura\Documents\GitHub\Ant\controller.py", line 204, in deleteDups
    results = col.aggregate(pipeline, allowDiskUse = True)
  File "C:\Python38\lib\site-packages\pymongo\collection.py", line 2453, in aggregate
    return self._aggregate(_CollectionAggregationCommand,
  File "C:\Python38\lib\site-packages\pymongo\collection.py", line 2375, in _aggregate
    return self.__database.client._retryable_read(
  File "C:\Python38\lib\site-packages\pymongo\mongo_client.py", line 1471, in _retryable_read
    return func(session, server, sock_info, slave_ok)
  File "C:\Python38\lib\site-packages\pymongo\aggregation.py", line 136, in get_cursor
    result = sock_info.command(
  File "C:\Python38\lib\site-packages\pymongo\pool.py", line 683, in command
    return command(self, dbname, spec, slave_ok,
  File "C:\Python38\lib\site-packages\pymongo\network.py", line 159, in command
    helpers._check_command_response(
  File "C:\Python38\lib\site-packages\pymongo\helpers.py", line 160, in _check_command_response
    raise OperationFailure(errmsg, code, response, max_wire_version)
pymongo.errors.OperationFailure: Exceeded memory limit for $group, but didn't allow external sort. Pass allowDiskUse:true to opt in., full error: {'operationTime': Timestamp(1606167414, 43), 'ok': 0.0, 'errmsg': "Exceeded memory limit for $group, but didn't allow external sort. Pass allowDiskUse:true to opt in.", 'code': 16945, 'codeName': 'Location16945', '$clusterTime': {'clusterTime': Timestamp(1606167414, 43), 'signature': {'hash': b'\x83\xf9!~\x9a\xd1\xe6\xab\xe1\xef\xd8v\x9a\xb4\xe7\xe0\xe0\x96\x80\xd5', 'keyId': 6841937452907626500}}}

I didn’t change the code.

PyMongo 3.10.1

The server is a (so far) free Atlas M0, cluster 0 being version 4.2.10

I wonder if I update to PyMongo 3.11.1, will it break everything or fix my problem? :rofl:

You can and should upgrade to PyMongo 3.11 (it’s also compatible with MongoDB 4.2). However, that won’t fix this issue. Unfortunately, you are running into a documented limitation of Atlas M0 (Free Tier):

Atlas M0 Free Tier clusters do not support the allowDiskUse option for the aggregation command or its helper method.

AFAIK you’ll need to rework your query so that the $group stage stays under the 100 MB memory limit, or upgrade to a paid cluster.
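
One possible way to rework it, as a rough sketch rather than a tested fix: skip the $group stage entirely and do the duplicate detection on the client, so the server never has to hold the grouped ids in memory. The helper names duplicate_ids and delete_dups and the batch_size parameter are made up for illustration. It assumes the CASE NUMBER and JURISDICTION values are hashable (strings or numbers) and that a set of roughly 500k keys fits comfortably in client memory. It keeps the first document seen for each key, and it does not handle the case-insensitive matching mentioned in the original pipeline comment.

```python
def duplicate_ids(docs, key_fields=("CASE NUMBER", "JURISDICTION")):
    """Return the _id of every document whose key has already been seen.

    `docs` can be any iterable of dicts, e.g. a PyMongo cursor.
    """
    seen = set()
    extras = []
    for doc in docs:
        # Assumption: field values are hashable (str, int, None, ...).
        key = tuple(doc.get(field) for field in key_fields)
        if key in seen:
            extras.append(doc["_id"])
        else:
            seen.add(key)
    return extras


def delete_dups(col, batch_size=1000):
    """Delete all but the first document per key; return the number deleted.

    `col` is expected to be a pymongo Collection.
    """
    # Only fetch the fields needed to build the key (_id is included by default).
    projection = {"CASE NUMBER": 1, "JURISDICTION": 1}
    extras = duplicate_ids(col.find({}, projection))

    deleted = 0
    # Delete in batches so each command stays small.
    for start in range(0, len(extras), batch_size):
        batch = extras[start:start + batch_size]
        deleted += col.delete_many({"_id": {"$in": batch}}).deleted_count
    return deleted
```

Calling delete_dups(db[datab]) would take the place of the aggregate call and the delete_one loop. The trade-off is that every document's key fields travel over the network once, in exchange for never hitting the server-side memory limit.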
