MongoDB - meaning of the aggregation pipeline 100MB limit?

New to MongoDB Atlas, I am trying to understand MongoDB's 100MB limit for aggregation pipelines. What does this limit actually mean? Does it apply to the size of the collection we are running the aggregation on?

Bit of background: we have the following query on an inventory ledger. We take a data set and run a group sum to find out which products are still in stock (i.e. the summed quantity is greater than 0). For each product that is in stock, we then return its records by running a $lookup back into the original collection. The query is provided below.

Assume each inventory object contains about 10 sub-fields per record, and assume roughly 1,000 records per 1MB.

QUESTION: If the inventory collection reaches 100MB as a JSON object array, does this mean the call will fail? I.e. is the maximum we can run the aggregation on 100MB x 1,000 records/MB = 100,000 records?

BTW we are on a server that does not support writing to disk, hence the question.

db.inventory.aggregate([
  // Sum the quantity per product
  {
    $group: {
      _id: { group_id: "$product" },
      quantity: { $sum: "$quantity" }
    }
  },
  // Keep only products that are still in stock
  {
    $match: {
      quantity: { $gt: 0 }
    }
  },
  // Pull the original ledger records for each in-stock product
  // (note: foreignField takes a plain field name, not a "$"-prefixed path)
  {
    $lookup: {
      from: "inventory",
      localField: "_id.group_id",
      foreignField: "product",
      as: "records"
    }
  }
])

I wish our docs were clearer on this - there is an explanation in one of my aggregation talks, but it can be hard to dig out. Note that the docs do not say the pipeline is limited to 100MB; it’s a single stage that’s limited.

Think of the pipeline as a stream of documents - each stage takes in documents and then it outputs documents.

Some stages cannot output any documents until they have accepted all incoming documents. The most obvious example is $group. (1)

That first $group you have needs to see all the documents coming in from the collection. But it only needs to output N documents, where N is the number of distinct product values in the collection. The size of the original documents does not matter. The only thing that matters is the size of the “grouped” documents (the ones coming out of this stage), and each of them is just:

{ "_id": { "group_id": "XXXXX"}, "quantity": N }

That’s maybe 53-60 bytes, depending on how long your product value is. So to exceed 100MB you would need approximately 1.7 million distinct products - more if you remove the sub-object from the _id. :slight_smile:
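
If you want to double-check that estimate, the shell can report the BSON size of a sample grouped document. A minimal sketch in mongosh, using "XXXXX" as a placeholder product value and an arbitrary quantity:

// Object.bsonsize() returns the BSON size of a document in bytes.
// With a 5-character product string and a numeric quantity this is
// roughly 53 bytes.
Object.bsonsize({ _id: { group_id: "XXXXX" }, quantity: 42 })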

I hope you can see that your aggregation will not fail due to the 100MB limit here. All the other stages in the pipeline are “streaming” stages, meaning the batches of documents stream through them and don’t have to be accumulated all at once.
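
If you want to see how the stages behave on your actual data, you can run the same pipeline through explain. A sketch, assuming the pipeline from the question; on recent server versions the executionStats output includes per-stage statistics such as memory usage for $group:

// Run the pipeline with execution statistics to inspect each stage.
db.inventory.explain("executionStats").aggregate([
  { $group: { _id: { group_id: "$product" }, quantity: { $sum: "$quantity" } } },
  { $match: { quantity: { $gt: 0 } } },
  { $lookup: { from: "inventory", localField: "_id.group_id", foreignField: "product", as: "records" } }
])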

Asya

(1) the other example is $sort when it’s not supported by an index, and hence causes an in-memory sort.
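
For completeness: if a blocking stage (such as an unindexed $sort) ever does hit the per-stage limit, and the deployment is allowed to write temporary files, you can let that stage spill to disk with the allowDiskUse option. A sketch only - as noted above, that is not an option on a server that cannot write to disk:

// allowDiskUse lets blocking stages like $group and $sort write
// temporary files instead of failing at the 100MB per-stage limit.
db.inventory.aggregate(
  [ /* same pipeline as above */ ],
  { allowDiskUse: true }
)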

