What is most efficient way to get the data from a 10million data collections?

I have a collection containing 10 million docs, each with a unique content ID. Many entries share the same content ID, but I need to retrieve only the most recent entries for each content ID. Although I’ve indexed the collection, the query still takes around 30 seconds to execute.
When I am adding group by it is taking too much time more than 1min. Now I am thinking mongo is not for group by option. Please suggest any one , if you have any suggestion

What does your query look like as well as a document?
What does an explain show?
What exact indexes have you created?
On-Prem or Atlas and if Atlas what tier?
Is it a cluster or single server?

Put a sample document in mongo playground so people can see exactly what you’re doing:

[
  {
    $match: {
       "state":{$in:["DRAFT","PUBLISHED","READY_FOR_PUBLISHING","SCHEDULE_PUBLISH"]},
      "isDeleted":false,
      "contentCategory":"TV_SHOW" 
    },
  },
  { $sort: { createdAt: -1, updatedAt: -1 } },
  {
    $group: {
      _id: "$contentId",
      latestVersion: { $first: "$$ROOT" },
      versionCount: { $sum: 1 },
    },
  },
  {
    $replaceRoot: {
      newRoot: {
        $mergeObjects: [
          "$latestVersion",
          { versionCount: "$versionCount" },
        ],
      },
    },
  },

  {
    $lookup: {
      from: "ContentMetaCategories",
      localField: "_id",
      foreignField: "contentId",
      as: "ContentMetaCategory",
    },
  },
  {
    $addFields: {
      ContentMetaCategory: {
        $map: {
          input: "$ContentMetaCategory",
          as: "meta",
          in: {
            $mergeObjects: [
              "$$meta",
              "$$meta.attributes",
            ],
          },
        },
      },
    },
  },

  { $sort: { createdAt: -1 } },
  {
    $project: {
      "ContentMetaCategory.attributes": 0,
      sortByCustomKey: 0,
      convertedContentId: 0,
    },
  },
  {
    $facet: {
      paginatedResults: [
        { $skip: 0 },
        { $limit: 10 },
      ],
      queryDocumentCount: [
        { $count: "queryCount" },
      ],
    },
  },
]

explain is showing below

{
  "explainVersion": "2",
  "stages": [
    {
      "$cursor": {
        "queryPlanner": {
          "namespace": "testingDb.Contents",
          "indexFilterSet": false,
          "parsedQuery": {
            "$and": [
              {
                "contentCategory": {
                  "$eq": "TV_SHOW"
                }
              },
              {
                "isDeleted": {
                  "$eq": false
                }
              },
              {
                "state": {
                  "$in": [
                    "DRAFT",
                    "PUBLISHED",
                    "READY_FOR_PUBLISHING",
                    "SCHEDULE_PUBLISH"
                  ]
                }
              }
            ]
          },
          "queryHash": "CD1F4F0F",
          "planCacheKey": "7640F37A",
          "maxIndexedOrSolutionsReached": false,
          "maxIndexedAndSolutionsReached": false,
          "maxScansToExplodeReached": false,
          "winningPlan": {
            "queryPlan": {
              "stage": "GROUP",
              "planNodeId": 3,
              "inputStage": {
                "stage": "FETCH",
                "planNodeId": 2,
                "filter": {
                  "$and": [
                    {
                      "contentCategory": {
                        "$eq": "TV_SHOW"
                      }
                    },
                    {
                      "isDeleted": {
                        "$eq": false
                      }
                    },
                    {
                      "state": {
                        "$in": [
                          "DRAFT",
                          "PUBLISHED",
                          "READY_FOR_PUBLISHING",
                          "SCHEDULE_PUBLISH"
                        ]
                      }
                    }
                  ]
                },
                "inputStage": {
                  "stage": "IXSCAN",
                  "planNodeId": 1,
                  "keyPattern": {
                    "createdAt": -1,
                    "updatedAt": -1
                  },
                  "indexName": "createdAt_-1_updatedAt_-1",
                  "isMultiKey": false,
                  "multiKeyPaths": {
                    "createdAt": [],
                    "updatedAt": []
                  },
                  "isUnique": false,
                  "isSparse": false,
                  "isPartial": false,
                  "indexVersion": 2,
                  "direction": "forward",
                  "indexBounds": {
                    "createdAt": [
                      "[MaxKey, MinKey]"
                    ],
                    "updatedAt": [
                      "[MaxKey, MinKey]"
                    ]
                  }
                }
              }
            },
            "slotBasedPlan": {
              "slots": "$$RESULT=s23 env: { s3 = 1711564015655 (NOW), s17 = [\"PUBLISHED\", \"READY_FOR_PUBLISHING\", \"SCHEDULE_PUBLISH\", \"DRAFT\"], s16 = false, s2 = Nothing (SEARCH_META), s1 = TimeZoneDatabase(America/Argentina/San_Juan...Australia/Eucla) (timeZoneDB), s8 = {\"createdAt\" : -1, \"updatedAt\" : -1}, s15 = \"TV_SHOW\" }",
              "stages": "[3] mkobj s23 [_id = s18, latestVersion = s19, versionCount = s21] true false \n[3] group [s18] [s19 = first((s9 ?: null)), s21 = sum(1)] spillSlots[s20, s22] mergingExprs[first((s20 ?: null)), sum(s22)] \n[3] project [s18 = (s11 ?: null)] \n[2] filter {(traverseF(s12, lambda(l1.0) { ((l1.0 == s15) ?: false) }, false) && (traverseF(s13, lambda(l2.0) { ((l2.0 == s16) ?: false) }, false) && traverseF(s14, lambda(l3.0) { isMember(l3.0, s17) }, false)))} \n[2] nlj inner [] [s4, s5, s6, s7, s8] \n    left \n        [1] ixseek KS(0F0F0104) KS(F5F5FE04) s7 s4 s5 s6 lowPriority [] @\"0e866e4f-1945-43b4-86fd-9fd9c3b32302\" @\"createdAt_-1_updatedAt_-1\" true \n    right \n        [2] limit 1 \n        [2] seek s4 s9 s10 s5 s6 s7 s8 [s11 = contentId, s12 = contentCategory, s13 = isDeleted, s14 = state] @\"0e866e4f-1945-43b4-86fd-9fd9c3b32302\" true false \n"
            }
          },
          "rejectedPlans": []
        }
      }
    },
    {
      "$replaceRoot": {
        "newRoot": {
          "$mergeObjects": [
            "$latestVersion",
            {
              "versionCount": "$versionCount"
            }
          ]
        }
      }
    },
    {
      "$lookup": {
        "from": "ContentMetaCategories",
        "as": "ContentMetaCategory",
        "localField": "_id",
        "foreignField": "contentId"
      }
    },
    {
      "$addFields": {
        "ContentMetaCategory": {
          "$map": {
            "input": "$ContentMetaCategory",
            "as": "meta",
            "in": {
              "$mergeObjects": [
                "$$meta",
                "$$meta.attributes"
              ]
            }
          }
        }
      }
    },
    {
      "$sort": {
        "sortKey": {
          "createdAt": -1
        }
      }
    },
    {
      "$project": {
        "sortByCustomKey": false,
        "convertedContentId": false,
        "ContentMetaCategory": {
          "attributes": false
        },
        "_id": true
      }
    },
    {
      "$facet": {
        "paginatedResults": [
          {
            "$internalFacetTeeConsumer": {}
          },
          {
            "$limit": 10
          }
        ],
        "queryDocumentCount": [
          {
            "$internalFacetTeeConsumer": {}
          },
          {
            "$group": {
              "_id": {
                "$const": ""
              },
              "queryCount": {
                "$sum": {
                  "$const": 1
                }
              }
            }
          },
          {
            "$project": {
              "queryCount": true,
              "_id": false
            }
          }
        ]
      }
    }
  ],
  "serverInfo": {
    "host": "Pradyumnas-MacBook-Pro.local",
    "port": 27017,
    "version": "7.0.2",
    "gitVersion": "02b3c655e1302209ef046da6ba3ef6749dd0b62a"
  },
  "serverParameters": {
    "internalQueryFacetBufferSizeBytes": 104857600,
    "internalQueryFacetMaxOutputDocSizeBytes": 104857600,
    "internalLookupStageIntermediateDocumentMaxSizeBytes": 104857600,
    "internalDocumentSourceGroupMaxMemoryBytes": 104857600,
    "internalQueryMaxBlockingSortMemoryUsageBytes": 104857600,
    "internalQueryProhibitBlockingMergeOnMongoS": 0,
    "internalQueryMaxAddToSetBytes": 104857600,
    "internalDocumentSourceSetWindowFieldsMaxMemoryBytes": 104857600,
    "internalQueryFrameworkControl": "trySbeEngine"
  },
  "command": {
    "aggregate": "Contents",
    "pipeline": [
      {
        "$match": {
          "state": {
            "$in": [
              "DRAFT",
              "PUBLISHED",
              "READY_FOR_PUBLISHING",
              "SCHEDULE_PUBLISH"
            ]
          },
          "isDeleted": false,
          "contentCategory": "TV_SHOW"
        }
      },
      {
        "$sort": {
          "createdAt": -1,
          "updatedAt": -1
        }
      },
      {
        "$group": {
          "_id": "$contentId",
          "latestVersion": {
            "$first": "$$ROOT"
          },
          "versionCount": {
            "$sum": 1
          }
        }
      },
      {
        "$replaceRoot": {
          "newRoot": {
            "$mergeObjects": [
              "$latestVersion",
              {
                "versionCount": "$versionCount"
              }
            ]
          }
        }
      },
      {
        "$lookup": {
          "from": "ContentMetaCategories",
          "localField": "_id",
          "foreignField": "contentId",
          "as": "ContentMetaCategory"
        }
      },
      {
        "$addFields": {
          "ContentMetaCategory": {
            "$map": {
              "input": "$ContentMetaCategory",
              "as": "meta",
              "in": {
                "$mergeObjects": [
                  "$$meta",
                  "$$meta.attributes"
                ]
              }
            }
          }
        }
      },
      {
        "$sort": {
          "createdAt": -1
        }
      },
      {
        "$project": {
          "ContentMetaCategory.attributes": 0,
          "sortByCustomKey": 0,
          "convertedContentId": 0
        }
      },
      {
        "$facet": {
          "paginatedResults": [
            {
              "$skip": 0
            },
            {
              "$limit": 10
            }
          ],
          "queryDocumentCount": [
            {
              "$count": "queryCount"
            }
          ]
        }
      }
    ],
    "cursor": {},
    "$db": "testingDb"
  },
  "ok": 1
}

It is at Atlas.

It is a cluster

Please read Formatting code and log snippets in posts and update your code and sample documents so that we can easily experiment with.

You’re grouping by contentId but not sorting by that. Try adding an index that’ll cover the initial match and sort and groupl.

I want data to group by contentId and sort it with latest date. I have added group by key and sort key in index also.

Hope now you are able to read my query.

You can run the explain query like this:

explain(“executionStats”)

Which will show the actual details of how it was executed, this will show details like how many documents entered a stage and keys hit / document fields that needed to be retrieved and can be useful to see possibly issues with queries.

Have you also tried commenting out stages to identify exactly where the issue lies?

If you want to group by ContentID and take the latest and make use of an index you’ll want to sort by the group field and the field you want to base the operation on, in the above case:
contentId, createdAt and updatedAt. If you sort by just the createdAt and UpdatedAt before you hit the group then the server cannot optimise the group based on the shape of data as it hits it.

Running a test collection locally with 1M docs and an index on:

categoryID:1
createDate:-1
updateDate:-1

The following takes 3 seconds or so:

db.getCollection('test2').aggregate([
{
    $sort:{
        createDate:-1,
        updateDate:-1
    }
},
{
    $group:{
        _id:'$categoryID',
        firstDoc:{$first:'$$ROOT'}
    }
},
]).toArray()

and the following takes a few ms:

db.getCollection('test2').aggregate([
{
    $sort:{
        categoryID:1,
        createDate:-1,
        updateDate:-1
    }
},
{
    $group:{
        _id:'$categoryID',
        firstDoc:{$first:'$$ROOT'}
    }
},
]).toArray()

In the second case, the $group stage can make use of the data as it enters as it’s in the best shape to group the data and the index was used to sort it.

In your case, you also have a $match before the sort, so adding those fields to the index should enable you to do at least the top part of the query quickly.

2 Likes

Yes I can. Thanks for the update.