What is most efficient way to get the data from a 10million data collections?

Pradyumna_Mishra · March 27, 2024, 6:06pm

I have a collection containing 10 million docs, each with a unique content ID. Many entries share the same content ID, but I need to retrieve only the most recent entries for each content ID. Although I’ve indexed the collection, the query still takes around 30 seconds to execute.
When I am adding group by it is taking too much time more than 1min. Now I am thinking mongo is not for group by option. Please suggest any one , if you have any suggestion

John_Sewell · March 27, 2024, 6:10pm

What does your query look like as well as a document?
What does an explain show?
What exact indexes have you created?
On-Prem or Atlas and if Atlas what tier?
Is it a cluster or single server?

Put a sample document in mongo playground so people can see exactly what you’re doing:

Pradyumna_Mishra · March 27, 2024, 6:29pm

[
  {
    $match: {
       "state":{$in:["DRAFT","PUBLISHED","READY_FOR_PUBLISHING","SCHEDULE_PUBLISH"]},
      "isDeleted":false,
      "contentCategory":"TV_SHOW" 
    },
  },
  { $sort: { createdAt: -1, updatedAt: -1 } },
  {
    $group: {
      _id: "$contentId",
      latestVersion: { $first: "$$ROOT" },
      versionCount: { $sum: 1 },
    },
  },
  {
    $replaceRoot: {
      newRoot: {
        $mergeObjects: [
          "$latestVersion",
          { versionCount: "$versionCount" },
        ],
      },
    },
  },

  {
    $lookup: {
      from: "ContentMetaCategories",
      localField: "_id",
      foreignField: "contentId",
      as: "ContentMetaCategory",
    },
  },
  {
    $addFields: {
      ContentMetaCategory: {
        $map: {
          input: "$ContentMetaCategory",
          as: "meta",
          in: {
            $mergeObjects: [
              "$$meta",
              "$$meta.attributes",
            ],
          },
        },
      },
    },
  },

  { $sort: { createdAt: -1 } },
  {
    $project: {
      "ContentMetaCategory.attributes": 0,
      sortByCustomKey: 0,
      convertedContentId: 0,
    },
  },
  {
    $facet: {
      paginatedResults: [
        { $skip: 0 },
        { $limit: 10 },
      ],
      queryDocumentCount: [
        { $count: "queryCount" },
      ],
    },
  },
]

explain is showing below

{
  "explainVersion": "2",
  "stages": [
    {
      "$cursor": {
        "queryPlanner": {
          "namespace": "testingDb.Contents",
          "indexFilterSet": false,
          "parsedQuery": {
            "$and": [
              {
                "contentCategory": {
                  "$eq": "TV_SHOW"
                }
              },
              {
                "isDeleted": {
                  "$eq": false
                }
              },
              {
                "state": {
                  "$in": [
                    "DRAFT",
                    "PUBLISHED",
                    "READY_FOR_PUBLISHING",
                    "SCHEDULE_PUBLISH"
                  ]
                }
              }
            ]
          },
          "queryHash": "CD1F4F0F",
          "planCacheKey": "7640F37A",
          "maxIndexedOrSolutionsReached": false,
          "maxIndexedAndSolutionsReached": false,
          "maxScansToExplodeReached": false,
          "winningPlan": {
            "queryPlan": {
              "stage": "GROUP",
              "planNodeId": 3,
              "inputStage": {
                "stage": "FETCH",
                "planNodeId": 2,
                "filter": {
                  "$and": [
                    {
                      "contentCategory": {
                        "$eq": "TV_SHOW"
                      }
                    },
                    {
                      "isDeleted": {
                        "$eq": false
                      }
                    },
                    {
                      "state": {
                        "$in": [
                          "DRAFT",
                          "PUBLISHED",
                          "READY_FOR_PUBLISHING",
                          "SCHEDULE_PUBLISH"
                        ]
                      }
                    }
                  ]
                },
                "inputStage": {
                  "stage": "IXSCAN",
                  "planNodeId": 1,
                  "keyPattern": {
                    "createdAt": -1,
                    "updatedAt": -1
                  },
                  "indexName": "createdAt_-1_updatedAt_-1",
                  "isMultiKey": false,
                  "multiKeyPaths": {
                    "createdAt": [],
                    "updatedAt": []
                  },
                  "isUnique": false,
                  "isSparse": false,
                  "isPartial": false,
                  "indexVersion": 2,
                  "direction": "forward",
                  "indexBounds": {
                    "createdAt": [
                      "[MaxKey, MinKey]"
                    ],
                    "updatedAt": [
                      "[MaxKey, MinKey]"
                    ]
                  }
                }
              }
            },
            "slotBasedPlan": {
              "slots": "$$RESULT=s23 env: { s3 = 1711564015655 (NOW), s17 = [\"PUBLISHED\", \"READY_FOR_PUBLISHING\", \"SCHEDULE_PUBLISH\", \"DRAFT\"], s16 = false, s2 = Nothing (SEARCH_META), s1 = TimeZoneDatabase(America/Argentina/San_Juan...Australia/Eucla) (timeZoneDB), s8 = {\"createdAt\" : -1, \"updatedAt\" : -1}, s15 = \"TV_SHOW\" }",
              "stages": "[3] mkobj s23 [_id = s18, latestVersion = s19, versionCount = s21] true false \n[3] group [s18] [s19 = first((s9 ?: null)), s21 = sum(1)] spillSlots[s20, s22] mergingExprs[first((s20 ?: null)), sum(s22)] \n[3] project [s18 = (s11 ?: null)] \n[2] filter {(traverseF(s12, lambda(l1.0) { ((l1.0 == s15) ?: false) }, false) && (traverseF(s13, lambda(l2.0) { ((l2.0 == s16) ?: false) }, false) && traverseF(s14, lambda(l3.0) { isMember(l3.0, s17) }, false)))} \n[2] nlj inner [] [s4, s5, s6, s7, s8] \n    left \n        [1] ixseek KS(0F0F0104) KS(F5F5FE04) s7 s4 s5 s6 lowPriority [] @\"0e866e4f-1945-43b4-86fd-9fd9c3b32302\" @\"createdAt_-1_updatedAt_-1\" true \n    right \n        [2] limit 1 \n        [2] seek s4 s9 s10 s5 s6 s7 s8 [s11 = contentId, s12 = contentCategory, s13 = isDeleted, s14 = state] @\"0e866e4f-1945-43b4-86fd-9fd9c3b32302\" true false \n"
            }
          },
          "rejectedPlans": []
        }
      }
    },
    {
      "$replaceRoot": {
        "newRoot": {
          "$mergeObjects": [
            "$latestVersion",
            {
              "versionCount": "$versionCount"
            }
          ]
        }
      }
    },
    {
      "$lookup": {
        "from": "ContentMetaCategories",
        "as": "ContentMetaCategory",
        "localField": "_id",
        "foreignField": "contentId"
      }
    },
    {
      "$addFields": {
        "ContentMetaCategory": {
          "$map": {
            "input": "$ContentMetaCategory",
            "as": "meta",
            "in": {
              "$mergeObjects": [
                "$$meta",
                "$$meta.attributes"
              ]
            }
          }
        }
      }
    },
    {
      "$sort": {
        "sortKey": {
          "createdAt": -1
        }
      }
    },
    {
      "$project": {
        "sortByCustomKey": false,
        "convertedContentId": false,
        "ContentMetaCategory": {
          "attributes": false
        },
        "_id": true
      }
    },
    {
      "$facet": {
        "paginatedResults": [
          {
            "$internalFacetTeeConsumer": {}
          },
          {
            "$limit": 10
          }
        ],
        "queryDocumentCount": [
          {
            "$internalFacetTeeConsumer": {}
          },
          {
            "$group": {
              "_id": {
                "$const": ""
              },
              "queryCount": {
                "$sum": {
                  "$const": 1
                }
              }
            }
          },
          {
            "$project": {
              "queryCount": true,
              "_id": false
            }
          }
        ]
      }
    }
  ],
  "serverInfo": {
    "host": "Pradyumnas-MacBook-Pro.local",
    "port": 27017,
    "version": "7.0.2",
    "gitVersion": "02b3c655e1302209ef046da6ba3ef6749dd0b62a"
  },
  "serverParameters": {
    "internalQueryFacetBufferSizeBytes": 104857600,
    "internalQueryFacetMaxOutputDocSizeBytes": 104857600,
    "internalLookupStageIntermediateDocumentMaxSizeBytes": 104857600,
    "internalDocumentSourceGroupMaxMemoryBytes": 104857600,
    "internalQueryMaxBlockingSortMemoryUsageBytes": 104857600,
    "internalQueryProhibitBlockingMergeOnMongoS": 0,
    "internalQueryMaxAddToSetBytes": 104857600,
    "internalDocumentSourceSetWindowFieldsMaxMemoryBytes": 104857600,
    "internalQueryFrameworkControl": "trySbeEngine"
  },
  "command": {
    "aggregate": "Contents",
    "pipeline": [
      {
        "$match": {
          "state": {
            "$in": [
              "DRAFT",
              "PUBLISHED",
              "READY_FOR_PUBLISHING",
              "SCHEDULE_PUBLISH"
            ]
          },
          "isDeleted": false,
          "contentCategory": "TV_SHOW"
        }
      },
      {
        "$sort": {
          "createdAt": -1,
          "updatedAt": -1
        }
      },
      {
        "$group": {
          "_id": "$contentId",
          "latestVersion": {
            "$first": "$$ROOT"
          },
          "versionCount": {
            "$sum": 1
          }
        }
      },
      {
        "$replaceRoot": {
          "newRoot": {
            "$mergeObjects": [
              "$latestVersion",
              {
                "versionCount": "$versionCount"
              }
            ]
          }
        }
      },
      {
        "$lookup": {
          "from": "ContentMetaCategories",
          "localField": "_id",
          "foreignField": "contentId",
          "as": "ContentMetaCategory"
        }
      },
      {
        "$addFields": {
          "ContentMetaCategory": {
            "$map": {
              "input": "$ContentMetaCategory",
              "as": "meta",
              "in": {
                "$mergeObjects": [
                  "$$meta",
                  "$$meta.attributes"
                ]
              }
            }
          }
        }
      },
      {
        "$sort": {
          "createdAt": -1
        }
      },
      {
        "$project": {
          "ContentMetaCategory.attributes": 0,
          "sortByCustomKey": 0,
          "convertedContentId": 0
        }
      },
      {
        "$facet": {
          "paginatedResults": [
            {
              "$skip": 0
            },
            {
              "$limit": 10
            }
          ],
          "queryDocumentCount": [
            {
              "$count": "queryCount"
            }
          ]
        }
      }
    ],
    "cursor": {},
    "$db": "testingDb"
  },
  "ok": 1
}

It is at Atlas.

It is a cluster

steevej · March 27, 2024, 6:30pm

Please read Formatting code and log snippets in posts and update your code and sample documents so that we can easily experiment with.

John_Sewell · March 27, 2024, 9:49pm

You’re grouping by contentId but not sorting by that. Try adding an index that’ll cover the initial match and sort and groupl.

Pradyumna_Mishra · March 28, 2024, 5:06am

I want data to group by contentId and sort it with latest date. I have added group by key and sort key in index also.

Pradyumna_Mishra · March 28, 2024, 5:07am

Hope now you are able to read my query.

John_Sewell · March 28, 2024, 7:13am

You can run the explain query like this:

explain(“executionStats”)

Which will show the actual details of how it was executed, this will show details like how many documents entered a stage and keys hit / document fields that needed to be retrieved and can be useful to see possibly issues with queries.

Have you also tried commenting out stages to identify exactly where the issue lies?

If you want to group by ContentID and take the latest and make use of an index you’ll want to sort by the group field and the field you want to base the operation on, in the above case:
contentId, createdAt and updatedAt. If you sort by just the createdAt and UpdatedAt before you hit the group then the server cannot optimise the group based on the shape of data as it hits it.

Running a test collection locally with 1M docs and an index on:

categoryID:1
createDate:-1
updateDate:-1

The following takes 3 seconds or so:

db.getCollection('test2').aggregate([
{
    $sort:{
        createDate:-1,
        updateDate:-1
    }
},
{
    $group:{
        _id:'$categoryID',
        firstDoc:{$first:'$$ROOT'}
    }
},
]).toArray()

and the following takes a few ms:

db.getCollection('test2').aggregate([
{
    $sort:{
        categoryID:1,
        createDate:-1,
        updateDate:-1
    }
},
{
    $group:{
        _id:'$categoryID',
        firstDoc:{$first:'$$ROOT'}
    }
},
]).toArray()

In the second case, the $group stage can make use of the data as it enters as it’s in the best shape to group the data and the index was used to sort it.

In your case, you also have a $match before the sort, so adding those fields to the index should enable you to do at least the top part of the query quickly.

steevej · March 28, 2024, 12:36pm

Yes I can. Thanks for the update.