MongoDB.live, free & fully virtual. June 9th - 10th. Register Now

Queries take way too long for processing sometimes days and return no result

Problem Description:

We are using mongoDB Community Version 4.0.14 version as our database.
HW sizing is 16 core CPU, 32 GB RAM and 1 TB HDD.
We have data stored in two collections coll_A ( from collection - 2GB) and coll_B (to collection 5 GB) on a single node mongodb server,
We have to join these two collection based on certain predefined conditions and out the data in third collection coll_result.

Appropriate Indexes are build on both the collections.
Each of the collection contains millions of records and we use the following pipeline stages to out the result records.
#1. $match runID (job id)
#2. $redact on multiple and / or conditions filter initial data and keep relevant
#3. $lookup on multiple and / or conditions join on the second collection
#4. $unwind result
#5. $redact on multiple and / or conditions filter all the data
#6. $match specific condition for system from which the data is polled. ( Data is polled from multiple downstream systems)
#7. $addFields
#8. $project
#9. $project with cursor - batchsize 50 , allowDiskUse: true

We have a few observations:

  1. There happens to be a cartesian product of the records from both the collections at stage #3 $lookup.
  2. After this lookup indexes are not applied on the resultant data.
  3. This operation essentially uses all the CPU processing power of 16 core CPU.
  4. HDD and SDD makes no difference on i/o operation resultant time.
  5. Altering cursor batch size has no impact on the processing.

Some time it takes more the 3 to 6 hours to get the output.
As the data in coll_A , coll_B increases the time keeps on increasing.

Need guidance for how to resolve the long query time problem.

Sample aggregation pipeline

Thanks & Regards

1 Like

Hi @Ranjit_Kolte, welcome!

You could try to simplify the multiple $or on stage #2. For example, it could be written similarly as:

$cond: [ 
    { $or: [ 
        { $eq:[ "$Masterid", "" ]}, 
        { $eq:[ "$VKORG", "" ]}, 
        { $eq:[ "$VTWEG", "" ]}, 
        { $eq:[ "$SPART", "" ]}, 
        { $eq:[ "$KUNNR", "" ]}, 
        { $eq:[ "$VBELN", "" ]}, 
        { $eq:[ "$POSNR", "" ]}, 
        { $eq:[ "$UDATE", "" ]} 
    ]}, 
"$$PRUNE", "$$KEEP" ]

Similarly in the other parts of the pipeline. For example, in the $lookup pipeline match you could simplify the nested $and as well. Looking at the $lookup stage it looks like it could benefit from a single reference field. This should simplify matching ~12 fields into one field.
Also, the $cond on stage #7 is likely redundant, as on stage #6, the $match should already filtered cases for the null condition.

Having said all the above, I’d recommend to reconsider the schema of the collections. Especially to look for ways to remove the double $redact stages, and potentially simplifying/eliminating the $lookup stage. Please see Building With Patterns: A Summary to see how different schema patterns may be suitable for your use case.

I would also suggest to use explain to view detailed information regarding the execution plan of the aggregation pipeline. This should help you to create the appropriate indexes. See also Return Information on Aggregation Pipeline operation.

Regards,
Wan.

1 Like