Problem Description:
We are using mongoDB Community Version 4.0.14 version as our database.
HW sizing is 16 core CPU, 32 GB RAM and 1 TB HDD.
We have data stored in two collections coll_A ( from collection - 2GB) and coll_B (to collection 5 GB) on a single node mongodb server,
We have to join these two collection based on certain predefined conditions and out the data in third collection coll_result.
Appropriate Indexes are build on both the collections.
Each of the collection contains millions of records and we use the following pipeline stages to out the result records.
#1. $match runID (job id)
#2. $redact on multiple and / or conditions filter initial data and keep relevant
#3. $lookup on multiple and / or conditions join on the second collection
#4. $unwind result
#5. $redact on multiple and / or conditions filter all the data
#6. $match specific condition for system from which the data is polled. ( Data is polled from multiple downstream systems)
#7. $addFields
#8. $project
#9. $project with cursor - batchsize 50 , allowDiskUse: true
We have a few observations:
- There happens to be a cartesian product of the records from both the collections at stage #3 $lookup.
- After this lookup indexes are not applied on the resultant data.
- This operation essentially uses all the CPU processing power of 16 core CPU.
- HDD and SDD makes no difference on i/o operation resultant time.
- Altering cursor batch size has no impact on the processing.
Some time it takes more the 3 to 6 hours to get the output.
As the data in coll_A , coll_B increases the time keeps on increasing.
Need guidance for how to resolve the long query time problem.
Thanks & Regards