Why does one of these queries take minutes longer to resolve?

Why does one of these queries take minutes longer to resolve?

  • One takes minutes
  • One takes seconds

    db.collection.aggregate([
                {$match: {userId: id}},
                {$sort: {date: -1}},
                {$skip: skip},
                {$limit: limit},
                {
                    "$lookup": {
                        localField: "targetId",
                        foreignField: "_id",
                        from: 'users',
                        as: 'target',
                    }
                },
                {
                    "$project": {
                        'actorId': 1,
                        'date': 1,
                        'description': 1,
                        'points': 1,
                        'targetId': 1,
                        'userId': 1,
                        '_id': 1,
                        'target._id': 1,
                        'target.emails.address': 1,
                    }
                }])

    db.collection.aggregate([
                {$match: {userId: id}},
                    {$sort: {date: -1}},
                    {$skip: skip},
                    {$limit: limit},
                    {
                        $lookup: {
                            from: 'users',
                            pipeline: [
                                {
                                    $match: {
                                        $expr: {$eq: ["$_id", "$targetId"]},
                                    }
                                },
                                {
                                    $project: {
                                        _id: 1,
                                        'emails.address': 1,
                                    }
                                }
                            ],
                            as: 'target',
                        }
                    }
                ])

Hello @Zack_Beyer, it is little difficult to tell just by looking at the queries. I suggest you include some more details so that we can take a closer look at your question.

  1. Tell what indexes are defined on each of the the collections used in the aggregation.
  2. How many documents are there in each collection.
  3. Include sample document for each of the collections.
  4. The MongoDB version.

Take a look at this topic from the documentation; it has information about how an aggregation query is optimized to perform and suggestions for doing so: Aggregation Pipeline Optimization.

Finally, do generate a query plan using the explain method (the executionStats mode) on the aggregation query - the plan will tell how the indexes are used and what stage of the query has issues regarding the performance. And, include the generated plans also in your reply.

1 Like

Zack is a colleague of mine. I reviewed his code and I cannot justify the behavior. His post is much too vague. I’ll provide some extra details.

His aggregation is selecting docs from a collection of ~200K records keyed on “userId”. This typically returns 0-5 docs. It then does a $lookup to pull in the several “user” docs from the “users” collection that match the “targetId” which is the “_id” foreign key into the user collection. The users collection has ~10MM docs.

When the lookup is done thusly:

"$lookup": {
    localField: "targetId",
    foreignField: "_id",
    from: 'users',
    as: 'target',
}

It is fast. It “does the right thing” uses the _id index and returns FULL user docs for each match user.

When the lookup is done differently, to try and “optimize” the data SIZE and project the matched docs down to fewer fields, the query does NOT use the _id index and appears to be doing an index scan.

    $lookup: {
        from: 'users',
        pipeline: [
            {
                $match: {
                    $expr: {$eq: ["$_id", "$targetId"]},
                }
            },
            {
                $project: {
                    _id: 1,
                    'emails.address': 1,
                }
            }
        ],
        as: 'target',
    }

So, the only purpose for this latter syntax is to exploit a “projection” clause to limit to the only 2 user fields of interest.

My guess is this clause is the offending code.

                    $match: {
                        $expr: {$eq: ["$_id", "$targetId"]},
                    }

But, we could not simplify this clause. Nor can I understand why $expr of $eq on an indexed field is so slow.

Zack could not get any simplified version like this match query to work.
$match: {_id:"$targetId"}

Thanks!

Welcome to community Eric,
Aren’t you supposed to define "targetId" in a let field. Otherwise mongodb will look for "targetId" in the from collection i.e. users in your case. Thus the delay caused by the fetch operation. I wonder how this pipeline gives the right documents. You should try:

 $lookup: {
        from: "users",
        let: { id : "$targetId" },
        pipeline: [
            {
                $match: {
                    $expr: {$eq: ["$_id", "$$id"]},
                }
            },
            {
                $project: {
                    _id: 1,
                    'emails.address': 1,
                }
            }
        ],
        as: 'target',
    }

but nothing can be said for sure without the explain output

Regards,

1 Like

Zack used the let syntax in his original version… and then removed it as he tried to simplify the code. I believe the query worked in both cases. I do know that the performance was terrible, which is what started the investigation in the first place.

I think he also used the let syntax for the “simplified” match clause. No version of the “simplified” match ever gave correct results from the lookup. Although now that you mention the implications of the let… I am thinking he should give it another go with

match {_id:"$$id"}

But I do think he tried that syntax too. He showed me quite a few versions of code.

Regardless… how do we explain the terrible runtime of the lookup with a pipeline?

db.collection.explain().aggregate(<your pipeline>)

or better, following @Prasad_Saya’s suggestion:

db.collection.explain("executionStats").aggregate(<your pipeline>)

ps.

you can’t access variables defined in let without $expr. and it must be $$id to be interpreted as a variable. $id is interpreted as a field of the from collection

1 Like

Hello @Eric_Oemig,

The two code snippets of the $lookup stage are:

Case 1:

$lookup: {
    localField: "targetId",
    foreignField: "_id",
    from: 'users',
    as: 'target',
}

Case 2:

$lookup: {
    from: 'users',
    pipeline: [
        {
            $match: {
                $expr: {$eq: ["$_id", "$targetId"]},
            }
        },
        {
            $project: {
                _id: 1,
                'emails.address': 1,
            }
        }
    ],
    as: 'target',
}

The usage syntax of the Case 2 is not correct (as @Imad_Bouteraa has pointed in his reply earlier). Why, it is not correct?

This is because you cannot use the collection document field in the pipeline's $match stage of the $lookup - directly. You need to assign the field (the targetId) to a variable using the let, and then use that variable in the $match stage. It is a rule as described in Join Conditions and Uncorrelated Sub-queries and this example in the documentation. So, the correct syntax would be as follows:

$lookup: {
    from: 'users',
    let: { targetIdVar: "$targetId" },
    pipeline: [
        {
            $match: {
                $expr: { $eq: [ "$_id", "$$targetIdVar" ] },
            }
        },
        {
            $project: {
                'emails.address': 1,
            }
        }
    ],
    as: 'target',
}

Also, in the pipeline you have a $project stage. In thiis projection,

$project: {
    _id: 1,
    'emails.address': 1,
}

the _id: 1 is not required, as the _id field is selected automatically.


So, the query is looking for a non-existing variable in the '$match' stage and so the related problems, I think. Please run the query with correct syntax, as shown above, and let us know what are the results.