How to optimize indices for aggregate queries

I have a bunch of queries using the aggregation pipeline which now generate alerts on Atlas because the execution scans a lot of documents.

I haven’t been able to locate any good documentation for how to optimize indices for aggregations so I was hoping that someone in this forum can help.

My pipeline looks more or less like this for a document that has two level of subdocuments (yeah maybe not the best modeling in the world):

[
{ $match: level-1-selector},
{ $unwind: "$level1Field" },
{ $match: level-2-selector},
{ $unwind: "$level2Field" },
 {$match: level-3-selector}
]

so my specific question is how to create indices for the 3 different selectors? Does mongo merge the queries into one or should I create three different indices for the different $match-es?

If I remember correctly the lessons from the course M121 of MongoDB university, the indexes can only be used for the level-1-selector.

Thanks, so that means I could build a better query by including the sub-selectors in the first match?

[
{ $match: {...level-1-selector,...level-2-selector, ...level-3-selector} },
{ $unwind: “$level1Field” },
{ $match: level-2-selector},
{ $unwind: “$level2Field” },
{$match: level-3-selector}
]

If you can do your level-2(and 3)-selector) in the first $match it means they do not need the result of $unwind to do the filtering. Then you probably do not need they other $match stages. In practice you try to reduce the working set as soon as possible as it requires less memory. Even $project-ing in the used fields will help as less memory is used.

1 Like

Lets consider a document structure with arrays with sub-documents as elements. For example, a collection with these two documents:

{
        "_id" : 1,
        "fld1" : "str-1",
        "fld2" : [
                {
                        "a" : 1,
                        "b" : [
                                {
                                        "x" : 1,
                                        "y" : 2
                                },
                                {
                                        "x" : 11,
                                        "y" : 22
                                }
                        ]
                },
                {
                        "a" : 2,
                        "b" : [
                                {
                                        "x" : 111,
                                        "y" : 222
                                }
                        ]
                }
        ]
},
{
        "_id" : 2,
        "fld1" : "str-2",
        "fld2" : [
                {
                        "a" : 9,
                        "b" : [
                                {
                                        "x" : 91,
                                        "y" : 92
                                },
                                {
                                        "x" : 911,
                                        "y" : 922
                                }
                        ]
                }
        ]
}

The Query:

In an aggregation query, the $match+$project stage filtering and $match+$unwind+$match filtering work differently on a document. But, can produce similar output.

The following query is using $match+$unwind+$match filter. Note the output:

db.nests2.aggregate( [
  { $match: { fld1: "str-1" } },
  { $unwind: "$fld2" },
  { $match: { "fld2.a": 1 } },
] ).pretty()

{
        "_id" : 1,
        "fld1" : "str-1",
        "fld2" : {
                "a" : 1,
                "b" : [
                        {
                                "x" : 1,
                                "y" : 2
                        },
                        {
                                "x" : 11,
                                "y" : 22
                        }
                ]
        }
}

The following query is using $match (filter on multiple levels) + $project (using $addFields in the example).

db.nests2.aggregate( [
  { $match: { fld1: "str-1",  "fld2.a": 1 } },
  { $addFields: { fld2: {
                           $filter: {
                               input: "$fld2",
                               cond: {
                                   $eq: [ "$$this.a", 1 ]
                               }
                           }
  } } }
] ).pretty()

{
        "_id" : 1,
        "fld1" : "str-1",
        "fld2" : [
                {
                        "a" : 1,
                        "b" : [
                                {
                                        "x" : 1,
                                        "y" : 2
                                },
                                {
                                        "x" : 11,
                                        "y" : 22
                                }
                        ]
                }
        ]
}

Indexes:

If a compound index is defined on the keys { fld1 : 1, "fld2.a" : 1 }, this index is applied on both the queries. But, on the first $match stage only. This can be verified by generating a query plan using explain() on both queries.

( [EDIT ADD]: Even if an index is created on the single field { "fld2.a": 1 } it will not be used by the query).

Same goes as we go down multiple levels down. For example in the following case with $match and the index:

{ $match: { fld1: "str-1", "fld2.a": 1, "fld2.b.y": 22 } }
and the index, { fld1 : 1, "fld2.a" : 1, "fld2.b.y": 1 }

NOTES:

In general, indexes on the document fields are best utilized when the aggregation stages that use the index come early in the pipeline (mostly as the first stage as in this case). The stages $match and $sort use the indexes for performance. Also see documentation: Aggregation Pipeline Optimization.

Note that creating indexes on arrays (a.k.a. multikey indexes ) on large array fields can affect performance, as the indexes can grow very large.

1 Like