Multiple index vs single index in multiple columns

Hello guys. I am using the bucket pattern for time-series data. My data looks like this:

{'_id': ObjectId('6068da8878fa2e568c42c7f1'),
 'first': datetime.datetime(2018, 1, 24, 14, 5),
 'last': datetime.datetime(2018, 1, 24, 15, 5),
 'maxid13': 12.5,
 'minid13': 7.5,
 'nsamples': 13,
 'samples': [{'c14': 'C',
              'id1': 3758.0,
              'id10': 0.0,
              'id11': 274.0,
              'id12': 0.0,
              'id13': 7.5,
              'id15': 0.0,
              'id16': 73.0,
              'id17': 0.0,
              'id18': 0.342,
              'id19': 6.3,
              'id20': 1206.0,
              'id21': 0.0,
              'id22': 0.87,
              'id23': 0.0,
              'id6': 2.0,
              'id7': -79.09,
              'id8': 35.97,
              'id9': 5.8,
              'timestamp1': datetime.datetime(2018, 1, 24, 14, 5),
              'timestamp2': datetime.datetime(2018, 1, 24, 9, 5)},
             {'c14': 'C',
              'id1': 3758.0,
              'id10': 0.0,
              'id11': 288.0,
              'id12': 0.0,
              'id13': 8.4,
              'id15': 0.0,
              'id16': 71.0,
              'id17': 0.0,
              'id18': 0.342,
              'id19': 6.3,
              'id20': 1207.0,
              'id21': 0.0,
              'id22': 0.69,
              'id23': 0.0,
              'id6': 2.0,
              'id7': -79.09,
              'id8': 35.97,
              'id9': 6.2,
              'timestamp1': datetime.datetime(2018, 1, 24, 14, 10),
              'timestamp2': datetime.datetime(2018, 1, 24, 9, 10)},
               .
               .
               .
               .

In most cases, in the match stage I use first together with samples.id13 and samples.id9, and in some other queries I use samples.timestamp1 with samples.id13 and samples.id9. In the sort stage I sometimes use first. Should I create 4 single-field indexes on them, or one compound index (with the correct order)?

Hi @harris,

A compound index is almost always better than index intersection: index intersections are usually not well ranked by the query planner, which will usually prefer almost any other plan.

I don’t know which query parameters here are ranges and which are equalities, so I can’t make a concrete proposal. But I’d recommend following the ESR (Equality, Sort, Range) rule to make sure that you avoid in-memory sorts and scan as few index entries as possible.
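To make the ESR rule concrete, here is a hedged sketch using field names from this schema; the query shape below is hypothetical and not taken from the question:

```python
# Hypothetical query: equality on samples.id13, sort on first,
# range on samples.id9, e.g.:
#   find({"samples.id13": 7.5, "samples.id9": {"$gt": 5}}).sort("first", 1)
#
# ESR orders the compound index keys Equality -> Sort -> Range
# (1 is the same value as pymongo.ASCENDING):
esr_index_keys = [
    ("samples.id13", 1),  # Equality: exact-match predicate first
    ("first", 1),         # Sort: sort key second, to avoid an in-memory sort
    ("samples.id9", 1),   # Range: inequality predicate last
]
```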

Cheers,
Maxime.

Thank you for your help! @MaBeuLux88
So, for example, if we have this query:

cursor = mydb1.mongodbbuckethour.aggregate([
    {
        "$match": {
            "first": {
                "$gte": datetime.strptime("2010-01-01 00:05:00", "%Y-%m-%d %H:%M:%S"),
                "$lte": datetime.strptime("2020-12-31 23:05:00", "%Y-%m-%d %H:%M:%S"),
            }
        }
    },
    {"$unwind": "$samples"},
    {
        "$match": {
            "first": {
                "$gte": datetime.strptime("2010-01-01 00:05:00", "%Y-%m-%d %H:%M:%S"),
                "$lte": datetime.strptime("2020-12-31 23:05:00", "%Y-%m-%d %H:%M:%S"),
            }
        }
    },
    {
        "$group": {
            "_id": {"$dateToString": {"format": "%Y-%m-%d ", "date": "$first"}},
            "avg_id13": {"$avg": "$samples.id13"},
        }
    },
    {
        "$project": {
            "_id": 0,
            "day": "$_id",
            "avg_id13": 1,
        }
    },
    {"$sort": {"day": -1}},
])

So, with the ESR rule, should I create a single-field index on first, because I use it both in the sort and in a range?

And for this query:

cursor = mydb1.mongodbbuckethour.aggregate([
    {
        "$match": {
            "first": {
                "$gte": datetime.strptime("2010-01-01 00:05:00", "%Y-%m-%d %H:%M:%S"),
                "$lte": datetime.strptime("2020-12-31 23:05:00", "%Y-%m-%d %H:%M:%S"),
            },
            "samples.id13": {"$gt": 5},
        }
    },
    {"$unwind": "$samples"},
    {
        "$match": {
            "first": {
                "$gte": datetime.strptime("2010-01-01 00:05:00", "%Y-%m-%d %H:%M:%S"),
                "$lte": datetime.strptime("2020-12-31 23:05:00", "%Y-%m-%d %H:%M:%S"),
            },
            "samples.id13": {"$gt": 5},
        }
    },
    {
        "$group": {
            "_id": {"$dateToString": {"format": "%Y-%m-%d ", "date": "$first"}},
            "avg_id13": {"$avg": "$samples.id13"},
        }
    },
    {
        "$project": {
            "_id": 0,
            "day": "$_id",
            "avg_id13": 1,
        }
    },
    {"$sort": {"day": -1}},
])

Would it be better to use a compound index, with samples.id13 for the equality and first for the range and the sort?

In the first pipeline, the 3rd stage ($match) is redundant with the first one. It isn’t removing any docs from the pipeline after the $unwind stage.

In the second pipeline, though, it makes sense: the first $match checks that at least one of the sub-documents in the samples array has id13 > 5 (so you have fewer docs to unwind). Then, after the $unwind, you want to eliminate all the sub-docs that don’t respect this constraint, so you have to repeat that condition, which this time is applied to every (sub) doc. So the second pipeline is: match (date, id13) => unwind => match (id13). Repeating the date filter doesn’t add anything after the unwind, as all the docs at this stage are already in the range.

Now, I didn’t know you were using the aggregation pipeline and the use of indexes is actually a bit different here.

In your case, once you have passed the $unwind stage, you can’t use indexes anymore. So an index on {"first": 1, "samples.id13": 1} is the optimal index. Both pipelines can use it; the first pipeline will only use the first part of it, though.
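As a sketch, that index could be created with pymongo like this; the client/database/collection names are copied from the question and are assumptions about your setup:

```python
# Compound index key specification, in pymongo's (field, direction) form.
# 1 is the same value as pymongo.ASCENDING.
index_keys = [("first", 1), ("samples.id13", 1)]

# With a live connection, this would be:
# mydb1.mongodbbuckethour.create_index(index_keys)
```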

I would also recommend making $project the very last stage. If it is done earlier in the pipeline, it can prevent automatic optimizations.

Cheers,
Maxime.
