Lab 2.2 Possibly incorrect answer

It seems there must be 5-th index, not 2-nd. With 5-th index the 2-nd query will work much faster:
actually, for 2-nd query with 2-nd index the only “address.state” of index will be used while with 5-th index both “address.state” and “job” fields of index will be used.
Also, if we look at counters on test set for this combination :
index 2 query 2 : 8,488,67 (Documents Returned, Index Keys Examined, Documents Examined)
index 5 query 2 : 8, 69, 67
Other queries for this indexes are almost equal

@ Andrey_97884

I’m sorry, but I’m having some problems following your question. If i understand you, you are saying that one of the proposed solutions (5th???) works better for the second query than any other index. However, the problem is to create one index for all three example queries that optimizes over all these queries, not just one.

In particular, the queries do not all look at the same fields, and the second query has a sort on ‘last_name’, which has significant impact on the choice, but the others do not.

I understand that the task is to create index that will be best for all 3 queries.
Here are counters for 2-nd and 5-th indexes for all queries from Compass dashboard execution plan (Documents Returned, Index Keys Examined, Documents Examined):
Index/Query
2/1 - 2,31,2
2/2 - 8,488,67
2/3 - 96, 716, 716

5/1 - 2,3,3
5/2 - 8, 69, 67
5/3 - 96, 716, 716

According to this counters from m201 database index 5 is better at all.
Moreover, for index first operation is filtering, then must be sorting. I.e., for query 2 we filter with two index fields for 5-th index but with 1 field with 2-nd index. I’m agree that for 2-nd index there will be sorting with index but it will be implemented on the set that is much larger than result to be sorted for 5-th index. That’s why it’s not quiet obvious for me… Moreover, including counters received on m201 database.

Probably there will be different result on the other data set (or with different parameter values), I’m not sure.
Please, could you explain why 2-nd index is better according to counters provided above ?
Probably I don’t understand task completely (maybe query speed must be estimated at all but not for this current exact dataset - for some parameters and different dataset index 2 probably will be faster is there will be selected much more data)

1 Like

@ Andrey_97884

You keep mentioning “5-th index” – but on my system the 5th entry (counting from 1 as the first) is

{ "job": 1, "address.state": 1, "first_name": 1 }

which obviously cannot be used by all the queries. So that cannot be the answer.

Are we talking about different indexes here??? LMK.

I see indexes in the following order :

  1. { “job”: 1, “address.state”: 1, “last_name”: 1 }
  2. { “address.state”: 1, “last_name”: 1, “job”: 1 }
  3. { “job”: 1, “address.state”: 1, “first_name”: 1 }
  4. { “job”: 1, “address.state”: 1 }
  5. { “address.state”: 1, “job”: 1 }
  6. { “address.state”: 1, “job”: 1, “first_name”: 1 }

Grading system shows that correct answer is 2. All my comments above are related to this order (This order is in the task. If there is some misunderstanding - sorry for this, I haven’t supposed that this items are displayed in the different order for different users)

@Andrey_97884

Ahh… that helps a lot. Thanks.

Why would you think that the 5th index (on your list) works at all for the second query?

db.people.find({
“job”: /^P/,
“first_name”: /^C/,
“address.state”: “Indiana”
}).sort({ “last_name”: 1 })

As you can see, the 2nd query has the shape “job.first_name.address.state” and a sort on “last_name”. Based on what you know, you can see that this index does not fit that shape, and requires an in-memory sort.

I suspect that you are much too focused on the documents numbers you show, and are not looking here at the system-wide requirements. Remember that, in real production systems, the number of documents often runs into the multiple millions, so an in-memory sort is very, very problematic. Also remember that the query analyzer can re-order predicates to improve performance. HTH.

Also, just BTW, I developed multiple testing web sites for a major international company and one of the key requirements was that the answers are always displayed in a different order for different users – just to keep folks from “copying” one another. :wink:

1 Like

Thanks for explanation !
I agree that for any data index 2 is preferrable at all. I was a little bit confused with the task - I thought that this task is focused first of all on analisys of exact query execution plans but as mentioned above it checks common index theory.

Hi,
Actually I analyzed the following index:
{ “address.state”: 1, “job”: 1, “first_name”: 1 }
And it looks from explain plan as a better option for all 3 queries:

Index 1: { “address.state”: 1, “last_name”: 1, “job”: 1 }
Index 2: { “address.state”: 1, “job”: 1, “first_name”: 1 }

Query db.people.find({
“address.state”: “Nebraska”,
“last_name”: /^G/,
“job”: “Police officer”
}).explain(“executionStats”)

With Index 1:
“nReturned” : 2,
“executionTimeMillis” : 2,
“totalKeysExamined” : 31,
“totalDocsExamined” : 2,

With Index 2:
“nReturned” : 2,
“executionTimeMillis” : 2,
“totalKeysExamined” : 3,
“totalDocsExamined” : 3,

Query db.people.find({
“job”: /^P/,
“first_name”: /^C/,
“address.state”: “Indiana”
}).sort({ “last_name”: 1 }).explain(“executionStats”)

With Index 1:
“nReturned” : 8,
“executionTimeMillis” : 5,
“totalKeysExamined” : 488,
“totalDocsExamined” : 67,

With Index 2:
“nReturned” : 8,
“executionTimeMillis” : 1,
“totalKeysExamined” : 59,
“totalDocsExamined” : 8,

Query db.people.find({ “address.state”: “Connecticut”,
“birthday”: {
“$gte”: ISODate(“2010-01-01T00:00:00.000Z”),
“$lt”: ISODate(“2011-01-01T00:00:00.000Z”)
}
}).explain(“executionStats”)

With Index 1:
“nReturned” : 96,
“executionTimeMillis” : 6,
“totalKeysExamined” : 716,
“totalDocsExamined” : 716,

With Index 2:
“nReturned” : 96,
“executionTimeMillis” : 5,
“totalKeysExamined” : 716,
“totalDocsExamined” : 716,

It is correct that for this particular query sorting is done in memory, but due to precise indexing the amount of sorted documents should not be that large (8 documents in the people collection).

So, my question is whether we should always strive to prevent in memory sorting, or we can take into consideration the amount of documents that will eventually be sorted?

Thanks.

1 Like

@ Regina_20763

As a general rule, you never, ever want to rely on in-memory sorts in a production environment. We’ll cover this more in Chapter 4.

But just to give you some flavor, one typical client of mine has a database with over 50M client documents averaging ~10K each and growing at ~5-6% per month; the people collection here has only ~50K documents with only 11 fields. Believe me, the last thing you want to risk in a production environment is (a) running out of memory or (b) pushing your working set out of memory for a sort.

4 Likes