Lab - $facets -> What is it about?

After spending some time on this:

Chapter 4: Core Aggregation - Multidimensional Grouping
Lab - $facets

I did quite a bit of trial and errors to come to the conclusion that the following pipeline should work, giving us the base information needed along with the final solution. And looking at how the problem is formulated, I still think it is the way to go. It also complies to the reformulation found here.

db.movies.aggregate([
    {$facet: {
      imdb:[
          {$match:{"imdb.rating":{$gte:9.3}}},
          {$sort:{"imdb.rating":-1}},
          {$project:{_id:0,title:1,"imdb.rating":1}}
      ],
      metacritic:[
          {$match:{metacritic:{$gte:100}}},
          {$sort:{metacritic:-1}},
          {$project:{_id:0,title:1,metacritic:1}}
      ],
      imdb_metacritic:[
          {$match:{
              "imdb.rating":{$gte:9.3},
              metacritic:{$gte:100}
          }},
          {$project:{_id:0,title:1}}
      ]
    }
  }
]).pretty()

I am aware the number of elements in imdb and metacritic is going slightly over 10, but this is OK because looking at the values for imdb.rating and metacritic there is no way to decide on which element to eliminate. And more importantly the result number of elements in imdb_metacritic is 0.

But this answer is not in the list of choices.

After spending more time, reading some of the many discussion posts on the subject, I see that a number of people agree that this problem is nothing but clear. I have seen some hints, supposedly leading to “the solution”, but since they did not made much sense to me, I didn’t follow on that path.

For example, adding this to the pipeline I described above:

{$match: { "imdb.rating": { $exists: true } }},
{$match: { "metacritic": { $exists: true } }},

Regardless of knowing if it leads to “the solution” or not.

I don’t see on which ground I would be allowed to require:

{$match: { "imdb.rating": { $exists: true } }}

in order to later find the top 10 in the metacritic based classification.

And by the same token, why would one be allowed to require:

{$match: { "metacritic": { $exists: true } }}

in order to later find the top 10 in the imdb.rating based classification.

If someone can make sense of this lab, I hope to get some feedback.

1 Like

Hi Michel_Bouchet,
you need to:

  1. define the input sets for the $facet stage;
  2. extract the first 10 records with the highest imdb.ratings;
  3. extract the first 10 records with the highest metacritic, and
  4. project the records which are common to each of the above extractions.

In fact, the $exists command is not needed here in order to check the existence and then gather the records with highest ratings. You can use the $gte command to do the test right away. Documents not having the imdb.rating and/or metacritic fields would not be considered.

Before using the $facet you must define the input documents or subset of documents that will feed the $facet stage. So you need to specify the range of values for imdb.rating and metacritic in order to satisfy steps 2 and 3. There is no indication of a specific range in the lab statement so naturally you must consider all that is greater than zero.

Having both sets of records defined (steps 2 and 3) you can now use the $facet stage to gather the top 10 records with the highest values for imdb.rating and metacritic.

The last stage of the $aggregation pipeline will be the stage to select (project) the records from both sets of 10 documents having the same movie title.

Hope my explanation can clarify your doubts.
Jaime :coconut: :crab:

Yes, thanks a lot, that clarifies things, I will take another look at the issue with this new light.
But that also makes it clear that the problem is poorly stated from start.

Your important piece of information «Documents not having the imdb.rating and/or metacritic fields would not be considered.» makes it a different problem than what is presented in the lab.
(I presume the meaning here is “imdb.rating and metacritic”, in other words the document should have both to be considered).

You mention that «the $exists command is not needed …can use the $gte command to do the test». But this is not the point, the way the problem is presented, nothing should be needed.

The lab is presenting one problem and expects the answer to another one.

Anyway. Thanks again for your kind help!

I totally agree with you, the problem was poorly stated. I posted a rephrase:

ORIGINAL STATEMENT:
How many movies are in both the top ten highest rated movies according to
the imdb.rating and the metacritic fields? We should get these results with
exactly one access to the database.

REPHRASED:
How many movies the list of the top ten highest imdb.rating and the list
of the top 10 highest metacritic are common to both lists? We should get
these results with exactly one access to the database.

Documents not having the imdb.rating and/or metacritic fields would not be considered by any comparison command ($gt, $gte, $lt, $lte, etc) if the fields doest not exist inside a document, this is the way MongoDB works, this is basic knowledge. To check the existence of a field you use $exists. For the sake of the problem you do not need to consider using $exists because it is implicit that there are many documents having those fields so you may use any comparison command.

:stop_sign: Tip: Always know your data, that is, before coding a solution, do some analysis of the data you will be working on.

This is good to know that we agree, I hope the problem is going to be reformulated at some point. But to be honest and if you allow me to make a comment about your rephrasing I think it is still ambiguous. By two points:

  1. You write «and/or», this is ambiguous. AND and OR are different. Here what you want is clearly AND. In other words the documents not having imdb.rating field AND metacritic field, are not considered. Or if you prefer a different way to say the same thing:
    Documents not having imdb.rating field OR not having metacritic field, are not considered.

  2. You write «movies the list of the top ten highest imdb.rating and the list
    of the top 10 highest metacritic». The problem here is «movies the list of the top ten highest imdb.rating» does not imply anything about having a metacritic field or not; and this is exactly what makes the problem ambiguous. And of course the same thing can be said about «movies in the list of the top 10 highest metacritic» (it does not imply anything about having a imdb.rating field or not).

Here is IMHO, a much less ambiguous formulation:

Taking into account only movies being both imdb-rated and metacritic-rated; how many are in the top ten highest for the imdb.rating and also the top ten highest for the metacritic rating? We should get these results with exactly one access to the database.

2 Likes

Please note that “Documents not having the imdb.rating and/or metacritic fields would not be considered” is not ambiguos in the context of the explanation I was giving, but it is for the purpose of reformulating the problem statement. The $gte command and many others will just not consider the documents that do not have the argument field.

At some point during the processing you will have documents listed with both fields, documents with one field and documents with the other field. The final stage is where you will exract the documents with movies in common for both lists.

Your formulation looks good.

Another point and it should be added to the notes of the lab, is that a secondary sort field is required within each $facet . Without the second sort, you may get an intersection of 0.

I found that either title: 1 or the other rating field worked with my sample data.

e.g.

{“imdb.rating”: -1, metacritic: -1} in one facet
{metacritic: -1:, "imdb.rating: -1} in the other facet

Else use title: 1 as the secondary sort field.

Maybe it has something to do with my data in the Atlas cluster?

Hi @NeilM,

I would recommend updating the pipeline to use these sorts and then looking at individual results, before adding the $setIntersection stage in the pipeline.

The use of title field makes the sort consistent and it’s easier to find the movie by retaining this field in the pipeline.

Please feel free to reach out if you have any questions.

Kind Regards,
Sonali