Problem with $expr on multikey indexes

The thing I am trying to do, is to $group documents into different groups,
and in the meantime, get some other fields of each document.
For example, group by “location” (after $unwind so that each group key is
scalar instead of array). And then I want to also get the “color” of each
document.

{
  _id: ObjectId("5f3dd02a1b4d50831f1334d2")
  location: "SEA"
  color: 'red'
}
{
  _id: ObjectId("5f3dd02a1b4d50831f1336c6")
  location: "SEA"
  color: 'black'
}
{
  _id: ObjectId("5f3dd02a1b4d50831f133413")
  location: "WA"
  color: 'red'
}

The example result will be as follows:

{ "group_id" : "SEA", 
  "other_related_fields" : [ 
    { "_id" : ObjectId("5f3dd02a1b4d50831f1334d2"), "color" : "red" }, 
    { "_id" : ObjectId("5f3dd02a1b4d50831f1336c6"), "color" : "black" }, 
  ]
},
{ "group_id" : "WA", 
  "other_related_fields" : [ 
    { "_id" : ObjectId("5f3dd02a1b4d50831f133413"), "color" : "red" }, 
  ]
}

The query I used is as follows:

db.pen.aggregate([
{ "$unwind" : { "path" : "$location", "preserveNullAndEmptyArrays" : true } },
{ "$group" : { "_id" : { "_1" : "$location" } } },
{ "$lookup":
  { "from" : "pen",
    "let" : { "v1" : "$_id._1" },
    "pipeline" : [
      { "$match" : { "$expr" : { "$eq" : [ "$$v1", "$location" ] } } },
      { "$project" : { "color" : 1 } },
      { "$limit" : 2 }
    ],
    "as" : "other_field"
  }
},
{$project: { "_id": 0, "group_id": "$_id._1", "other_field": "$other_field" }}
])

What it does is to do the group-by first, which generates a separate
document for each group. And then do a $lookup on the same collection as
to join with those temporary group-by documents by matching the group_id.
To match the group_id, I have to use the $expr.

This works perfectly for specific data set. For example, if the value of
“location” is scalar-only. However, the performance degraded heavily when
things get more complicated.

  1. If the value of “location” is array instead of scalar, there are 2 potential
    solutions:
  • 1.1) One way is to use the $expr + $in to find documents which have
    “location” containing the value of the group_id. This solution turns
    out to be very bad in performance. It might be related to the fact
    that $expr + $in is not using index in $lookup pipeline.
    And according to this document
    (https://docs.mongodb.com/manual/core/index-multikey/#expr) $expr
    does not support multikey indexes.
    Following is the query:
Query 2
db.pen.aggregate([
{ "$unwind" : { "path" : "$location", "preserveNullAndEmptyArrays" : true } },
{ "$group" : { "_id" : { "_1" : "$location" } } },
{ "$lookup":
  { "from" : "pen",
    "let" : { "v1" : "$_id._1" },
    "pipeline" : [
      { "$match" : { "$expr" : { "$in" : [ "$$v1", "$location" ] } } },
      { "$project" : { "color" : 1 } }
    ],
    "as" : "other_field"
  }
},
{$project: { "_id": 0, "group_id": "$_id._1", "other_field": "$other_field" }}
])
  • 1.2) Another way is to $unwind the “location” in the $lookup, and use the
    $expr + $eq which works simiarly to the scalar type. (Query 3)

    However, this brings the problem that I cannot sort and limit the
    results based on the ‘other_field’. This is because after $unwind,
    the array field will be split into serveral individual documents.
    As a result I’ll have duplicate _id’s (or whatever unique key it is).
    For example, if I have 2 documents:

        D1: {location:[SEA, WA]}, {color:black},
        D2: {location: SEA}, {color:red}
    

    After $unwind, I’ll get:

        D1: {location:SEA}, {color:black}
        D1: {location:WA}, {color:black}
        D2: {loctation:SEA}, {color: red}
    

    If I $sort on “color” and $limit:2 in the $lookup subpipeline, I’ll
    get 2 D1’s instead of D1 and D2.

Query3
db.pen.aggregate([
{ "$unwind" : { "path" : "$location", "preserveNullAndEmptyArrays" : true } },
{ "$group" : { "_id" : { "_1" : "$location" } } },
{ "$lookup":
  { "from" : "pen",
    "let" : { "v1" : "$_id._1" },
    "pipeline" : [
      { "$unwind" : { "path" : "$location", "preserveNullAndEmptyArrays" : true } },
      { "$match" : { "$expr" : { "$eq" : [ "$$v1", "$location_sca" ] } } },
      { "$project" : { "color" : 1 } }
    ],
    "as" : "other_field"
  }
},
{$project: { "_id": 0, "group_id": "$_id._1", "other_field": "$other_field" }}
])
  1. Things get more complicated if “location” consists of both scalar and
    array values. To take care of both types, I use an $or operator and check
    the $type. (Basically a combination of scalar and array query).
    The performance is much worse than the 1) situation.
  2. If I need to do multi-dimensional grouping, the performance will be even
    worse since the number of groups will be much more. But it should be
    manageable if the index is properly being used by the query.

Thanks for reading my questions. To summarize:

  1. Is there any workaround for 1.1 (avoid using $expr + $in for array field)?
  2. Is there anyway to preprocess the result such that documents will be
    unique in 1.2?

I understand the question is a bit too long, but it is really important to me.
@slava would you please kindly share some thoughts on this ^^?

Hello, @Peng_Huang! Welcome to the community!

It seems, that you’ve over-complicated the things a bit :smiley:
Let me show you how it can be done in a more simpler way.

So, we have a sample dataset (with arrays and scalars):

db.test1.insertMany([
  {
    _id: 101,
    location: ['SEA', 'WA'],
    color: 'red',
  },
  {
    _id: 102,
    location: ['SEA'],
    color: 'blue',
  },
  {
    _id: 103,
    location: 'WA', // scalar
    color: 'green',
  },
]);

And if we need to get the result, similar to this:

[
  {
    "_id" : "SEA",
    "docs" : [
      { "_id" : 101, "color" : "red" },
      { "_id" : 102, "color" : "blue" }
    ]
  },
  {
    "_id" : "WA",
    "docs" : [
      { "_id" : 101, "color" : "red" },
      { "_id" : 103, "color" : "green" }
    ]
  }
]

We can use this simple aggregation pipeline:

db.test1.aggregate([
  {
    $unwind: '$location',
  },
  {
    $group: {
      _id: '$location',
      docs: {
        $push: {
          // mention here only the props,
          // that you need from each document
          _id: '$_id',
          color: '$color',
        },
      },
    },
  },
]).pretty();

Also, if you know, that your field will contain one or more elements, better to make it an array initially (or migrate it later) - this way it will be easier to write aggregations (no need to use conditionals for field type, for example) and perform update operations.

Hi Slava,

Thanks for replying! And yes, using the $push is perfect for some scenarios. The reason I did not use $push is that I need to sort and limit on the other fields (see the point 1.2). In our example, if I sort on the “color” and limit to 2 using $push, I need to get all the documents, and then do the sort and limit myself.
So it would be good if there is something I leverage in MongoDB. That’s why I get to the $lookup stage.

@Peng_Huang, it would be much easier for me to understand your problem, if you:

  • take my example dataset and extend it, so it would contain all your edge cases
  • provide the desired example of the output, that you want to get from an aggregation
1 Like