The thing I am trying to do, is to $group documents into different groups,
and in the meantime, get some other fields of each document.
For example, group by “location” (after $unwind so that each group key is
scalar instead of array). And then I want to also get the “color” of each
document.
{
_id: ObjectId("5f3dd02a1b4d50831f1334d2")
location: "SEA"
color: 'red'
}
{
_id: ObjectId("5f3dd02a1b4d50831f1336c6")
location: "SEA"
color: 'black'
}
{
_id: ObjectId("5f3dd02a1b4d50831f133413")
location: "WA"
color: 'red'
}
The example result will be as follows:
{ "group_id" : "SEA",
"other_related_fields" : [
{ "_id" : ObjectId("5f3dd02a1b4d50831f1334d2"), "color" : "red" },
{ "_id" : ObjectId("5f3dd02a1b4d50831f1336c6"), "color" : "black" },
]
},
{ "group_id" : "WA",
"other_related_fields" : [
{ "_id" : ObjectId("5f3dd02a1b4d50831f133413"), "color" : "red" },
]
}
The query I used is as follows:
db.pen.aggregate([
{ "$unwind" : { "path" : "$location", "preserveNullAndEmptyArrays" : true } },
{ "$group" : { "_id" : { "_1" : "$location" } } },
{ "$lookup":
{ "from" : "pen",
"let" : { "v1" : "$_id._1" },
"pipeline" : [
{ "$match" : { "$expr" : { "$eq" : [ "$$v1", "$location" ] } } },
{ "$project" : { "color" : 1 } },
{ "$limit" : 2 }
],
"as" : "other_field"
}
},
{$project: { "_id": 0, "group_id": "$_id._1", "other_field": "$other_field" }}
])
What it does is to do the group-by first, which generates a separate
document for each group. And then do a $lookup on the same collection as
to join with those temporary group-by documents by matching the group_id.
To match the group_id, I have to use the $expr.
This works perfectly for specific data set. For example, if the value of
“location” is scalar-only. However, the performance degraded heavily when
things get more complicated.
- If the value of “location” is array instead of scalar, there are 2 potential
solutions:
- 1.1) One way is to use the $expr + $in to find documents which have
“location” containing the value of the group_id. This solution turns
out to be very bad in performance. It might be related to the fact
that $expr + $in is not using index in $lookup pipeline.
And according to this document
(https://docs.mongodb.com/manual/core/index-multikey/#expr) $expr
does not support multikey indexes.
Following is the query:
Query 2
db.pen.aggregate([
{ "$unwind" : { "path" : "$location", "preserveNullAndEmptyArrays" : true } },
{ "$group" : { "_id" : { "_1" : "$location" } } },
{ "$lookup":
{ "from" : "pen",
"let" : { "v1" : "$_id._1" },
"pipeline" : [
{ "$match" : { "$expr" : { "$in" : [ "$$v1", "$location" ] } } },
{ "$project" : { "color" : 1 } }
],
"as" : "other_field"
}
},
{$project: { "_id": 0, "group_id": "$_id._1", "other_field": "$other_field" }}
])
-
1.2) Another way is to $unwind the “location” in the $lookup, and use the
$expr + $eq which works simiarly to the scalar type. (Query 3)However, this brings the problem that I cannot sort and limit the
results based on the ‘other_field’. This is because after $unwind,
the array field will be split into serveral individual documents.
As a result I’ll have duplicate_id
’s (or whatever unique key it is).
For example, if I have 2 documents:D1: {location:[SEA, WA]}, {color:black}, D2: {location: SEA}, {color:red}
After $unwind, I’ll get:
D1: {location:SEA}, {color:black} D1: {location:WA}, {color:black} D2: {loctation:SEA}, {color: red}
If I $sort on “color” and $limit:2 in the $lookup subpipeline, I’ll
get 2 D1’s instead of D1 and D2.
Query3
db.pen.aggregate([
{ "$unwind" : { "path" : "$location", "preserveNullAndEmptyArrays" : true } },
{ "$group" : { "_id" : { "_1" : "$location" } } },
{ "$lookup":
{ "from" : "pen",
"let" : { "v1" : "$_id._1" },
"pipeline" : [
{ "$unwind" : { "path" : "$location", "preserveNullAndEmptyArrays" : true } },
{ "$match" : { "$expr" : { "$eq" : [ "$$v1", "$location_sca" ] } } },
{ "$project" : { "color" : 1 } }
],
"as" : "other_field"
}
},
{$project: { "_id": 0, "group_id": "$_id._1", "other_field": "$other_field" }}
])
- Things get more complicated if “location” consists of both scalar and
array values. To take care of both types, I use an $or operator and check
the $type. (Basically a combination of scalar and array query).
The performance is much worse than the 1) situation. - If I need to do multi-dimensional grouping, the performance will be even
worse since the number of groups will be much more. But it should be
manageable if the index is properly being used by the query.
Thanks for reading my questions. To summarize:
- Is there any workaround for 1.1 (avoid using $expr + $in for array field)?
- Is there anyway to preprocess the result such that documents will be
unique in 1.2?