Group on two fields that point to the same collection

I have a collection Sale, with these fields:

{
  buyer,
  seller,
  // ...lots of other fields, like:
  project,
  product,
  creationDate,
}

Both seller and buyer are ObjectIds that reference a User collection:

{
  name,
  emails: [],
  // etc.
}

Buyers and sellers can be involved in multiple Sales.
My goal is to get a list of the unique users attached to a set of Sales, sortable by name and by email (the last one in the emails array).

I know how to use $lookup to populate the buyer and seller from the User collection. And if it were only one user field in Sale, I would use $group to get a unique list (roughly the sketch below).
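
For example, for a single user field, I would do something like this (a sketch; assuming the User collection is named 'users'):

// collect the distinct buyer ids, then join the matching user documents
const singleFieldPipeline = [
  { $group: { _id: null, buyersIds: { $addToSet: '$buyer' } } },
  {
    $lookup: {
      from: 'users',
      localField: 'buyersIds',
      foreignField: '_id',
      as: 'users',
    },
  },
];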

I have also tried fetching the buyers and sellers in separate queries and then using JavaScript to build a list of unique ids, but this takes far too long.

Is there an efficient way to do this with an aggregation?

Here is a sample aggregation that you can use to sort your buyer & seller users by user.name (or other fields):

const pipeline = [
  // first, group ids of buyers and sellers
  {
    $group: {
      _id: null,
      buyersIds: {
        $addToSet: '$buyer',
      },
      sellersIds: {
        $addToSet: '$seller',
      },
    },
  },
  // then, concat grouped ids into single array 
  // to fetch users with one single $lookup
  // ids in the array will not be unique, 
  // but that does not matter for $lookup
 
  // PS: you can $unwind 'usersIds'
  // and then $group them like we did with 'buyersIds' above,
  // if the duplicate values in 'usersIds' bother you :)
  {
    $project: {
      usersIds: {
        $concatArrays: ['$buyersIds', '$sellersIds'],
      },
    },
  },
  // use $lookup to join users by ids from accumulated array
  {
    $lookup: {
      from: 'users', // assuming the User collection is named 'users'
      localField: 'usersIds',
      foreignField: '_id',
      as: 'users',
    },
  },
  // unwind 'users' to bring each user out to the top level
  {
    $unwind: '$users',
  },
  // make each user a root object in the pipeline array
  {
    $replaceRoot: {
      newRoot: '$users',
    },
  },
  // sort user objects, like you want
  {
    $sort: {
      name: 1,
    },
  },
];
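
To run it, pass the pipeline to your driver or the shell, for example (assuming your Sale collection is named 'sales'):

// Node.js driver usage; the collection name 'sales' is an assumption
const users = await db.collection('sales').aggregate(pipeline).toArray();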

To be able to sort by a user email that lies inside the array, you will need to replace the $sort stage in the above aggregation with this:

// deconstruct your emails array to be able to sort by its values
{
  $unwind: '$emails',
},
// do the sort
{
  $sort: {
    name: 1,
    emails: -1,
  },
},
// reconstruct the user objects as they were before $unwind
{
  $group: {
    _id: '$_id',
    emails: {
      $push: '$emails',
    },
    // ...and do the same for the rest of the fields:
    sampleField: {
      $first: '$sampleField',
    },
    // ...
  },
},

This will work, but it may not be as performant as you would like :slight_smile:

Instead, consider adding redundancy to your documents:

{
  // note that each embedded _id must match the _id in the
  // collection that you would otherwise do the $lookup from
  buyer: { _id: ObjectId('...'), emails: [], name: 'Rob' },
  seller: { _id: ObjectId('...'), emails: [], name: 'Bob' },
  //  ... other fields
}

The above structure will allow you to sort buyers and sellers very quickly, without $lookup or any aggregation at all. But you will still not be able to sort by the 'emails' array.

With the above structure you can do one of the following:
a) add a 'primaryEmail' field to the buyer and seller objects; then you can sort by both fields, 'name' and 'primaryEmail' (see the sketch below). Additionally, you can use a regex to search for users whose 'name'/'email' contains a given search string.
b) simply filter users by a specified email with the $in operator and then sort by the 'name' field.
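
A minimal sketch of option a), assuming the embedded buyer/seller structure above, the extra 'primaryEmail' field, and a collection named 'sales':

// sort directly on the embedded fields; no $lookup or aggregation needed
db.sales
  .find({ 'buyer.name': { $regex: 'rob', $options: 'i' } })
  .sort({ 'buyer.name': 1, 'buyer.primaryEmail': 1 });

// a compound index supports these sorts
db.sales.createIndex({ 'buyer.name': 1, 'buyer.primaryEmail': 1 });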

Thanks a lot Slava!

This is what I have now:

    const users = await Sale.aggregate([
      { $match: { project: { $in: projectIds } } },
      {
        $group: {
          _id: null,
          buyersIds: {
            $addToSet: '$buyer',
          },
          sellersIds: {
            $addToSet: '$seller',
          },
        },
      },
      {
        $project: {
          usersIds: {
            $concatArrays: ['$buyersIds', '$sellersIds'],
          },
        },
      },
      { $unwind: '$usersIds' },
      { $group: { _id: null, usersIds: { $addToSet: '$usersIds' } } },
      {
        $lookup: {
          from: 'users',
          localField: 'usersIds',
          foreignField: '_id',
          as: 'users',
        },
      },
      {
        $unwind: '$users',
      },
      {
        $replaceRoot: {
          newRoot: '$users',
        },
      },
      {
        $project: {
          name: 1,
          email: { $arrayElemAt: ['$emails', -1] },
        },
      },
      { $sort: { email: 1 } },
      { $skip: 10000 },
      { $limit: 10 },
    ])

I went with a different approach for the email sorting, and it is performant enough for now.

One thing I'd like to double-check with you is what you said about the uniqueness. I added the lines you suggested, but since you said it jokingly, I am a bit confused. Do you mean those lines are optional, and uniqueness is somehow ensured further down the pipeline?

My joke referred to $unwind'ing the array of emails, sorting, and then grouping (too much work for MongoDB).
You avoided that by selecting a single email to sort by, not the whole array. So, it is OK.

Another thing: why not write that email to a separate field in the document, like I suggested?
That way you would need to select it only once, when you write a doc to the collection (plus whenever that email changes). Currently you select that email for each of the 10,000 documents (per your aggregation example) on every read. That means, if you run this query 100 times per hour, MongoDB will do the exact same job 100 * 10,000 = 1,000,000 times per hour :smile:
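
For example, a hypothetical write that keeps such a field in sync (the field name 'lastEmail' is just for illustration):

// append a new email and update the denormalized field in one write;
// 'lastEmail' is a hypothetical field name
await db.collection('users').updateOne(
  { _id: userId },
  { $push: { emails: newEmail }, $set: { lastEmail: newEmail } }
);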

And if you need the $skip and $limit for pagination, why not put them at the beginning of the aggregation?
That way MongoDB would do everything that currently precedes the $skip stage for only 10 documents, not 10,000 :wink:

As for the uniqueness of the ids, I do not think you will gain any performance by making the items in the array unique. IMHO, it will not make a big difference for $lookup, but you will spend some computation time on the $unwind and $group stages to deduplicate them. And, of course, more stages = more code to read & maintain.

That makes sense, but that would require quite some refactoring in the API and the frontend. I'll make a note of it.

I tried it, but I get very different results. The list of users needs to go through the sorting stage first, before selecting a sublist to return to the client.

My concern here is not so much performance as making sure the user list contains no duplicates when the data is presented to the end user.