Get field unique values with wildcard index

Hi

I have a wildcard index over the whole collection.
I’m trying to get a list of the unique values of a field.
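
For context, the index was created along these lines (a sketch; the collection name mycol is taken from the shell command further down the thread, and the exact index definition is an assumption):

    // Wildcard index on every field of the collection (assumed definition)
    db.mycol.createIndex({ "$**": 1 })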

I tried to use this pipeline:

    "pipeline": [
          {
            "$group": {
              "uniqueValues": {
                "$addToSet": "$project"
              },
              "_id": null
            }
          }
        ]

But it was super slow. Then I read that, in order to use the field’s index, I should add a $sort stage before the $group, so I tried this one:

    "pipeline": [
          {
            "$sort": {
              "project": 1
            }
          },
          {
            "$group": {
              "uniqueValues": {
                "$addToSet": "$project"
              },
              "_id": null
            }
          }
        ]

But it is still not using the index ("planSummary": [ { "COLLSCAN": {} } ]), and now I get "errmsg": "Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting."

How can I get the unique values and use the wildcard index?

Thank you

Hi @Mordechai_Ben_Zechar,

Your grouping does not use a specific _id, so it will result in a collection scan.

The correct way is to group on the relevant field to get the distinct values:

    "pipeline": [
      {
        "$sort": {
          "project": 1
        }
      },
      {
        "$group": {
          "_id": "$project"
        }
      }
    ]

Best
Pavel

Hi @Pavel_Duchovny, and thank you for your answer, but it is still not working and not using the wildcard index:

    db.mycol.aggregate([ { "$sort": { "project": 1 } }, { "$group": { "_id": "$project" } } ])
    2020-08-23T14:36:53.282+0300 E QUERY    [js] Error: command failed: {
            "operationTime" : Timestamp(1598182604, 27),
            "ok" : 0,
            "errmsg" : "Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting.",
            "code" : 16819,
            "codeName" : "Location16819",
            "$clusterTime" : {
                    "clusterTime" : Timestamp(1598182604, 27),
                    "signature" : {
                            "hash" : BinData(0,"hZjsV/H8hEr6LK1l0RtfZzT7jB0="),
                            "keyId" : NumberLong("6811153382187728897")
                    }
            }
    }
    : aggregate failed :

Hi @Mordechai_Ben_Zechar,

You need to set allowDiskUse: true if any of the stages crosses the 100MB execution-memory limit.
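
In the shell that would look roughly like this (a sketch reusing the pipeline and the mycol collection from your command above):

    // Opt in to external (disk-based) sorting when a stage exceeds the 100MB limit
    db.mycol.aggregate(
      [
        { "$sort": { "project": 1 } },
        { "$group": { "_id": "$project" } }
      ],
      { "allowDiskUse": true }
    )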

Please note that a full index scan is usually less efficient than a collection scan; collection scans are optimised by design (for example, for the initial sync of replica sets) and can be very performant.
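
If you want to confirm which plan is chosen, you can check the explain output (a sketch, again assuming the mycol collection; a COLLSCAN in the winning plan means the wildcard index is not being used):

    // Inspect the plan the server picks for the distinct-values pipeline
    db.mycol.explain().aggregate([
      { "$sort": { "project": 1 } },
      { "$group": { "_id": "$project" } }
    ])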

Best
Pavel
