Finding duplicate documents before creating unique index

Vinay_reddy_Mamedi · July 3, 2020, 10:57am

I added a unique key on the name field. After that I dropped the indexes, and now when I try to create them again I get a duplicate key error because the collection contains duplicate documents. How can I find the duplicate documents? Please help me resolve this issue.

slava · July 3, 2020, 1:39pm

Welcome to the community, @Vinay_reddy_Mamedi!

Let’s assume we have this data in collection ‘test1’:

db.test1.insertMany([
  { _id: 1, val: 'A', },
  { _id: 2, val: 'B', },
  { _id: 3, val: 'C', },
  { _id: 4, val: 'A', },
])

Then, to find duplicates we can use this aggregation:

db.test1.aggregate([
  {
    $group: {
      // group by the key ('val' prop in this case) and collect the ids
      // of all documents that share the same value for it
      _id: '$val',
      ids: {
        $push: '$_id'
      },
      // count N of duplications per key
      totalIds: {
        $sum: 1,
      }
    }
  },
  {
    $match: {
      // match only documents with duplicated value in a key
      totalIds: {
        $gt: 1,
      },
    },
  },
  {
    $project: {
      _id: false,
      documentsThatHaveDuplicatedValue: '$ids',
    }
  },
]);

This will output ids:

{ "documentsThatHaveDuplicatedValue" : [ 1, 4 ] }
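
Since the original question is about a name field rather than 'val', here is a minimal sketch of the same pipeline built for an arbitrary field. The helper name buildDuplicatesPipeline and the collection name yourCollection are made up for illustration; only the stages themselves come from the aggregation above.

```javascript
// Hypothetical helper (not part of MongoDB): returns the same three stages
// as the aggregation above, but grouped on whatever field name is passed in.
function buildDuplicatesPipeline(field) {
  return [
    {
      $group: {
        _id: '$' + field,        // group documents by the value of the field
        ids: { $push: '$_id' },  // collect the _id of every document in the group
        totalIds: { $sum: 1 },   // count documents per group
      },
    },
    // keep only groups that contain more than one document
    { $match: { totalIds: { $gt: 1 } } },
    { $project: { _id: false, documentsThatHaveDuplicatedValue: '$ids' } },
  ];
}

const pipeline = buildDuplicatesPipeline('name');

// In the mongo shell this could then be run as (collection name is assumed):
//   db.yourCollection.aggregate(pipeline)
```

Building the stages in a function keeps the field name in one place, which helps if you need to check several fields before adding unique indexes.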

It is also possible to join the full documents that have duplicated values, if just the ids are not enough for you.
You can do this by adding a $lookup stage at the end of the pipeline:

{
  $lookup: {
    // note, you need to use same collection name here
    from: 'test1', 
    localField: 'documentsThatHaveDuplicatedValue',
    foreignField: '_id',
    as: 'documentsThatHaveDuplicatedValue'
  }
}

Output, after adding the $lookup stage:

{
  "documentsThatHaveDuplicatedValue": [
    {
      "_id" : 1,
      "val" : "A"
    },
    {
      "_id" : 4,
      "val" : "A"
    }
  ]
}
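
Once the duplicates are found, one possible next step is to decide which documents to remove so that the unique index can be created. The sketch below works on the ids-only output (the form without $lookup) and assumes you want to keep the first id in each group and drop the rest. That rule is only an assumption: the order of ids inside a group is not guaranteed without an explicit $sort, so review the list before deleting anything. The helper name idsToRemove is made up for illustration.

```javascript
// Hypothetical helper: takes the ids-only aggregation output and returns
// every id except the first one of each group.
function idsToRemove(groups) {
  const toRemove = [];
  for (const group of groups) {
    // skip the first id (the one we keep), collect the remaining ones
    const [, ...rest] = group.documentsThatHaveDuplicatedValue;
    toRemove.push(...rest);
  }
  return toRemove;
}

// Sample input copied from the ids-only output shown earlier in this thread
const groups = [{ documentsThatHaveDuplicatedValue: [1, 4] }];
const extraIds = idsToRemove(groups);
console.log(extraIds); // [ 4 ]

// In the mongo shell, after checking the list, something like this could follow:
//   db.test1.deleteMany({ _id: { $in: extraIds } })
//   db.test1.createIndex({ val: 1 }, { unique: true })
```

Deleting is destructive, so it is worth taking a backup or exporting the affected documents first.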

Prasad_Saya · July 4, 2020, 7:19am

@Vinay_reddy_Mamedi and @slava,

This is a recent post on StackOverflow.com with a similar question and an answer. From the post’s answer:

Assume a collection whose documents have a name field (using name instead of url) that contains duplicate values. I have two aggregations that return output which can be used for further processing. I hope you will find this useful.
…