Split of chunks fails with "unable to save chunk ops"

Hi all,

I just created another sharded collection, which by default gets a single chunk on my primary shard “db_rs002”. I then deliberately ran a sequence of “sh.splitAt()” commands on this collection to split that initial chunk into 3600 new chunks.
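
For illustration, a pre-split loop of this kind could look like the mongo shell sketch below. This is not the exact script that was run: the function names and the split-point range 1..3599 are assumptions (3599 split points yield 3600 chunks), and it assumes the legacy mongo shell, where a failed “sh.splitAt()” comes back as a document with ok: 0.

```javascript
// Sketch only: function names and the 1..3599 range are assumptions,
// not the original script.

// Build the list of split points first..last (inclusive).
function splitPoints(first, last) {
  const points = [];
  for (let v = first; v <= last; v++) {
    points.push(v);
  }
  return points;
}

// Split the collection `ns` at each point, stopping at the first failure.
// Returns the failing result document, or null if every split succeeded.
function presplit(ns, shardKeyField, points) {
  for (const v of points) {
    print("splitting for " + v);
    const res = sh.splitAt(ns, { [shardKeyField]: v });
    if (res && res.ok === 0) {
      printjson(res);
      return res;
    }
  }
  return null;
}

// Only runs inside the mongo shell, where the `sh` helper is defined.
if (typeof sh !== "undefined") {
  presplit("cdrarch.cdr_af_20200913", "SHARD_MINSEC", splitPoints(1, 3599));
}
```

Stopping at the first failed split keeps the loop from piling further split attempts on top of chunk metadata that is already in an unexpected state.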

This process went fine for several weeks, but now it fails on the last split, for a new collection “cdr_af_20200913” with the following error:

...
splitting for 3599
{
    "ok" : 0,
    "errmsg" : "split failed :: caused by :: chunk operation commit failed: version 1|2||5f77ae0f5f721c6a0fa5412d doesn't exist in namespace: cdrarch.cdr_af_20200913. Unable to save chunk ops. Command: { applyOps: [ { op: \"u\", b: true, ns: \"config.chunks\", o: { _id: \"cdrarch.cdr_af_20200913-SHARD_MINSEC_MinKey\", lastmod: Timestamp(1, 1), lastmodEpoch: ObjectId('5f77ae0f5f721c6a0fa5412d'), ns: \"cdrarch.cdr_af_20200913\", min: { SHARD_MINSEC: MinKey }, max: { SHARD_MINSEC: 3599.0 }, shard: \"db_rs002\", history: [ { validAfter: Timestamp(1601678863, 5), shard: \"db_rs002\" } ] }, o2: { _id: \"cdrarch.cdr_af_20200913-SHARD_MINSEC_MinKey\" } }, { op: \"u\", b: true, ns: \"config.chunks\", o: { _id: \"cdrarch.cdr_af_20200913-SHARD_MINSEC_3599.0\", lastmod: Timestamp(1, 2), lastmodEpoch: ObjectId('5f77ae0f5f721c6a0fa5412d'), ns: \"cdrarch.cdr_af_20200913\", min: { SHARD_MINSEC: 3599.0 }, max: { SHARD_MINSEC: MaxKey }, shard: \"db_rs002\", history: [ { validAfter: Timestamp(1601678863, 5), shard: \"db_rs002\" } ] }, o2: { _id: \"cdrarch.cdr_af_20200913-SHARD_MINSEC_3599.0\" } } ], preCondition: [ { ns: \"config.chunks\", q: { query: { ns: \"cdrarch.cdr_af_20200913\", min: { SHARD_MINSEC: MinKey }, max: { SHARD_MINSEC: MaxKey } }, orderby: { lastmod: -1 } }, res: { lastmodEpoch: ObjectId('5f77ae0f5f721c6a0fa5412d'), shard: \"db_rs002\" } } ], writeConcern: { w: 1, wtimeout: 0 } }. 
Result: { applied: 1, code: 11000, codeName: \"DuplicateKey\", errmsg: \"E11000 duplicate key error collection: config.chunks index: ns_1_min_1 dup key: { ns: \"cdrarch.cdr_af_20200913\", min: { SHARD_MINSEC: MinKey } }\", results: [ false ], ok: 0.0, keyPattern: { ns: 1, min: 1 }, keyValue: { ns: \"cdrarch.cdr_af_20200913\", min: { SHARD_MINSEC: MinKey } }, $gleStats: { lastOpTime: { ts: Timestamp(1601679212, 23), t: 12 }, electionId: ObjectId('7fffffff000000000000000c') }, lastCommittedOpTime: Timestamp(1601679212, 23), $clusterTime: { clusterTime: Timestamp(1601679212, 23), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } }, operationTime: Timestamp(1601679212, 23) } :: caused by :: E11000 duplicate key error collection: config.chunks index: ns_1_min_1 dup key: { ns: \"cdrarch.cdr_af_20200913\", min: { SHARD_MINSEC: MinKey } }",
    "code" : 11000,
    "codeName" : "DuplicateKey",
    "keyPattern" : {
            "ns" : 1,
            "min" : 1
    },
    "keyValue" : {
            "ns" : "cdrarch.cdr_af_20200913",
            "min" : {
                    "SHARD_MINSEC" : { "$minKey" : 1 }
            }
    },
    "operationTime" : Timestamp(1601679209, 31),
    "$clusterTime" : {
            "clusterTime" : Timestamp(1601679212, 24),
            "signature" : {
                    "hash" : BinData(0,"xr0fqbLl5LvtpBTMfrb1vi4DUu0="),
                    "keyId" : NumberLong("6821045628173287455")
            }
    }
}

What can cause this “E11000 duplicate key error collection…” situation, and how can I fix it?
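
A first diagnostic step would be to look at what config.chunks already holds for this namespace, since the duplicate key is reported on the { ns, min } index with min at MinKey. The sketch below is a minimal example, assuming it is run through a mongos in the legacy mongo shell on a 4.4 cluster (where config.chunks documents still carry an “ns” field, as the error output shows); the helper name minKeyChunkFilter is made up for illustration.

```javascript
// Sketch only: minKeyChunkFilter is a hypothetical helper name.

// Build a filter matching the chunk(s) of `ns` whose lower bound on
// `shardKeyField` is MinKey, i.e. the key reported as duplicate.
function minKeyChunkFilter(ns, shardKeyField) {
  return { ns: ns, ["min." + shardKeyField]: { $type: "minKey" } };
}

// Only runs inside the mongo shell, where `db` is defined.
if (typeof db !== "undefined") {
  const ns = "cdrarch.cdr_af_20200913";
  const chunks = db.getSiblingDB("config").chunks;

  // How many chunks the config servers currently record for the collection.
  print("chunks recorded: " + chunks.countDocuments({ ns: ns }));

  // Show the existing chunk document(s) starting at MinKey.
  chunks.find(minKeyChunkFilter(ns, "SHARD_MINSEC")).forEach(printjson);
}
```

Comparing the “_id”, “max” and “lastmod” of the document found this way with the values in the failed applyOps command shows whether the split is trying to write a chunk that the config servers already have under a different “_id”.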

This seems related to https://jira.mongodb.org/browse/SERVER-40061. That issue is closed, but we are hitting a similar (or the same?) problem on MongoDB 4.4.1. We recently upgraded from 4.2.0, and are now hitting this issue.

OK, for lack of any feedback, I am continuing my own desperate attempts to get MongoDB usable again:

  1. We made sure that we still have the source files with the input data for our 4 TB MongoDB deployment, then dropped all existing collections in an attempt to discard all chunk registrations.
  2. I then created a new collection and started splitting its initial chunk into 3600 new chunks, to no avail.

The error message returned when the last chunk was being split:

 ...
 splitting for 3599
{
    "ok" : 0,
    "errmsg" : "split failed :: caused by :: chunk operation commit failed: version 1|2||5f7d9b275f721c6a0f3fa945 doesn't exist in namespace: cdrarch.cdr_mobi_20200927. Unable to save chunk ops. Command: { applyOps: [ { op: \"u\", b: true, ns: \"config.chunks\", o: { _id: \"cdrarch.cdr_mobi_20200927-SHARD_MINSEC_MinKey\", lastmod: Timestamp(1, 1), lastmodEpoch: ObjectId('5f7d9b275f721c6a0f3fa945'), ns: \"cdrarch.cdr_mobi_20200927\", min: { SHARD_MINSEC: MinKey }, max: { SHARD_MINSEC: 3599.0 }, shard: \"db_rs002\", history: [ { validAfter: Timestamp(1602067239, 5), shard: \"db_rs002\" } ] }, o2: { _id: \"cdrarch.cdr_mobi_20200927-SHARD_MINSEC_MinKey\" } }, { op: \"u\", b: true, ns: \"config.chunks\", o: { _id: \"cdrarch.cdr_mobi_20200927-SHARD_MINSEC_3599.0\", lastmod: Timestamp(1, 2), lastmodEpoch: ObjectId('5f7d9b275f721c6a0f3fa945'), ns: \"cdrarch.cdr_mobi_20200927\", min: { SHARD_MINSEC: 3599.0 }, max: { SHARD_MINSEC: MaxKey }, shard: \"db_rs002\", history: [ { validAfter: Timestamp(1602067239, 5), shard: \"db_rs002\" } ] }, o2: { _id: \"cdrarch.cdr_mobi_20200927-SHARD_MINSEC_3599.0\" } } ], preCondition: [ { ns: \"config.chunks\", q: { query: { ns: \"cdrarch.cdr_mobi_20200927\", min: { SHARD_MINSEC: MinKey }, max: { SHARD_MINSEC: MaxKey } }, orderby: { lastmod: -1 } }, res: { lastmodEpoch: ObjectId('5f7d9b275f721c6a0f3fa945'), shard: \"db_rs002\" } } ], writeConcern: { w: 1, wtimeout: 0 } }. 
Result: { applied: 1, code: 11000, codeName: \"DuplicateKey\", errmsg: \"E11000 duplicate key error collection: config.chunks index: ns_1_min_1 dup key: { ns: \"cdrarch.cdr_mobi_20200927\", min: { SHARD_MINSEC: MinKey } }\", results: [ false ], ok: 0.0, keyPattern: { ns: 1, min: 1 }, keyValue: { ns: \"cdrarch.cdr_mobi_20200927\", min: { SHARD_MINSEC: MinKey } }, $gleStats: { lastOpTime: { ts: Timestamp(1602067429, 63), t: 12 }, electionId: ObjectId('7fffffff000000000000000c') }, lastCommittedOpTime: Timestamp(1602067429, 63), $clusterTime: { clusterTime: Timestamp(1602067429, 63), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } }, operationTime: Timestamp(1602067429, 63) } :: caused by :: E11000 duplicate key error collection: config.chunks index: ns_1_min_1 dup key: { ns: \"cdrarch.cdr_mobi_20200927\", min: { SHARD_MINSEC: MinKey } }",
    "code" : 11000,
    "codeName" : "DuplicateKey",
    "keyPattern" : {
            "ns" : 1,
            "min" : 1
    },
    "keyValue" : {
            "ns" : "cdrarch.cdr_mobi_20200927",
            "min" : {
                    "SHARD_MINSEC" : { "$minKey" : 1 }
            }
    },
    "operationTime" : Timestamp(1602067422, 33),
    "$clusterTime" : {
            "clusterTime" : Timestamp(1602067429, 65),
            "signature" : {
                    "hash" : BinData(0,"E6blKGQrwEdQmky1zS1LY0txJGc="),
                    "keyId" : NumberLong("6821045628173287455")
            }
    }
}

-> getting more desperate now…