Unbalanced data distribution between shards

Hi,
We have a quite big collection that is replicated and sharded (2 shards, 1 replica set per shard).
The data is unbalanced between the 2 shards:

mongos> db.getCollection('cases').getShardDistribution()

Shard shard1 at shard1/172.23.138.32:27017
 data : 1363.79GiB docs : 79164325 chunks : 171591
 estimated data per chunk : 8.13MiB
 estimated docs per chunk : 461

Shard shard2 at shard2/opmongodbas4:27017
 data : 4522.08GiB docs : 376106878 chunks : 171589
 estimated data per chunk : 26.98MiB
 estimated docs per chunk : 2191

Totals
 data : 5885.88GiB docs : 455271203 chunks : 343180
 Shard shard1 contains 23.17% data, 17.38% docs in cluster, avg obj size on shard : 18KiB
 Shard shard2 contains 76.82% data, 82.61% docs in cluster, avg obj size on shard : 12KiB
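
A quick sanity check in plain JavaScript (just arithmetic on the numbers above, nothing MongoDB-specific): the chunk counts are almost perfectly balanced even though the data is not, which hints that the balancer's metric and the actual byte distribution have diverged.

```javascript
// Numbers taken from the getShardDistribution() output above.
const shard1 = { dataGiB: 1363.79, docs: 79164325, chunks: 171591 };
const shard2 = { dataGiB: 4522.08, docs: 376106878, chunks: 171589 };

// Percentage of the cluster total held by shard1.
const pct = (a, b) => (100 * a / (a + b)).toFixed(2);

console.log(pct(shard1.dataGiB, shard2.dataGiB)); // 23.17 -> shard1 holds under a quarter of the data
console.log(pct(shard1.chunks, shard2.chunks));   // 50.00 -> but exactly half of the chunks
```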

The shard key is {context: 1, action: 1, creation: 1}
context has very low cardinality (much of the data shares the same value)

action has medium cardinality (a few hundred values)
creation is the creation timestamp of the document

Any suggestions on where to dig?

Hi @FRANCK_LEFEBURE and welcome to the MongoDB community :muscle: !

I’m wondering if your shard key is growing monotonically here. Are your write operations distributed evenly across all the chunks?
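
To illustrate the concern (a simplified model, not MongoDB's actual routing code): with range-based sharding, a monotonically increasing key always falls into the current top chunk, so new writes keep landing on whichever shard holds that chunk.

```javascript
// Simplified range router: each chunk owns a half-open [min, max) interval
// over the shard-key value; a monotonically increasing key always routes
// to the final "max" chunk, so inserts concentrate on one shard.
function routeToChunk(chunks, key) {
  return chunks.find(c => key >= c.min && key < c.max);
}

const chunks = [
  { min: 0,    max: 1000,     shard: 'shard1' },
  { min: 1000, max: Infinity, shard: 'shard1' }, // the top chunk
];

// Ever-increasing timestamps all hit the top chunk on shard1:
console.log([2000, 3000, 4000].map(t => routeToChunk(chunks, t).shard));
```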

Some good doc:

Also, is the balancer enabled? Is it running smoothly?

sh.status() can probably give you this information.

Cheers,
Maxime.

I tried a small chunk merge.
E.g. on this small sub-dataset:

db.getCollection('chunks').find({ns : 'softbridge4.cases', min:{
    '$gte':{context:'orange', actionId:191, creation: ISODate('2019-06-17T03:41:41.000-0400')},
    '$lte':{context:'orange', actionId:191, creation: ISODate('2019-06-24T03:21:05.000-0400')}
    }}).sort({ns:1, min:1}) 

That corresponds to 4 contiguous chunks on the same shard:

    softbridge4.cases-context_"orange"actionId_191creation_new Date(1560757301000),shard1,36932,2
    softbridge4.cases-context_"orange"actionId_191creation_new Date(1561357092000),shard1,0,0
    softbridge4.cases-context_"orange"actionId_191creation_new Date(1561359957000),shard1,0,0
    softbridge4.cases-context_"orange"actionId_191creation_new Date(1561360865000),shard1,553980,30

With this command (balancer stopped):

db.adminCommand( {
   mergeChunks: "softbridge4.cases",
   bounds: [ {context:'orange', actionId:191, creation: ISODate('2019-06-17T03:41:41.000-0400')},
             {context:'orange', actionId:191, creation: ISODate('2019-06-24T03:21:05.000-0400')} ]
} )

But I end up with this error:

Failed to commit chunk merge :: caused by :: DuplicateKey: chunk operation commit failed: version 97732|17||5d5efd91753aa982feb1ecfb doesn't exist in namespace: softbridge4.cases. Unable to save chunk ops. Command: { applyOps: [ { op: "u", b: false, ns: "config.chunks", o: { _id: "softbridge4.cases-context_"orange"actionId_191.0creation_new Date(1560757301000)", ns: "softbridge4.cases", min: { context: "orange", actionId: 191.0, creation: new Date(1560757301000) }, max: { context: "orange", actionId: 191, creation: new Date(1561360865000) }, shard: "shard1", lastmod: Timestamp(97732, 17), lastmodEpoch: ObjectId('5d5efd91753aa982feb1ecfb') }, o2: { _id: "softbridge4.cases-context_"orange"actionId_191.0creation_new Date(1560757301000)" } }, { op: "d", ns: "config.chunks", o: { _id: "softbridge4.cases-context_"orange"actionId_191creation_new Date(1561357092000)" } }, { op: "d", ns: "config.chunks", o: { _id: "softbridge4.cases-context_"orange"actionId_191creation_new Date(1561359957000)" } } ], preCondition: [ { ns: "config.chunks", q: { query: { ns: "softbridge4.cases", min: { context: "orange", actionId: 191.0, creation: new Date(1560757301000) }, max: { context: "orange", actionId: 191, creation: new Date(1561357092000) } }, orderby: { lastmod: -1 } }, res: { lastmodEpoch: ObjectId('5d5efd91753aa982feb1ecfb'), shard: "shard1" } }, { ns: "config.chunks", q: { query: { ns: "softbridge4.cases", min: { context: "orange", actionId: 191, creation: new Date(1561357092000) }, max: { context: "orange", actionId: 191, creation: new Date(1561359957000) } }, orderby: { lastmod: -1 } }, res: { lastmodEpoch: ObjectId('5d5efd91753aa982feb1ecfb'), shard: "shard1" } }, { ns: "config.chunks", q: { query: { ns: "softbridge4.cases", min: { context: "orange", actionId: 191, creation: new Date(1561359957000) }, max: { context: "orange", actionId: 191, creation: new Date(1561360865000) } }, orderby: { lastmod: -1 } }, res: { lastmodEpoch: ObjectId('5d5efd91753aa982feb1ecfb'), shard: 
"shard1" } } ], writeConcern: { w: 0, wtimeout: 0 } }. Result: { applied: 1, code: 11000, codeName: "DuplicateKey", errmsg: "E11000 duplicate key error collection: config.chunks index: ns_1_min_1 dup key: { : "softbridge4.cases", : { context: "orange", actionId: 191.0, creat...", results: [ false ], ok: 0.0, operationTime: Timestamp(1600798453, 1609), $gleStats: { lastOpTime: { ts: Timestamp(1600798453, 1609), t: 6 }, electionId: ObjectId('7fffffff0000000000000006') }, $clusterTime: { clusterTime: Timestamp(1600798453, 2328), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } } } :: caused by :: E11000 duplicate key error collection: config.chunks index: ns_1_min_1 dup key: { : "softbridge4.cases", : { context: "orange", actionId: 191.0, creation: new Date(1560757301000) } }

Hi Maxime, thanks for your quick reply,

The last part of the composite shard key (the timestamp) grows monotonically. The write operations seem evenly distributed.
The shard key was originally designed to optimise reads (every costly read request includes criteria on the 3 shard-key fields). It seems effective.
At this point I’d rather not think about resharding, because it would be a real pain.

I’ve dug into it a little since yesterday and I understand the problem a bit better.

We do a lot of updates/deletes on this collection.
For some reason, documents with one particular “action” value are disproportionately affected by deletes.
Each shard has ~170,000 chunks.
~130,000 chunks on shard1 hold documents with the same value action=191.
Of these 130,000 chunks, ~124,000 are empty. That explains the imbalance: as I understand it, the balancer only looks at the number of chunks on each shard, not their size.
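
A minimal model of that balancing rule (an assumption based on the pre-4.x behaviour, not the actual balancer source): if migrations are only triggered by the difference in chunk counts, two shards with equal counts look balanced no matter how many of those chunks are empty.

```javascript
// Hypothetical sketch of a count-based balancing check: a migration round
// is only triggered when the chunk-count difference crosses a threshold
// (8 is an assumption here), so empty chunks weigh as much as full ones.
function needsBalancing(chunksPerShard, threshold = 8) {
  const counts = Object.values(chunksPerShard);
  return Math.max(...counts) - Math.min(...counts) >= threshold;
}

console.log(needsBalancing({ shard1: 171591, shard2: 171589 })); // false: nothing to move
console.log(needsBalancing({ shard1: 100, shard2: 0 }));         // true: counts diverge
```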

My first point of confusion is why all the chunks with the same “action” value stay on the same shard:

softbridge4.cases-context_"orange"actionId_191creation_new Date(1600775145000),shard1,0,0
softbridge4.cases-context_"orange"actionId_191creation_new Date(1600775525000),shard1,0,0
softbridge4.cases-context_"orange"actionId_191creation_new Date(1600775740000),shard1,0,0
softbridge4.cases-context_"orange"actionId_191creation_new Date(1600775885000),shard1,0,0
softbridge4.cases-context_"orange"actionId_191creation_new Date(1600776045000),shard1,0,0

Why aren’t the different creation dates distributed across both shards? Is it a consequence of the monotonic increase of these values? What is the root cause of this placement?

To correct the imbalance, I’m planning to:

  • stop the balancer
  • write & execute a script to delete empty chunks and to merge small ones
  • start the balancer
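
The merge step of that script could be sketched like this (pure JavaScript with hypothetical helper names; it only computes the bounds, and uses plain numbers where real bounds would be shard-key documents — the actual mergeChunks calls against the cluster are left out):

```javascript
// Given chunks sorted by min (as read from config.chunks), with a
// precomputed document count per chunk, group each run of 2+ contiguous
// empty chunks into [min, max] bounds usable with the mergeChunks command.
function emptyChunkRanges(sortedChunks) {
  const ranges = [];
  let run = [];
  const flush = () => {
    if (run.length >= 2) ranges.push([run[0].min, run[run.length - 1].max]);
    run = [];
  };
  for (const c of sortedChunks) {
    if (c.docs === 0) run.push(c);
    else flush();
  }
  flush();
  return ranges;
}

const chunks = [
  { min: 0, max: 1, docs: 5 },
  { min: 1, max: 2, docs: 0 },
  { min: 2, max: 3, docs: 0 },
  { min: 3, max: 4, docs: 0 },
  { min: 4, max: 5, docs: 9 },
];
console.log(emptyChunkRanges(chunks)); // [ [ 1, 4 ] ]
```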

After these operations, the number of chunks on shard1 will be lower, maybe ~50,000.
Can I hope that the balancer will redistribute the chunks evenly after this operation?
Or should I expect to write a job to move some chunks manually?

Franck

A missing piece of information: the cluster was running 3.6.13.
It was upgraded today to 3.6.20 (there were some tickets in Jira related to chunks),
but this upgrade didn’t fix the mergeChunks problem.
Tomorrow I will try to reproduce the problem on a dev server.

Well, I think I have a workaround for the mergeChunks problem.
I think it is somehow related to a BSON de/serialization problem.

I’ve set up a sharded database with a simpler shard key: {actionId: 1}
The actionId values are integers.

If I execute a command like this one in a mongo shell or in Robo3T:

db.adminCommand( {
   mergeChunks: "test.cases",
   bounds: [ {actionId: 191},
             {actionId: 250} ]
} )

Then the command fails with a DuplicateKey error.

Now, if I execute a script like this one:

db.getSiblingDB("config").chunks.find({ns: 'test.cases', min: {actionId: 191}}).forEach(function(chunk) {chunk1 = chunk;});
db.getSiblingDB("config").chunks.find({ns: 'test.cases', min: {actionId: 250}}).forEach(function(chunk) {chunk2 = chunk;});
db.adminCommand( { mergeChunks: 'test.cases', bounds: [chunk1.min, chunk2.min] } );

Then the execution succeeds.
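
This lines up with a detail in the error message above: the failing commit mixes actionId: 191.0 (the shell parses a bare 191 as a double) with actionId: 191, whereas fetching the stored min documents back from config.chunks reuses exactly the types already on disk. A toy illustration of why that can collide (hypothetical model, not MongoDB internals):

```javascript
// Toy model: the chunk _id string is built from the literal key values,
// while the unique {ns, min} index compares numbers numerically. A double
// 191.0 therefore yields a *different* _id but a *duplicate* index key.
const chunkId = (field) => `cases-actionId_${field.repr}`;

const stored  = { repr: '191',   value: 191 }; // int32 as stored in config.chunks
const literal = { repr: '191.0', value: 191 }; // double typed into the shell

console.log(chunkId(stored) === chunkId(literal)); // false: looks like a new document
console.log(stored.value === literal.value);       // true: same index key -> E11000
```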

I think there is at least a documentation bug here

I don’t know exactly what’s happening with the mergeChunks command. It could also be something that was fixed in later versions; I would recommend upgrading if possible.

Regarding the issue with the empty / almost-empty chunks: the only way to solve the problem is to merge them so the balancer can balance non-empty chunks correctly.

When you delete data massively and it dramatically changes the way your data is split across chunks, you need to consider running a few mergeChunks commands to consolidate the newly empty chunks.

There is no other way to solve this issue short of completely redesigning the shard key and the way the documents are organised & stored in the cluster.

Cheers,
Maxime.

I came to the same conclusion.
A chunk-merging job, based on mongo/sharding_utils.js at master · vmenajr/mongo · GitHub, is currently running on the cluster.
As for the possible bug, I will give the latest 4.4 a try and open a Jira ticket if it’s still relevant.
Thanks
