Chunk movements stopped going quickly

hi gurus,

  • we were using v4.2.8 and have now upgraded to 4.4.0
  • we disabled the automatic chunk balancer
  • we pre-create collections, split them into 3600 chunks, then do manual chunk movements across 36 shards (each shard a replica set of 3 servers); a simplified sketch of the move loop is below
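
For reference, this is roughly how we drive those manual moves from the “mongo” shell. It is only a simplified sketch: the real script computes the target shard differently, and the shard names and shard-key values here are illustrative only.

// simplified sketch of our manual move loop, run against a mongos
// (the round-robin shard assignment and names db_rs000..db_rs035 are illustrative)
var coll = "cdrarch.cdr_af_20200830";
for (var sec = 0; sec < 3600; sec++) {
    var target = "db_rs" + ("000" + (sec % 36)).slice(-3);   // e.g. db_rs000 .. db_rs035
    var res = db.adminCommand({
        moveChunk: coll,
        find: { SHARD_MINSEC: sec },
        to: target,
        _waitForDelete: true
    });
    if (res.ok != 1) {
        print("move of chunk " + sec + " to " + target + " failed: " + tojson(res));
    }
}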

We had a situation where manual chunk movements were started while data loading was busy at the same time. The MongoDB cluster ground to a halt…
So we interrupted the “mongo” client from which the chunk movements had been initiated.

Now, with a completely idle cluster (no data loading, nada happening on it), we restarted those chunk movements for an entirely new, thus empty (zero documents), collection. And it goes slowly! One chunk moved every 15 minutes or so.
And no, the servers underneath are not busy doing other things. It’s dedicated hardware (12 servers), local storage, plenty of RAM. None of their resources are exhausted. The network is 1 Gbps NICs, and ping round trips are sub-millisecond.

How can we find the cause of those super slow chunk movements? Sure, we can raise the logging verbosity to any desired level, but the log then contains tons of messages about things going well… so that is not of great help unless we know in advance where to look for ‘the’ problem.
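
For what it’s worth, I assume the way to do that selectively would be to raise only the sharding-related log component on the donor and recipient primaries, something like this, rather than raising the global verbosity:

// raise only the "sharding" log component to level 2, leaving the rest at default
db.setLogLevel(2, "sharding")
// or, equivalently, via setParameter:
db.adminCommand({ setParameter: 1, logComponentVerbosity: { sharding: { verbosity: 2 } } })

But even then we would still be scanning lots of routine messages without knowing which one points at the actual slowdown.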

E.g., the messages we get in the “mongod.log” of the primary server of the receiving shard look like this (one every second):

{"t":{"$date":"2020-08-27T10:44:02.290+02:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn1258","msg":"Slow query","attr":{"type":"command","ns":"admin.$cmd","command":{"_recvChunkStatus":"cdrarch.cdr_af_20200830","waitForSteadyOrDone":true,"sessionId":"db_rs002_db_rs035_5f476ff3da305f92f9c94d31","$clusterTime":{"clusterTime":{"$timestamp":{"t":1598517840,"i":2}},"signature":{"hash":{"$binary":{"base64":"ih0J82H8GKLsFMBDFj7Xoa0DJho=","subType":"0"}},"keyId":6821045628173287455}},"$configServerState":{"opTime":{"ts":{"$timestamp":{"t":1598517840,"i":1}},"t":9}},"$db":"admin"},"numYields":0,"reslen":734,"locks":{},"protocol":"op_msg","durationMillis":1001}}
{"t":{"$date":"2020-08-27T10:44:03.294+02:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn1258","msg":"Slow query","attr":{"type":"command","ns":"admin.$cmd","command":{"_recvChunkStatus":"cdrarch.cdr_af_20200830","waitForSteadyOrDone":true,"sessionId":"db_rs002_db_rs035_5f476ff3da305f92f9c94d31","$clusterTime":{"clusterTime":{"$timestamp":{"t":1598517841,"i":1}},"signature":{"hash":{"$binary":{"base64":"rd++AdPb36F7OEENJuunnvxzLEA=","subType":"0"}},"keyId":6821045628173287455}},"$configServerState":{"opTime":{"ts":{"$timestamp":{"t":1598517841,"i":1}},"t":9}},"$db":"admin"},"numYields":0,"reslen":734,"locks":{},"protocol":"op_msg","durationMillis":1001}}

and so on.
These entries are logged because they are “slow queries”. But what is the query, then? I see “_recvChunkStatus”, whose value is the name of the collection (“cdrarch.cdr_af_20200830”).
The sending replica set/shard is named “db_rs002”, the receiving replica set/shard is named “db_rs035”.
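
One place I am planning to look is the migration counters in serverStatus on the primaries of the donor (db_rs002) and recipient (db_rs035) shards. I assume the “shardingStatistics” section exposes the relevant timers; the exact field names may differ between versions, so this is only a sketch:

// on the primary of the donor shard and of the recipient shard
var s = db.serverStatus().shardingStatistics
printjson({
    donorMoveChunkStarted:     s.countDonorMoveChunkStarted,
    donorChunkCloneTimeMillis: s.totalDonorChunkCloneTimeMillis,
    criticalSectionTimeMillis: s.totalCriticalSectionTimeMillis,
    docsClonedOnDonor:         s.countDocsClonedOnDonor,
    docsClonedOnRecipient:     s.countDocsClonedOnRecipient   // recipient side only
})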

Any help to get more visibility on what’s going slow under the hood would be greatly appreciated!

thx in advance!
Rob

I tried to find documentation about how chunk movements can be traced at any verbosity, but this seems to be documented nowhere.
The chunk migrations (empty chunks, zero documents in them) of newly created collections still go horribly slowly, nearly ground to a halt (one migration every 15 minutes or so).
Still, there is zero other activity on my cluster; we cannot use it until those new collections are ready, with their chunks on the right shard servers.
MongoDB server version is 4.4.0

The logs of the router servers show:

...
{"t":{"$date":"2020-08-31T06:25:09.557+02:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn357","msg":"Slow query","attr":{"type":"command","ns":"_KEYHOLE_88800.$cmd","command":{"dbStats":1,"lsid":{"id":{"$uuid":"b8998ee1-a8d9-46be-ba84-d0d1c40b19bd"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1598847907,"i":4}},"signature":{"keyId":6821045628173287455,"hash":{"$binary":"base64":"7ks/HUkgj+LvEKYINCIfalrKvVU=","subType":"0"}}}},"$db":"_KEYHOLE_88800"},"numYields":0,"reslen":11324,"protocol":"op_query","durationMillis":1216}}
{"t":{"$date":"2020-08-31T06:25:09.794+02:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn357","msg":"Slow query","attr":{"type":"command","ns":"admin.$cmd","command":{"dbStats":1,"lsid":{"id":{"$uuid":"b8998ee1-a8d9-46be-ba84-d0d1c40b19bd"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1598847908,"i":1}},"signature":{"keyId":6821045628173287455,"hash":{"$binary":"base64":"n4DO3e0Xx2Fv0taEA1k7CqkR97U=","subType":"0"}}}},"$db":"admin"},"numYields":0,"reslen":11630,"protocol":"op_query","durationMillis":234}}
{"t":{"$date":"2020-08-31T06:25:12.726+02:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn357","msg":"Slow query","attr":{"type":"command","ns":"cdrarch.$cmd","command":{"dbStats":1 ,"lsid":{"id":{"$uuid":"b8998ee1-a8d9-46be-ba84-d0d1c40b19bd"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1598847909,"i":3}},"signature":{"keyId":6821045628173287455,"hash":{"$binary":{"base64":"RaCGq/SXVR7wYB7Nb0L/G+R2OfI=","subType":"0"}}}},"$db":"cdrarch"},"numYields":0,"reslen":11718,"protocol":"op_query","durationMillis":2930}}
{"t":{"$date":"2020-08-31T06:25:14.738+02:00"},"s":"I",  "c":"SHARDING", "id":20997,   "ctx":"Uptime-reporter","msg":"Refreshed RWC defaults","attr":{"newDefaults":{}}}
{"t":{"$date":"2020-08-31T06:25:15.631+02:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn357","msg":"Slow query","attr":{"type":"command","ns":"config.$cmd","command":{"dbStats":1,"lsid":{"id":{"$uuid":"b8998ee1-a8d9-46be-ba84-d0d1c40b19bd"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1598847912,"i":1}},"signature":{"keyId":6821045628173287455,"hash":{"$binary":{"base64":"XJgwP/q5znnKbcpFEEaobUrBeF4=","subType":"0"}}}},"$db":"config"},"numYields":0,"reslen":11666,"protocol":"op_query","durationMillis":2902}}
{"t":{"$date":"2020-08-31T06:25:16.460+02:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn357","msg":"Slow query","attr":{"type":"command","ns":"_KEYHOLE_88800.$cmd","command":{"dbStats":1,"lsid":{"id":{"$uuid":"b8998ee1-a8d9-46be-ba84-d0d1c40b19bd"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1598847915,"i":2}},"signature":{"keyId":6821045628173287455,"hash":{" $binary":{"base64":"7M14WAvT4WlyqOnFwnuTbCiPLdQ=","subType":"0"}}}},"$db":"_KEYHOLE_88800"},"numYields":0,"reslen":11324,"protocol":"op_query","durationMillis":120}}
{"t":{"$date":"2020-08-31T06:25:24.746+02:00"},"s":"I",  "c":"SHARDING", "id":20997,   "ctx":"Uptime-reporter","msg":"Refreshed RWC defaults","attr":{"newDefaults":{}}}
...
{"t":{"$date":"2020-08-31T06:35:19.076+02:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn142636","msg":"Slow query","attr":{"type":"command","ns":"cdrarch.cdr_af_20200830","appName":"MongoDB Shell","command":{"moveChunk":"cdrarch.cdr_af_20200830","find":{"SHARD_MINSEC":608.0},"to":"db_rs033","_waitForDelete":true,"lsid":{"id":{"$uuid":"f734929b-2bc9-4c58-ab1b-a5027730cb7e"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1598847531,"i":2}},"signature":{"hash":{"$binary":"base64":"AvmYRG2AnH4Ui6mH7I22bm4c64w=","subType":"0"}},"keyId":6821045628173287455}},"$db":"admin"},"numYields":0,"reslen":175,"protocol":"op_msg","durationMillis":987521}}
{"t":{"$date":"2020-08-31T06:35:19.485+02:00"},"s":"I",  "c":"SH_REFR",  "id":24104,   "ctx":"ConfigServerCatalogCacheLoader-1495","msg":"Refreshed cached collection","attr":{"namespace":"cdrarch.cdr_af_20200830","newVersion":{"0":{"$timestamp":{"t":3910,"i":1}},"1":{"$oid":"5f46d8b6da305f92f9b46bdf"}},"oldVersion":" from version 3909|1||5f46d8b6da305f92f9b46bdf","durationMillis":7}}

So you see that the “moveChunk” command took nearly 1000 secs.
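
To see which migration phase is eating the time, I am also looking at the moveChunk entries in the config servers’ changelog, which (if I understand correctly) record per-step timings for every migration. Something along these lines:

// via a mongos: the most recent moveChunk changelog entries for this collection
db.getSiblingDB("config").changelog.find(
    { ns: "cdrarch.cdr_af_20200830", what: /^moveChunk/ }
).sort({ time: -1 }).limit(5).pretty()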

In many of my shard servers, I see the following WiredTiger errors in the log:

{"t":{"$date":"2020-08-29T22:05:27.033+02:00"},"s":"E",  "c":"STORAGE",  "id":22435,   "ctx":"thread1","msg":"WiredTiger error","attr":{"error":95,"message":"[1598731527:33588][86:0x7fa900293700], log-server: __posix_std_fallocate, 58: /var/lib/mongodb/journal/WiredTigerTmplog.0000000007: fallocate:: Operation not supported"}}

Are these “fallocate” errors harmless or not?

I can see that this file does exist:

# pwd
/var/lib/mongodb/journal
# ls -ltr
total 20560
-rw------- 1 mongodb mongodb 104857600 Aug 30 15:59 WiredTigerPreplog.0000000007
-rw------- 1 mongodb mongodb 104857600 Sep  1 09:52 WiredTigerPreplog.0000000010
-rw------- 1 mongodb mongodb 104857600 Sep  1 17:45 WiredTigerLog.0000000997

Hi @Rob_De_Langhe

What are the vital statistics: OS, MongoDB version? And what is the filesystem type of /var/lib/mongodb/journal?

df -T /var/lib/mongodb/journal

hi Chris,
thx a lot for your feedback.
The servers run Ubuntu 18, MongoDB 4.4.0, all inside LXC containers, with 36 shards of 3 servers each.
The servers are nearly idling (some 10-15% CPU busy, several tens of GB of RAM free).
The filesystem underneath the LXC containers (including their MongoDB directories) is ZFS.
I read about some issue with ZFS and the “fallocate” system calls used to pre-allocate big files like those journal files.
So I am now gradually converting my servers away from ZFS towards LVM/mdadm, very unfortunately, because we will miss the flexibility and ease of working with ZFS.
But I still suspect that the chunk inventory on my config servers is somehow corrupted and no longer being updated efficiently with new moves (hence the dramatic delays, I think). Also, new move requests take a very long time and then seem to be suddenly cancelled or skipped: they terminate without any message but are still not executed (as shown by “sh.status()”).
Is this assumption plausible? Is there some check that I could run to verify that all internal collections (like the chunk locations) are still consistent?
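
For example, would something along these lines be a valid sanity check, comparing the chunk counts per shard in config.chunks against what sh.status() reports? (I am assuming the 4.4 schema here, where config.chunks still carries an ns field.)

// via a mongos: chunk counts per shard for the new collection, straight from the config DB
db.getSiblingDB("config").chunks.aggregate([
    { $match: { ns: "cdrarch.cdr_af_20200830" } },
    { $group: { _id: "$shard", chunks: { $sum: 1 } } },
    { $sort: { chunks: -1 } }
])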

thx again for any further feedback, much appreciated!
Rob