MongoDB replica set crashes after upgrading to version 4.0

We have a three-node replica set (one primary and two secondaries) with 150GB of data and heavy read and write traffic.
While the cluster was running version 3.6, we observed a large number of slow queries, and request processing was taking more than 10 seconds.
As part of performance optimisation, and to get multi-document transactions, we upgraded MongoDB to 4.0.23. Query processing speed improved, but within 2-3 hours under load the primary goes down and one of the secondary nodes becomes unhealthy.
Why does MongoDB survive high load on version 3.6 but not on 4.0?
Can anyone help here?

Oplog: 71 GB
RAM: 64 GB
Instance: r4.2xlarge EC2 instance

Please find the output of the mongostat command below.

insert query update delete getmore command dirty  used flushes vsize   res   qrw   arw net_in net_out conn set repl                time
   187    *0   1855     45     196   544|0  5.1% 79.9%       0 34.8G 28.3G 0|456 6|128  3.21m    120m 1264 rs0  PRI Apr 23 17:33:31.297
   198    *0   1619     21     199   519|0  5.2% 80.1%       0 34.8G 28.3G 0|469 4|128  2.93m    125m 1264 rs0  PRI Apr 23 17:33:32.298
   172    *0   3200     48     158   613|0  5.7% 80.4%       0 34.8G 28.3G 0|268 1|128  4.17m    122m 1263 rs0  PRI Apr 23 17:33:33.318
   101    *0   7378     86      14   731|0  4.9% 79.2%       0 34.8G 28.3G 0|343 3|128  6.05m    133m 1262 rs0  PRI Apr 23 17:33:34.296
   133    *0   2521     42      11   419|0  5.6% 79.9%       0 34.8G 28.3G 0|393 6|128  3.30m    188m 1261 rs0  PRI Apr 23 17:33:35.302
   136    *0   3022     83     126   536|0  5.5% 80.0%       0 34.8G 28.3G 0|422 2|127  4.09m    178m 1261 rs0  PRI Apr 23 17:33:36.316
   175    *0   2585     52     175   623|0  5.2% 79.8%       0 34.8G 28.3G 0|449 4|128  3.66m    158m 1268 rs0  PRI Apr 23 17:33:37.298
    95    *0   2027     31     180   528|0  5.1% 79.9%       0 34.8G 28.3G 0|429 4|128  2.92m    127m 1270 rs0  PRI Apr 23 17:33:38.298
    83    *0   2418     41     149   523|0  4.9% 79.4%       0 34.8G 28.3G 0|419 4|128  3.65m    146m 1270 rs0  PRI Apr 23 17:33:39.297
   148    *0   2288     45     171   556|0  4.9% 79.7%       0 34.8G 28.3G 0|453 1|128  3.26m    109m 1270 rs0  PRI Apr 23 17:33:40.307
insert query update delete getmore command dirty  used flushes vsize   res   qrw   arw net_in net_out conn set repl                time
   144    *0   1333     27     242   535|0  5.1% 80.0%       0 34.8G 28.3G 0|489 4|128  2.57m    103m 1271 rs0  PRI Apr 23 17:33:41.297
   140    *0   1125     22     226   458|0  5.1% 80.4%       0 34.8G 28.3G 0|498 3|128  2.42m   96.7m 1271 rs0  PRI Apr 23 17:33:42.298
   183    *0   1778     50     210   530|0  5.1% 80.2%       0 34.8G 28.3G 0|474 8|128  2.77m    122m 1271 rs0  PRI Apr 23 17:33:43.296
   217    *0   1680     30     217   556|0  4.8% 79.9%       0 34.8G 28.3G 0|408 1|127  3.01m    122m 1271 rs0  PRI Apr 23 17:33:44.311
   200    *0   4268     49     123   611|0  5.2% 80.4%       0 34.8G 28.3G 0|239 2|128  4.68m    136m 1271 rs0  PRI Apr 23 17:33:45.601
   160    *0   6644     81      24   644|0  5.7% 80.8%       0 34.8G 28.3G 0|303 3|128  6.11m    213m 1271 rs0  PRI Apr 23 17:33:46.300
    62    *0   5432     85      57   554|0  5.0% 79.9%       0 34.8G 28.3G 0|254 1|126  5.20m    148m 1271 rs0  PRI Apr 23 17:33:47.311
    89    *0   9086     90      26   545|0  5.2% 80.2%       0 34.8G 28.3G 0|146 3|128  6.07m    130m 1274 rs0  PRI Apr 23 17:33:48.300
    30    *0   8915     72       7   541|0  4.9% 79.7%       0 34.8G 28.3G 0|243 1|128  6.29m    128m 1279 rs0  PRI Apr 23 17:33:49.318
    33    *0  13579     56       2   375|0  5.4% 80.1%       0 34.8G 28.3G  0|73 1|123  7.62m   37.7m 1275 rs0  PRI Apr 23 17:33:50.409
insert query update delete getmore command dirty  used flushes vsize   res   qrw   arw net_in net_out conn set repl                time
    23    *0  12968     86       7   380|0  5.8% 80.0%       0 34.8G 28.3G  0|54 2|127  7.81m    111m 1272 rs0  PRI Apr 23 17:33:51.331
    12    *0   6958     82      14   548|0  5.1% 79.9%       0 34.8G 28.3G 0|268 3|128  4.90m    259m 1273 rs0  PRI Apr 23 17:33:52.298
    95    *0   8353    105      61   453|0  5.5% 80.2%       0 34.8G 28.3G  0|70 3|128  6.28m    103m 1275 rs0  PRI Apr 23 17:33:53.299
    95    *0   8484    103      65   410|0  5.3% 79.7%       0 34.8G 28.3G 0|171 3|128  6.08m    138m 1278 rs0  PRI Apr 23 17:33:54.299
    90    *0   9836     52      49   163|0  4.6% 79.5%       0 34.8G 28.3G  0|13 1|128  5.71m   89.4m 1279 rs0  PRI Apr 23 17:33:55.343

Hi @Rakhi_Maheshwari, welcome to the community!

From the mongostat output you posted, it appears that the node was heavily utilized, right up to its configured memory limits. I would think that the mongod process was killed by the OOM killer. Can you confirm whether this is the case? Note that by default EC2 instances do not come configured with swap, so if any process takes too much memory, the OS will simply kill it.
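
To confirm, something like the following usually shows OOM killer activity (a rough sketch; log paths and grep patterns vary by distribution):

# kernel ring buffer
dmesg -T | grep -iE 'out of memory|oom-killer|killed process'
# syslog on Amazon Linux / RHEL-style hosts (use /var/log/syslog on Debian/Ubuntu)
sudo grep -iE 'out of memory|oom|killed process' /var/log/messages*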

If the server keeps getting killed by the OOM killer, it suggests that your instance size is too small for the workload you're putting on it. One possible solution is to provision larger hardware and see if the issue persists.

Why does MongoDB survive high load on version 3.6 but not on 4.0?

You mentioned earlier that you upgraded to 4.0 for multi-document transactions. Is your app using this feature? If yes, multi-document transactions will incur additional memory load (see Performance Best Practices: Transactions and Read/Write Concerns), especially if there are a lot of them.
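
A quick way to see whether transactions are actually being exercised is the transactions section of serverStatus() on the primary (added around mongod 4.0.2, if I remember correctly; just a sketch, the counters you see will vary):

// shows counters such as totalStarted / totalCommitted / totalAborted
rs0:PRIMARY> db.serverStatus().transactions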

On another note, I would encourage you to try out the newer MongoDB versions (4.4.5 currently) and see if it helps your situation, since many improvements have been made since 4.0.

Best regards,
Kevin


Hi @kevinadi,
Thanks for your suggestion.
To check for the OOM killer we went through /var/log/messages, but there is no log message indicating an OOM kill. Is there any other way to check this?
Regarding multi-document transactions, we have not yet raised the feature compatibility version to 4.0, so additional memory load from transactions is unlikely.
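For reference, the feature compatibility version can be double-checked with:

rs0:PRIMARY> db.adminCommand({ getParameter: 1, featureCompatibilityVersion: 1 })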

Also, we have now downgraded to version 3.6 and are running with very low load, but MongoDB nodes are still crashing: one of the secondary nodes was elected primary, started responding very slowly, and ultimately became unhealthy, whereas the original primary was not behaving unusually.

Please find some additional stats below and let us know if you can help us out here.

Current version 3.6
Memory consumption Stats:
rs0:PRIMARY> db.serverStatus().wiredTiger.cache["maximum bytes configured"]
32212254720
rs0:PRIMARY> db.serverStatus().tcmalloc.tcmalloc.formattedString
------------------------------------------------
MALLOC:    28084039952 (26783.0 MiB) Bytes in use by application
MALLOC: +   7536099328 ( 7187.0 MiB) Bytes in page heap freelist
MALLOC: +    374013696 (  356.7 MiB) Bytes in central cache freelist
MALLOC: +      2279168 (    2.2 MiB) Bytes in transfer cache freelist
MALLOC: +    260880624 (  248.8 MiB) Bytes in thread cache freelists
MALLOC: +    114385152 (  109.1 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =  36371697920 (34686.8 MiB) Actual memory used (physical + swap)
MALLOC: +   2148909056 ( 2049.4 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =  38520606976 (36736.1 MiB) Virtual address space used
MALLOC:
MALLOC:         610748              Spans in use
MALLOC:            449              Thread heaps in use
MALLOC:           4096              Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.

Error log before the secondary transitioned into primary:

2021-05-02T18:29:58.804+0000 I COMMAND  [conn448567] command admin.$cmd command: isMaster { ismaster: 1, $clusterTime: { clusterTime: Timestamp(1619980198, 12), signature: { hash: BinData(0, ), keyId: } }, $db: "admin", $readPreference: { mode: "primary" } } numYields:0 reslen:678 locks:{} protocol:op_msg 572ms
2021-05-02T18:29:58.804+0000 I NETWORK  [listener] connection accepted from ip:54178 #457578 (916 connections now open)
2021-05-02T18:29:58.805+0000 I NETWORK  [conn457577] received client metadata from ip:54176 conn457577: { driver: { name: "mongo-java-driver|legacy", version: "3.11.2" }, os: { type: "Linux", name: "Linux", architecture: "amd64", version: "4.14.219-119.340.amzn1.x86_64" }, platform: "Java/Eclipse OpenJ9/1.8.0_252-b09" }
2021-05-02T18:29:58.805+0000 I COMMAND  [conn457572] command dbname.$cmd command: saslContinue { saslContinue: 1, conversationId: 1, payload: BinData(0, ), $db: "dbname" } numYields:0 reslen:203 locks:{} protocol:op_query 1018ms
2021-05-02T18:29:58.805+0000 I COMMAND  [conn457573] command dbname.$cmd command: saslContinue { saslContinue: 1, conversationId: 1, payload: BinData(0, ), $db: "dbname" } numYields:0 reslen:203 locks:{} protocol:op_query 1018ms

Data size: 94 GB
RAM: 64 GB
Cache: 30 GB
Instance size: r4.2xlarge

@kevinadi Please let me know if you can guide us here.

Now that you are back on 3.6 and still experiencing the same issue, I think you need to focus your attention on the platform.

Review and ensure you are implementing the recommendations in the Production Notes and Operations Checklist, specifically the AWS EC2 guidance.
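
For example, two checks from the Production Notes that are commonly missed (just a sketch, run on each mongod host):

# transparent huge pages should report [never]
cat /sys/kernel/mm/transparent_hugepage/enabled
# open-file limit for the user running mongod; 64000 or higher is recommended
ulimit -n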

Review your mongod logs, check for warnings at startup, and fix them.

Is the whole host 'crashing', or just mongod?

Are the hosts dedicated to MongoDB, or is other software co-located on them?

How do the other host metrics look: load average, I/O stats, CPU usage? (A few example commands are sketched at the end of this post.)

Also check dmesg for OOM kills.
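
A rough sketch of those host-level checks (iostat is part of the sysstat package; the intervals are only examples):

uptime          # 1/5/15-minute load averages
iostat -xm 5    # extended per-device I/O statistics every 5 seconds
vmstat 5        # memory pressure, swap activity, run queue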


Hi @chris,

We have now done an AMI restore and the server is working as expected, but we have still not reached any conclusion about what effect the upgrade to 4.0 and the subsequent downgrade had on the MongoDB environment.

Please find the stats from before the AMI restore:
CPU: fluctuating between 6% and 50%
1-minute load average: 29.5

mongod reached an unhealthy / not reachable state.

That is a normalised load average of 3.69 (29.5 / 8 vCPUs). This should be below 1 on a healthy system. Given that the actual CPU usage you reported is low, you're bottlenecked elsewhere. This could be memory pressure and swapping to disk (without swap you get very high load before the OOM killer kicks in), or your disk I/O is hitting a limit.
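
If it helps, you can compute the normalised figure directly on the host (a minimal sketch; assumes bc is installed):

# 1-minute load average divided by the number of vCPUs
echo "scale=2; $(cut -d' ' -f1 /proc/loadavg) / $(nproc)" | bc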

@chris With the same load average the server was running fine until the upgrade/downgrade activity. Now that we have done the AMI restore, the server is running fine again.
Is it possible that some configuration changed during the upgrade to 4.0 and the subsequent downgrade?
Is there any way to figure this out?
We are now unsure whether to proceed with the upgrade to 4.0, because it will put the production environment at risk.

Please help us here.

You may think it is running fine. It's not. With a load average that high you are under-resourced somewhere. In my opinion, the changes you made (upgrade and downgrade) are highlighting this issue, not causing it.
