Hi kevinadi,
Thanks for your suggestion.
Regarding the OOM killer, we have checked all the logs in /var/log/messages, but there is no message indicating that it was triggered. Is there any other way to check for this?
Regarding multi-document transactions, we have not yet set the feature compatibility version to 4.0, so additional memory load from them is unlikely.
Also, we have now downgraded to version 3.6 and are running under very low load, but the MongoDB nodes are still crashing. For example, one of the secondary nodes was elected primary, started responding very slowly, and eventually reached an unhealthy state, whereas the node that was originally primary was not behaving unusually.
Please find some additional stats below, and let us know if you can help us out here.
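To make the kind of check meant here concrete, below is a minimal Node.js sketch that scans log text for lines that look like OOM killer activity. The patterns are assumptions based on commonly seen Linux kernel messages and may differ between kernel versions; `findOomLines` is a hypothetical helper name, and the sample lines are made up for illustration, not taken from our servers.

```javascript
// Assumed patterns for typical kernel OOM killer messages; the exact
// wording can vary between kernel versions.
const OOM_PATTERNS = [
  /invoked oom-killer/i,
  /Out of memory: Kill(ed)? process/i,
  /Killed process \d+/i,
];

// Hypothetical helper: return every line of logText matching any pattern.
function findOomLines(logText) {
  return logText
    .split(/\r?\n/)
    .filter((line) => OOM_PATTERNS.some((re) => re.test(line)));
}

// Made-up sample lines, for illustration only.
const sample = [
  "May  2 18:30:01 host kernel: mongod invoked oom-killer: gfp_mask=0x0, order=0",
  "May  2 18:30:01 host kernel: Out of memory: Kill process 1234 (mongod) score 900 or sacrifice child",
  "May  2 18:30:01 host kernel: Killed process 1234 (mongod) total-vm:0kB",
  "May  2 18:30:02 host sshd[999]: Accepted publickey for ec2-user",
].join("\n");

const oomHits = findOomLines(sample);
oomHits.forEach((line) => console.log(line));
```

On a real host the input would come from the system log or the kernel ring buffer instead of `sample`, for example `require("fs").readFileSync("/var/log/messages", "utf8")`.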
Current version: 3.6
Memory consumption stats:
rs0:PRIMARY> db.serverStatus().wiredTiger.cache["maximum bytes configured"]
32212254720
rs0:PRIMARY> db.serverStatus().tcmalloc.tcmalloc.formattedString
------------------------------------------------
MALLOC: 28084039952 (26783.0 MiB) Bytes in use by application
MALLOC: + 7536099328 ( 7187.0 MiB) Bytes in page heap freelist
MALLOC: + 374013696 ( 356.7 MiB) Bytes in central cache freelist
MALLOC: + 2279168 ( 2.2 MiB) Bytes in transfer cache freelist
MALLOC: + 260880624 ( 248.8 MiB) Bytes in thread cache freelists
MALLOC: + 114385152 ( 109.1 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 36371697920 (34686.8 MiB) Actual memory used (physical + swap)
MALLOC: + 2148909056 ( 2049.4 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 38520606976 (36736.1 MiB) Virtual address space used
MALLOC:
MALLOC: 610748 Spans in use
MALLOC: 449 Thread heaps in use
MALLOC: 4096 Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
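As a rough cross-check of the figures above, here is a small Node.js sketch that redoes the tcmalloc arithmetic. The byte values are copied from the output above; the variable names are ours and purely illustrative. It sums the freelists, recomputes "Actual memory used", and compares the application's in-use bytes with the configured WiredTiger cache maximum.

```javascript
// Byte values copied from the serverStatus output above.
const tcmalloc = {
  inUseByApp: 28084039952,
  pageHeapFree: 7536099328,
  centralCacheFree: 374013696,
  transferCacheFree: 2279168,
  threadCacheFree: 260880624,
  mallocMetadata: 114385152,
};
const wiredTigerCacheMaxBytes = 32212254720;

const MIB = 1024 * 1024;
const toMiB = (bytes) => (bytes / MIB).toFixed(1) + " MiB";

// Memory tcmalloc holds in its freelists (allocated from the OS, not in use).
const freelistBytes =
  tcmalloc.pageHeapFree +
  tcmalloc.centralCacheFree +
  tcmalloc.transferCacheFree +
  tcmalloc.threadCacheFree;

// Should reproduce the "Actual memory used (physical + swap)" line.
const actualUsedBytes =
  tcmalloc.inUseByApp + freelistBytes + tcmalloc.mallocMetadata;

// Share of the allocator's footprint that is sitting in freelists.
const freelistShare = freelistBytes / actualUsedBytes;

console.log("freelists:         ", toMiB(freelistBytes));
console.log("actual memory used:", toMiB(actualUsedBytes));
console.log("freelist share:    ", (freelistShare * 100).toFixed(1) + "%");
console.log(
  "in use vs cache max:",
  toMiB(tcmalloc.inUseByApp),
  "of",
  toMiB(wiredTigerCacheMaxBytes)
);
```

The point of the sketch is only that the application's in-use bytes are below the configured cache maximum, while a sizeable part of the total footprint is held in tcmalloc freelists.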
Error log before the secondary transitioned to primary:
2021-05-02T18:29:58.804+0000 I COMMAND [conn448567] command admin.$cmd command: isMaster { ismaster: 1, $clusterTime: { clusterTime: Timestamp(1619980198, 12), signature: { hash: BinData(0, ), keyId: } }, $db: "admin", $readPreference: { mode: "primary" } } numYields:0 reslen:678 locks:{} protocol:op_msg 572ms
2021-05-02T18:29:58.804+0000 I NETWORK [listener] connection accepted from ip:54178 #457578 (916 connections now open)
2021-05-02T18:29:58.805+0000 I NETWORK [conn457577] received client metadata from ip:54176 conn457577: { driver: { name: "mongo-java-driver|legacy", version: "3.11.2" }, os: { type: "Linux", name: "Linux", architecture: "amd64", version: "4.14.219-119.340.amzn1.x86_64" }, platform: "Java/Eclipse OpenJ9/1.8.0_252-b09" }
2021-05-02T18:29:58.805+0000 I COMMAND [conn457572] command dbname.$cmd command: saslContinue { saslContinue: 1, conversationId: 1, payload: BinData(0, ), $db: "dbname" } numYields:0 reslen:203 locks:{} protocol:op_query 1018ms
2021-05-02T18:29:58.805+0000 I COMMAND [conn457573] command dbname.$cmd command: saslContinue { saslContinue: 1, conversationId: 1, payload: BinData(0, ), $db: "dbname" } numYields:0 reslen:203 locks:{} protocol:op_query 1018ms
Data size: 94GB
RAM: 64GB
Cache: 30GB
Instance size: r4.2xlarge