MongoDB gradually allocates all free system memory after start and then stops responding

Hello, everyone. I need help with the following problem:

Initially, the database was launched in replica set mode, but with a single member. It worked properly (the database size is about 7 TB). Then I added the other members (a secondary and an arbiter). After a few hours of normal functioning and active replication to the secondary, the primary began to utilize disk and RAM heavily: within about 30 minutes the disk IOPS and read/write bandwidth increased several times over and reached the maximum the hypervisor's storage subsystem could provide.

This behavior had too strong an impact on production, so I disconnected the replica (and the arbiter) and turned the configuration back into a single-member replica set (by means of rs.reconfig). The data in the DB remained accessible, but disk utilization did not decrease, and after several minutes the DB stopped responding and was terminated incorrectly by the OOMKiller (k8s). During the next start it replayed the oplog (for about 1.5 hours) and then started correctly, and the data in the collections was available. But the utilization problem persisted. Moreover, immediately after start it constantly caches something in RAM and eventually allocates all of it, even when cache limits are set. As a result, the database cannot be terminated correctly, and on the next startup it processes the oplog again.
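The reconfiguration back to a single member was roughly along these lines (a sketch; the member index is assumed, and force is only needed if a voting majority is unreachable):

    cfg = rs.conf()
    cfg.members = [cfg.members[0]]      // keep only the original primary
    rs.reconfig(cfg, { force: true })   // force only if a majority of voting members is unreachable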

Is there any solution to this problem?

DB settings:

  1. MongoDB version: 4.0.20

  2. Memory limit.

    resources:
      limits:
        memory: 60Gi
      requests:
        memory: 4Gi

    extraFlags:
      - "--wiredTigerCacheSizeGB=29.5"

In my case the particular value of “wiredTigerCacheSizeGB” only influences how fast the DB reaches the unresponsive state. The DB absorbs 80% of the system memory limit within about 15-20 minutes and then gradually reaches 100%.

  3. db.serverStatus().wiredTiger.cache

It is very strange that “bytes dirty in the cache cumulative” shows 13078239297687 bytes ≈ 13 TB (about twice the size of the data itself).

"cache" : {
      "application threads page read from disk to cache count" : 1342764,
      "application threads page read from disk to cache time (usecs)" : 4619621496,
      "application threads page write from cache to disk count" : 436685,
      "application threads page write from cache to disk time (usecs)" : 54931268,
      "bytes belonging to page images in the cache" : 315569,
      "bytes belonging to the cache overflow table in the cache" : 275342,
      "bytes currently in the cache" : 94026788363,
      "bytes dirty in the cache cumulative" : NumberLong("13078239297687"),
      "bytes not belonging to page images in the cache" : 94026472793,
      "bytes read into cache" : 584933873969,
      "bytes written from cache" : 170461214284,
      "cache overflow cursor application thread wait time (usecs)" : 82704000,
      "cache overflow cursor internal thread wait time (usecs)" : 0,
      "cache overflow score" : 37,
      "cache overflow table entries" : 591169,
      "cache overflow table insert calls" : 8793587,
      "cache overflow table max on-disk size" : 0,
      "cache overflow table on-disk size" : 121216278528,
      "cache overflow table remove calls" : 8287880,
      "checkpoint blocked page eviction" : 5081,
      "eviction calls to get a page" : 275177050,
      "eviction calls to get a page found queue empty" : 16456913,
      "eviction calls to get a page found queue empty after locking" : 2774495,
      "eviction currently operating in aggressive mode" : 100,
      "eviction empty score" : 100,
      "eviction passes of a file" : 7554413,
      "eviction server candidate queue empty when topping up" : 2536340,
      "eviction server candidate queue not empty when topping up" : 241258,
      "eviction server evicting pages" : 0,
      "eviction server slept, because we did not make progress with eviction" : 3403555,
      "eviction server unable to reach eviction goal" : 3607,
      "eviction server waiting for a leaf page" : 9376093,
      "eviction server waiting for an internal page sleep (usec)" : 0,
      "eviction server waiting for an internal page yields" : 1093,
      "eviction state" : 107,
      "eviction walk target pages histogram - 0-9" : 1870311,
      "eviction walk target pages histogram - 10-31" : 3038674,
      "eviction walk target pages histogram - 128 and higher" : 0,
      "eviction walk target pages histogram - 32-63" : 605907,
      "eviction walk target pages histogram - 64-128" : 2039521,
      "eviction walk target strategy both clean and dirty pages" : 2130362,
      "eviction walk target strategy only clean pages" : 44580,
      "eviction walk target strategy only dirty pages" : 5379471,
      "eviction walks abandoned" : 6813,
      "eviction walks gave up because they restarted their walk twice" : 1213802,
      "eviction walks gave up because they saw too many pages and found no candidates" : 24685,
      "eviction walks gave up because they saw too many pages and found too few candidates" : 1174,
      "eviction walks reached end of tree" : 3742597,
      "eviction walks started from root of tree" : 1248697,
      "eviction walks started from saved location in tree" : 6305716,
      "eviction worker thread active" : 4,
      "eviction worker thread created" : 0,
      "eviction worker thread evicting pages" : 198482643,
      "eviction worker thread removed" : 0,
      "eviction worker thread stable number" : 0,
      "files with active eviction walks" : 1,
      "files with new eviction walks started" : 2528795,
      "force re-tuning of eviction workers once in a while" : 0,
      "forced eviction - pages evicted that were clean count" : 283442,
      "forced eviction - pages evicted that were clean time (usecs)" : 1180313,
      "forced eviction - pages evicted that were dirty count" : 63987,
      "forced eviction - pages evicted that were dirty time (usecs)" : 5217743,
      "forced eviction - pages selected because of too many deleted items count" : 17,
      "forced eviction - pages selected count" : 529024,
      "forced eviction - pages selected unable to be evicted count" : 144086,
      "forced eviction - pages selected unable to be evicted time" : 30387217,
      "hazard pointer blocked page eviction" : 331945,
      "hazard pointer check calls" : 252876663,
      "hazard pointer check entries walked" : 3677035589,
      "hazard pointer maximum array length" : 4,
      "in-memory page passed criteria to be split" : 280400,
      "in-memory page splits" : 37553,
      "internal pages evicted" : 326505,
      "internal pages queued for eviction" : 80139840,
      "internal pages seen by eviction walk" : 114633601,
      "internal pages seen by eviction walk that are already queued" : 25123170,
      "internal pages split during eviction" : 23,
      "leaf pages split during eviction" : 49087,
      "maximum bytes configured" : 3221225472,
      "maximum page size at eviction" : 4258367,
      "modified pages evicted" : 111674149,
      "modified pages evicted by application threads" : 0,
      "operations timed out waiting for space in cache" : 0,
      "overflow pages read into cache" : 0,
      "page split during eviction deepened the tree" : 1,
      "page written requiring cache overflow records" : 266818,
      "pages currently held in the cache" : 35579,
      "pages evicted by application threads" : 54387447,
      "pages queued for eviction" : 253618652,
      "pages queued for eviction post lru sorting" : 262981810,
      "pages queued for urgent eviction" : 2995874,
      "pages queued for urgent eviction during walk" : 45540,
      "pages read into cache" : 4620547,
      "pages read into cache after truncate" : 842,
      "pages read into cache after truncate in prepare state" : 0,
      "pages read into cache requiring cache overflow entries" : 165182,
      "pages read into cache requiring cache overflow for checkpoint" : 162023,
      "pages read into cache skipping older cache overflow entries" : 10,
      "pages read into cache with skipped cache overflow entries needed later" : 10,
      "pages read into cache with skipped cache overflow entries needed later by checkpoint" : 0,
      "pages requested from the cache" : 54419469,
      "pages seen by eviction walk" : 387794142,
      "pages seen by eviction walk that are already queued" : 43766341,
      "pages selected for eviction unable to be evicted" : 136449662,
      "pages selected for eviction unable to be evicted as the parent page has overflow items" : 0,
      "pages selected for eviction unable to be evicted because of active children on an internal page" : 78742275,
      "pages selected for eviction unable to be evicted because of failure in reconciliation" : 3986,
      "pages selected for eviction unable to be evicted due to newer modifications on a clean page" : 0,
      "pages walked for eviction" : 17645569248,
      "pages written from cache" : 1398583,
      "pages written requiring in-memory restoration" : 111527631,
      "percentage overhead" : 8,
      "tracked bytes belonging to internal pages in the cache" : 13566324,
      "tracked bytes belonging to leaf pages in the cache" : 94013222039,
      "tracked dirty bytes in the cache" : 956433,
      "tracked dirty pages in the cache" : 30,
      "unmodified pages evicted" : 4203659
    }

Right now I have moved the DB from the k8s installation to docker-compose with MongoDB v4.0.23 for recovery procedures, but I limited the system resources in Docker severely so that production could keep working unaffected, so now the oplog check process will take about 40 hours.
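The resource limiting in docker-compose was roughly along these lines (a sketch; the values, image tag and paths are illustrative, not the exact ones used):

    version: "2.2"
    services:
      mongodb:
        image: mongo:4.0.23
        command: ["mongod", "--wiredTigerCacheSizeGB", "4"]
        mem_limit: 8g       # hard memory cap for the container
        cpus: 2             # CPU quota
        volumes:
          - /data/mongodb:/data/db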

Another interesting fact is that the oplog size is larger than “maxSize”:

    "ns" : "local.oplog.rs",
    "size" : 119937167105,
    "count" : 1888822,
    "avgObjSize" : 63498,
    "storageSize" : 122149011456,
    "capped" : true,
    "max" : -1,
    "maxSize" : NumberLong("52560538112"),
    "sleepCount" : 0,
    "sleepMS" : 0
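For reference, this output comes from the collection stats of the oplog, i.e. something like:

    // Collection stats for the oplog, in the mongo shell
    db.getSiblingDB("local").oplog.rs.stats()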

If I start mongod without the replica set option, it works OK and almost all of the data is accessible, but the data from the last several hours of received traffic is lost and exists only in the oplog. The oplog is corrupted, but I can still access its entries. The main problem is that each request takes too much time. Is it safe to run a MongoDB repair in that case without losing the data in the oplog?
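To be clear, by repair I mean the standalone repair run, roughly (the dbpath is illustrative):

    # run with mongod stopped, against the data files directly
    mongod --dbpath /data/db --repair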

I managed to export the oplog with MongoDB Compass (mongoexport didn't help). The real size of the oplog is more than 160 GB, and it's not because it was compressed in the DB: Compass was showing the same size of 120 GB, and its progress bar sat at 100% exactly at that 120 GB mark for several hours until the export finished.
So I parsed the oplog, found all the relevant data and uploaded it to the databases and collections where it belongs.
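A sketch of that kind of selective replay (here reading straight from local.oplog.rs rather than from the export; the target namespace mydb.mycollection is illustrative):

    // Replay insert entries ("op" : "i") for one namespace from the oplog
    // back into the collection they belong to; entry.o is the inserted document.
    var oplog = db.getSiblingDB("local").oplog.rs;
    oplog.find({ op: "i", ns: "mydb.mycollection" }).forEach(function (entry) {
        db.getSiblingDB("mydb").mycollection.insertOne(entry.o);
    });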

Hi @Timur_Pyatkov welcome to the community!

There is a lot of information to unpack here, but if I understand correctly, here is the situation:

  1. You are attempting to add a replica set member to an existing single-node replica set. However, this fails because the current member becomes overloaded or is OOMkilled while the new member is doing its initial sync.
  2. The deployment is being used in a production capacity while the initial sync is progressing.
  3. The deployment works well as a single-node replica set. The issues only start happening after adding a new replica set member.

Are those accurate descriptions of the issue you’re seeing?

If yes, it seems that your deployment’s resources are just enough to service your production load but cannot handle the load of production+initial sync. The initial sync process is relatively resource intensive, since it requires the server to do a collection scan (i.e. read everything into memory) to send it to the syncing secondary.

The WiredTiger metrics you posted point to a heavily overloaded system. At these numbers, I would expect the database to stop responding to input, since it has such a large backlog of work to do and the underlying hardware cannot service that work (e.g. reads & writes from disk) fast enough. I believe you observed this stalling as well.

Since the disk cannot service the work fast enough, MongoDB was forced to hold onto an increasing amount of pending work in memory, resulting in increasing memory usage until it was killed by the OOMkiller.

I can think of some possible ways to move forward:

  1. Increase the hardware provisioning of the server until it can handle the production + initial sync load without stalling. Typically this means much more memory and faster disks. This may be the simplest, most straightforward method.
  2. Stop the production workload while the initial sync is progressing, which could take a while since you have a large amount of data.
  3. Follow the procedure to Sync by Copying Data Files from Another Member. Note that this requires downtime while file copying is happening, and is the most complicated method of the three.

On another note, oplog size can actually grow beyond its maxSize (see Replica Set Oplog) if this is required for the correct operation of the server. This is one sign that the server is overloaded.

Best regards,
Kevin

Hi @Timur_Pyatkov

Thinking a bit further about your issue, it seems I totally neglected in my previous post the fact that you are running a Primary-Secondary-Arbiter replica set.

Note that since the introduction of read concern majority, which is needed to support change streams, the P-S-A architecture may incur additional memory load under certain conditions, due to the inability of the Arbiter to acknowledge writes.

As an overview, read concern majority guarantees that the data read has been acknowledged by a majority of the replica set members (i.e. the documents read are durable and guaranteed not to roll back).

In a steady-state replica set, a write is performed on the Primary, the Secondary follows, and the “majority commit point” is updated. If the Secondary is doing an initial sync, it cannot acknowledge writes, thus the “majority commit point” cannot be advanced, and this puts additional pressure on the Primary’s memory usage. There are some ways to improve this situation:

  1. Stop all writes while the initial sync is happening
  2. Sync by copying data files
  3. Disable majority read concern

Note that although disabling read concern majority may relieve some of the memory pressure on the Primary, it is not a silver bullet if the Primary is under-provisioned to begin with.
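For reference, a minimal sketch of disabling majority read concern via the mongod configuration file in MongoDB 4.0 (the command-line equivalent is --enableMajorityReadConcern false):

    # mongod.conf: disable read concern "majority"
    replication:
      enableMajorityReadConcern: false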

Please note that although the PSA (Primary Secondary Arbiter) configuration is supported, new features of MongoDB such as change streams, transactions, etc. work best without an Arbiter. Thus the PSS (Primary Secondary Secondary) configuration is strongly recommended for new deployments.
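For example, a minimal sketch of converting PSA to PSS in the mongo shell (hostnames are illustrative; the new member will perform an initial sync once added):

    // Replace the arbiter with a second data-bearing member
    rs.remove("arbiter.example.net:27017")
    rs.add("mongo3.example.net:27017")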

Best regards,
Kevin

Hi, @kevinadi.

I’m sorry I didn’t reply in time, but during that period my priority was to restore the data and solve the issue.
In the end your answers really helped. I increased the memory limits for the DBs and moved from the PSA to the PSS configuration. The switch from PSA to PSS was the most effective step, because now the initial sync works without problems even under a high workload. So now I use PSS for all MongoDBs in that k8s cluster, even for DBs with low load.

Thank you very much for your help!

Best regards,
Timur.
