Primary node in replica set down and 2 weeks of data lost

Hello,

I am facing an issue with my MongoDB cluster. After a period of degraded performance, the primary node of the replica set crashed and never restarted. After another node was elected primary, I noticed that a lot of data was gone.

I have checked the backups (mongodump and oplogs) and found no trace of the data from the last few days.
In the mongod logs I noticed a lot of errors like this:
[LogicalSessionCacheRefresh] Failed to refresh session cache: WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true } at rs1
Those errors coincide with the dates when the data was lost.

Is there a way to recover the lost data? And why was the data not synced from memory to disk?

Thank you!

What is the:

  • OS
  • MongoDB Version
  • Topology

Hello Chris,

The cluster is dockerized, the MongoDB version is 4.0, the container OS is Ubuntu 16.04, and the disks are persistent.
The topology is as follows:
1 mongos
1 config server replica set (1 primary, 2 secondaries)
2 shard replica sets (1 primary and 2 secondaries each)

Hi @Erengroth_N_A

It’s been a while since your last post. Is this still an issue for you?

I have checked the backups (mongodump and oplogs) and found no trace of the data from the last few days.

If you’re using a PSS setup (primary with two secondaries) across the whole cluster, all writes should appear in the oplog of a functioning replica set.
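If you want to double-check, here is a minimal sketch in the mongo shell, run directly against a data-bearing node (not through mongos); the namespace below is only a placeholder:

// Inspect the most recent oplog entries for a given namespace ("mydb.mycoll" is a placeholder).
db.getSiblingDB("local").oplog.rs.find({ ns: "mydb.mycoll" }).sort({ $natural: -1 }).limit(5).pretty()

// Shows the time window currently covered by the oplog on this node.
rs.printReplicationInfo()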

Is there a way to recover the lost data? And why was the data not synced from memory to disk?

Data is synced to disk every 60 seconds by WiredTiger.
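If you ever need to verify this yourself, there is a standard admin command that forces mongod to flush all pending writes to the data files (a general sketch, not something specific to your setup):

// Run as a user with admin privileges on the mongod in question.
db.adminCommand({ fsync: 1 })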

I’m not sure I understand the whole picture here. If this is still an issue, could you post:

  1. Your MongoDB version (output of db.version() from the mongo shell, for all nodes in the cluster)
  2. A detailed description of your topology
  3. Description of the docker environment (settings, command line parameters, etc.)
  4. What exactly happened in detail
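For item 1, something along these lines against each node should be enough (the host and port below are placeholders):

// From a mongo shell connected to each mongod / mongos in turn:
db.version()

// Or non-interactively, for example:
//   mongo --host <host>:27017 --eval "db.version()"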

Best regards
Kevin

Hello @kevinadi and thank you for the reply.

We have managed to partially recover the lost data from other internal sources, but we are still missing data from 2 collections.
Right now the cluster is operating within normal parameters, but any advice or clues about what happened, and whether we can still recover any additional data, would be more than welcome.

  1. The MongoDB versions used are v4.0.6 and v4.0.20 (v4.0.20 runs only on the second mongos; v4.0.6 runs on all other nodes)

  2. The topology is as follows:

2 load balancers, each pointing to a different mongos, for failover and to avoid cursor errors.
1 config server replica set (PSS)
1 replica set with sharded DBs named rs1 (PSS)
1 replica set with sharded DBs named rs2 (PSS)

The exact diagram can be found here, the only difference being that we use 2 load balancers, each pointing to a single mongos:


https://www.percona.com/blog/wp-content/uploads/2017/10/Load-Balanced.png

The sharding status is:

--- Sharding Status ---
  sharding version: {
        "_id" : 1,
        "minCompatibleVersion" : 5,
        "currentVersion" : 6,
        "clusterId" : ObjectId("5c76ce009da1de19549a8f08")
  }
  shards:
        {  "_id" : "rs1",  "host" : "rs1/10.42.100.198:27017,10.42.119.126:27017,10.42.130.8:27017",  "state" : 1 }
        {  "_id" : "rs2",  "host" : "rs2/10.42.129.166:27017,10.42.235.199:27017,10.42.253.76:27017",  "state" : 1 }
  active mongoses:
        "4.0.20" : 1
        "4.0.6" : 1
  autosplit:
        Currently enabled: yes
  balancer:
        Currently enabled:  yes

The startup command for the mongo nodes is: mongod --keyFile /pathto/MONGODB_KEYFILE --replSet RS_NAME --shardsvr --dbpath /data/db --port 27017 --bind_ip_all
Right now the cluster is working normally, so I do not think the rs.status() / rs.conf() output is relevant, but if needed I can post it later in the thread.

  3. The Docker containers used for the MongoDB cluster are all bound to specific IPs and set to restart always.
    The Docker and OS configuration is the default; no tuning was necessary, at least so far, given the hardware capabilities of the servers.
    The OS is Ubuntu 16.04 LTS (quite old, but an upgrade is planned).

  4. It all started with a network communication error, which led to one of the nodes in the replica set being elected primary. From the moment that node was promoted, data was no longer being saved to disk.
    We have internal log monitoring, but at that point we did not have any alerts set for replication errors. We noticed an increase in response time in one of our apps, and when we started digging we found the following errors in the MongoDB logs:

W SHARDING [replSetDistLockPinger] pinging failed for distributed lock pinger :: caused by :: WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true }
I CONTROL  [LogicalSessionCacheRefresh] Failed to refresh session cache: WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true } at rs1
I CONTROL  [LogicalSessionCacheRefresh] Failed to refresh session cache: WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true } at rs1

All the errors were on the rs1 replica set; the rs2 replica set was working normally. Since we have multiple backups (a full dump once a day, incrementals every 15 minutes, and disk snapshots),
I decided to promote another node to primary and add the missing data afterwards. After the promotion command was issued, the container restarted unexpectedly. The node in question was promoted to primary, but it was missing all data written since the previous node had been promoted.
Checking the backups and the oplogs, I noticed the data was missing there as well, so it was never saved to the oplog.
I restored from a disk snapshot with the same result. The only thing I noticed was a large WiredTigerLAS.wt file (several GB), so I assume the missing data must be there. Is there a way to retrieve data from the WiredTigerLAS.wt file?

Thank you and sorry for the long post.

Hi @Erengroth_N_A

Nothing seems to be out of the ordinary in your setup, although I would recommend keeping the minor versions identical to prevent any unforeseen issues.

One detail that is still unclear, given the use of Docker, is how you are running the servers for Shard 1, Shard 2, and the config servers. Are they contained within one server, i.e. are you running three or more containerized mongod processes inside one physical server?

For example, are you running the three dockerized mongod for Shard 1 within one server, the three dockerized mongod for Shard 2 within another server, and the same for the config servers (i.e. three servers in total for the cluster instead of nine)?

If you have this setup, I would recommend moving to nine actual servers. Running multiple mongod processes within a single server (with or without Docker) can create heavy resource contention: if Shard 1 receives a write, that server must write it three times instead of once, and the mongod processes will compete with each other for disk and memory. This situation is far from ideal and brings me to my next point:

the only thing I noticed was a large WiredTigerLAS.wt file (several GB), so I assume the missing data must be there. Is there a way to retrieve data from the WiredTigerLAS.wt file?

In most cases, the existence of a large WiredTigerLAS.wt file indicates that the hardware (usually the disk) cannot keep up with the work it is supposed to do. The “LAS” file acts basically like swap on Linux systems and is removed upon restart. If the LAS file is especially large (in the multi-GB range), then the disk is badly behind on its writes and the server simply cannot keep up with the work given to it.
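If you want to gauge how much pressure the cache is under, here is a rough sketch using serverStatus in the mongo shell, run on the affected mongod (the statistic names come from the WiredTiger section of serverStatus):

// Rough check of WiredTiger cache pressure on a mongod node.
var c = db.serverStatus().wiredTiger.cache
printjson({
  bytesCurrentlyInCache: c["bytes currently in the cache"],
  maximumBytesConfigured: c["maximum bytes configured"],
  trackedDirtyBytes: c["tracked dirty bytes in the cache"]
})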

Checking the backups and the oplogs, I noticed the data was missing there as well, so it was never saved to the oplog.

What write concern setting are you using? If you are using a PSS setup, I would recommend always using write concern “majority”.

The “majority” write concern provides many benefits, such as:

  • An assurance that the write was written to 2 out of 3 nodes, so it will not be rolled back in case of failures. There is also much less chance of missing writes, since you have 2 copies of the data safely stored.
  • It can act as a backpressure mechanism, so you’re not swamping the servers with more writes than they can handle.
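As an illustration, a write with majority write concern in the mongo shell might look like the sketch below (the collection name and the wtimeout value are placeholders):

// Waits until a majority of the replica set members have acknowledged the write.
db.orders.insertOne(
  { item: "abc", qty: 1 },
  { writeConcern: { w: "majority", wtimeout: 5000 } }
)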

Best regards
Kevin


Thank you for the clarifications @kevinadi

All the dockerized mongod run on separate servers and indeed the write concern used is “majority”.

About the WiredTigerLAS.wt file, isn’t there a known way to retrieve the data stored within it?

Thank you!

Hi @Erengroth_N_A

About the WiredTigerLAS.wt file, isn’t there a known way to retrieve the data stored within it?

Not that I know of. The file is basically cache spill, so it’s tangential to data persistence to disk.

The WiredTiger cache contains data that is being worked on. In some cases, if the workload is too much for the hardware, the cache can grow so large that it no longer fits within the designated WT cache size. At that point, the cache contents are spilled to disk in the form of the LAS file.

One example of cache growth is when you open a long-lived cursor for some query. While this cursor is open, WiredTiger’s MVCC keeps providing the cursor with the data as it looked when the cursor started. If writes are ongoing to the data covered by the cursor, this older version of the data is kept in the cache to serve that cursor.

Writes are persisted immediately to the WiredTiger journal files (with the j:true write concern setting, which is implied by w:majority), and every 60 seconds to the data files by WiredTiger. Note that this depends on how busy or overworked the disk and the server are. If the server or disk is very busy servicing work, it may take a bit longer to persist everything. If this situation occurs often enough, it’s probably a sign that the hardware needs upgrading for the workload.

However, in a PSS setup, once your majority write is acknowledged, it should be in the journal (thus persisted) in 2 nodes. This write will not be lost due to rollbacks, crashes, or similar issues. This scenario is extensively tested in all versions and is part of the guarantees provided by majority writes.

One case where it’s possible to lose writes even with w:majority and j:true in a PSS setup is when the OS or the hardware “lied” to WiredTiger. WiredTiger instructs the hardware to persist writes and waits for the acknowledgement that this was done. If for some reason the hardware says it wrote the data but actually didn’t, that can lead to lost data.

Best regards
Kevin