Hello @kevinadi and thank you for the reply.
We have managed to partially recover the lost data from other internal sources, but data from 2 collections is still lost.
Right now the clusters are working within normal parameters, but any advice or clues about what happened, and whether we can still recover any additional data, are more than welcome.
-
The MongoDB versions used are v4.0.6 and v4.0.20 (v4.0.20 runs only on the second mongos; every other node runs v4.0.6).
-
The topology is as follows:
2 load balancers that each point to a different mongos, for failover and to avoid cursor errors (a cursor opened through one mongos cannot be continued through another).
1 config server replica set (PSS)
1 replica set with sharded DBs, named rs1 (PSS)
1 replica set with sharded DBs, named rs2 (PSS)
The exact diagram can be found here, the only difference being that we use 2 load balancers, each pointing to a single mongos:
https://www.percona.com/blog/wp-content/uploads/2017/10/Load-Balanced.png
The sharding status is:
--- Sharding Status ---
  sharding version: {
    "_id" : 1,
    "minCompatibleVersion" : 5,
    "currentVersion" : 6,
    "clusterId" : ObjectId("5c76ce009da1de19549a8f08")
  }
  shards:
    { "_id" : "rs1", "host" : "rs1/10.42.100.198:27017,10.42.119.126:27017,10.42.130.8:27017", "state" : 1 }
    { "_id" : "rs2", "host" : "rs2/10.42.129.166:27017,10.42.235.199:27017,10.42.253.76:27017", "state" : 1 }
  active mongoses:
    "4.0.20" : 1
    "4.0.6" : 1
  autosplit:
    Currently enabled: yes
  balancer:
    Currently enabled: yes
The startup command for the mongo nodes is: mongod --keyFile /pathto/MONGODB_KEYFILE --replSet RS_NAME --shardsvr --dbpath /data/db --port 27017 --bind_ip_all
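For reference, the effective startup options of any running node can be double-checked from the mongo shell (a minimal sketch; this uses only the standard getCmdLineOpts admin command, nothing cluster-specific is assumed):

db.adminCommand({ getCmdLineOpts: 1 })   // prints argv and the parsed configuration of the running mongod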
Right now the cluster is working normally, so I do not think any replica set status or config outputs are relevant, but if needed I can post them later in the thread.
-
The Docker containers used for the MongoDB cluster are all bound to specific IPs and set to restart always.
The Docker and OS configuration is the default; no tuning has been necessary so far, given the hardware capabilities of the servers.
The OS is Ubuntu 16.04 LTS (quite old, but an upgrade is planned).
-
It all started with a network communication error, which led to one of the nodes in the replica set being promoted to primary. From the moment that node was promoted, the data was not being saved to disk.
We have internal log monitoring, but at that point we did not have any alerts set for replication errors. We noticed an increase in response time in one of our apps, and when we started digging we found the following errors in the MongoDB logs:
W SHARDING [replSetDistLockPinger] pinging failed for distributed lock pinger :: caused by :: WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true }
I CONTROL [LogicalSessionCacheRefresh] Failed to refresh session cache: WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true } at rs1
I CONTROL [LogicalSessionCacheRefresh] Failed to refresh session cache: WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true } at rs1
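For reference, this is roughly how we now check replication lag when these wtimeout errors appear (a sketch run in the mongo shell against the affected replica set; rs.printSlaveReplicationInfo() is the helper name in the 4.0 shell):

rs.printSlaveReplicationInfo()   // seconds each secondary is behind the primary
rs.status().members.forEach(function (m) {
    print(m.name + "  " + m.stateStr + "  last optime: " + m.optimeDate);   // per-member state and last applied optime
});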
All the errors were on the rs1 replica set; rs2 was working normally. Since we have multiple backups (a full dump once a day, incremental backups every 15 minutes, and disk snapshots),
I decided to promote another node to primary and add the missing data back afterwards. After the promotion command was issued, the container restarted unexpectedly. The node in question was promoted to primary, but it was missing the data written since the previous node had been promoted.
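For clarity, the promotion was done along the lines of the standard priority-reconfig approach (a sketch; the member index is illustrative, not the exact command history):

cfg = rs.conf()
cfg.members[1].priority = 10   // index of the node to promote (illustrative)
rs.reconfig(cfg)               // add { force: true } only if no primary is reachable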
Checking the backups and the oplogs, I noticed the data was missing there as well, so it was never written to the oplog.
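The oplog check itself was just a direct query of the local database, roughly like this (a sketch; the namespace is a placeholder):

db.getSiblingDB("local").oplog.rs.find({ ns: "mydb.mycollection" }).sort({ $natural: -1 }).limit(5)   // newest entries first; the missing writes never appear here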
I restored the data from a disk snapshot, with the same result. The only thing I noticed was a large WiredTigerLAS.wt file (several GB), so I assume the missing data must be in there. Is there a way to retrieve data from the WiredTigerLAS.wt file?
Thank you and sorry for the long post.