Replication errors

Hi,

I have a MongoDB 4.0 replica set with two members.

    rs.conf()
    {
    	"_id" : "rs0",
    	"version" : 11,
    	"protocolVersion" : NumberLong(1),
    	"writeConcernMajorityJournalDefault" : true,
    	"members" : [
    		{
    			"_id" : 1,
    			"host" : "192.168.123.86:27017",
    			"arbiterOnly" : false,
    			"buildIndexes" : true,
    			"hidden" : false,
    			"priority" : 3,
    			"tags" : {
    				
    			},
    			"slaveDelay" : NumberLong(0),
    			"votes" : 1
    		},
    		{
    			"_id" : 2,
    			"host" : "192.168.123.87:27017",
    			"arbiterOnly" : false,
    			"buildIndexes" : true,
    			"hidden" : false,
    			"priority" : 1,
    			"tags" : {
    				
    			},
    			"slaveDelay" : NumberLong(0),
    			"votes" : 1
    		}
    	],
    	"settings" : {
    		"chainingAllowed" : true,
    		"heartbeatIntervalMillis" : 2000,
    		"heartbeatTimeoutSecs" : 10,
    		"electionTimeoutMillis" : 10000,
    		"catchUpTimeoutMillis" : 60000,
    		"catchUpTakeoverDelayMillis" : 30000,
    		"getLastErrorModes" : {
    			
    		},
    		"getLastErrorDefaults" : {
    			"w" : 1,
    			"wtimeout" : 0
    		},
    		"replicaSetId" : ObjectId("58764207c0fb84b262e464aa")
    	}
    }

After the initial synchronization, the secondary stays days behind the primary.

    rs.printSlaveReplicationInfo()
    source: 192.168.123.87:27017
    	syncedTo: Mon Jun 29 2020 22:59:56 GMT+0200 (CEST)
    	221205 secs (61.45 hrs) behind the primary 
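For context, the lag reported above can also be derived from `rs.status()`. The following is a minimal sketch (not from the original post) of a helper that computes per-secondary lag from an `rs.status()`-style document, assuming the usual `members`, `stateStr`, and `optimeDate` fields of its output; in the mongo shell it would be called as `replicationLag(rs.status())`:

```javascript
// Sketch: compute per-secondary replication lag (in seconds) from an
// rs.status()-style document by comparing each secondary's last applied
// optime against the primary's.
function replicationLag(status) {
  var primary = status.members.find(function (m) {
    return m.stateStr === "PRIMARY";
  });
  return status.members
    .filter(function (m) { return m.stateStr === "SECONDARY"; })
    .map(function (m) {
      return {
        host: m.name,
        // Date subtraction yields milliseconds; convert to seconds.
        lagSecs: (primary.optimeDate - m.optimeDate) / 1000
      };
    });
}
```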

There are timeout errors every five minutes in the logs:

-- primary log --
    2020-07-02T10:38:50.733+0200 I COMMAND  [LogicalSessionCacheRefresh] command config.$cmd command: update { update: "system.sessions", ordered: false, allowImplicitCollectionCreation: false, writeConcern: { w: "majority", wtimeout: 15000 }, $db: "config" } numYields:0 reslen:383 locks:{ Global: { acquireCount: { r: 1253, w: 1165 } }, Database: { acquireCount: { w: 1165 } }, Collection: { acquireCount: { w: 1165 } } } storage:{} protocol:op_msg 30651ms
    2020-07-02T10:38:50.743+0200 I CONTROL  [LogicalSessionCacheRefresh] Failed to refresh session cache: WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true }
-- secondary log --
    2020-07-02T10:39:32.575+0200 I NETWORK  [LogicalSessionCacheReap] Starting new replica set monitor for rs0/192.168.123.86:27017,192.168.123.87:27017
    2020-07-02T10:39:32.577+0200 I NETWORK  [LogicalSessionCacheReap] Successfully connected to 192.168.123.86:27017 (1 connections now open to 192.168.123.86:27017 with a 0 second timeout)
    2020-07-02T10:39:32.577+0200 I NETWORK  [LogicalSessionCacheRefresh] Successfully connected to 192.168.123.86:27017 (2 connections now open to 192.168.123.86:27017 with a 0 second timeout)
    2020-07-02T10:39:32.577+0200 I NETWORK  [LogicalSessionCacheRefresh] Starting new replica set monitor for rs0/192.168.123.86:27017,192.168.123.87:27017
    2020-07-02T10:39:32.577+0200 I NETWORK  [LogicalSessionCacheRefresh] Starting new replica set monitor for rs0/192.168.123.86:27017,192.168.123.87:27017
    2020-07-02T10:39:48.441+0200 I CONTROL  [LogicalSessionCacheRefresh] Failed to refresh session cache: WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true }

What can I do to synchronize the replica set?

wbr Tomaz

Hi Tomaz,

It’s been some time since you posted this question, so I’m not sure whether your replica set is still in this state. Are you still having this issue?

If yes, could you post:

  • your MongoDB version
  • output of rs.status()
  • output of rs.printReplicationInfo()
  • how long the initial sync takes

Please also describe the hardware provisioned for the two nodes.

Note that a replica set with an even number of nodes is not a recommended configuration. It is recommended to have at least three nodes for high availability. Please see Replica Set Deployment Architectures for more information.
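For example, a third voting member could be added with `rs.add()`. This is only a sketch; the host below is a placeholder, not part of your deployment:

```javascript
// Run against the primary. Placeholder host; substitute a real third node.
rs.add("192.168.123.88:27017")      // third data-bearing member
// or, if hardware is limited, an arbiter (votes but holds no data):
rs.addArb("192.168.123.88:27018")
```

An arbiter keeps the voter count odd without the storage cost of a full member, though a data-bearing third node is preferable when resources allow.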

Best regards,
Kevin

Hi Kevin,

The MongoDB version is 4.0.19 running on Ubuntu 18.04, and the errors no longer appear. I think the system was simply overloaded; the load average was almost 5. rs.status() showed that heartbeats were working, but rs.printSlaveReplicationInfo() reported the secondary as more than 60 hours behind.

Could the reason be that the total index size is larger than system memory (64 GB), so MongoDB is continuously reloading indexes?
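To check this, one could compare the total index size across databases against the configured WiredTiger cache. Below is a sketch of a small helper (the function name is mine, not from MongoDB) that sums `indexSize` over a list of per-database stats documents, as returned by `db.stats()`:

```javascript
// Sketch: given a list of per-database stats documents (each with an
// indexSize field in bytes, as db.stats() returns) and the WiredTiger
// cache size in bytes, report whether all indexes fit in cache.
function indexesFitInCache(dbStatsList, cacheBytes) {
  var totalIndexBytes = dbStatsList.reduce(function (sum, s) {
    return sum + s.indexSize;
  }, 0);
  return { totalIndexBytes: totalIndexBytes, fits: totalIndexBytes <= cacheBytes };
}
```

In the mongo shell, `dbStatsList` could be gathered with `db.adminCommand({ listDatabases: 1 }).databases.map(function (d) { return db.getSiblingDB(d.name).stats(); })`, and `cacheBytes` read from `db.serverStatus().wiredTiger.cache["maximum bytes configured"]`. Note that WiredTiger does not need all indexes in cache at once, only the working set, so this is a rough indicator rather than a hard limit.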
Thanks for your suggestions.
wbr Tomaz