How to avoid frequent failovers

Gowtham_Raj_Elangova · April 9, 2020, 5:08pm

Hello all,

When deploying a MongoDB 4.2.0 replica set (let's say a 3-node cluster) onto on-prem VMs, occasional network glitches are quite possible. I faced a situation in which we had frequent failovers due to such glitches.

I checked the online docs to see how we can make MongoDB tolerant of network glitches.
I found that we have to disable “enableElectionHandoff”. Once we do that, MongoDB respects
“settings.electionTimeoutMillis” (default 10 seconds).

Let's say node A goes down; it then takes 10 seconds to decide who must be the next primary. After those 10 seconds, let's say node B holds an election and becomes primary. So disabling “enableElectionHandoff” works well.

Now take a situation where node A suffers a network glitch, is not visible to node B and node C, and comes back online after 5 seconds. I expect node A to become primary again automatically. What actually happens is that node A rejoins and all 3 nodes are in secondary mode. At the end of the 10 seconds, node B becomes primary, not node A.

My initial assumption was that I had made MongoDB tolerant of network failures, but that does not seem to be the case. Either way, with or without “enableElectionHandoff”, we end up in a situation where node B becomes primary and we have to fail back manually.

How do we deal with this? How do we make MongoDB tolerant of network errors?
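
For the glitch scenario described above, one documented knob is “settings.electionTimeoutMillis” itself: per the MongoDB documentation, higher values make failover slower but reduce sensitivity to network slowness or spottiness. The following is only a minimal Python sketch of preparing such a change, assuming the replica set config has already been fetched as a plain dict. The helper name with_election_timeout, the set name rs0, the host names, and the 30-second value are all hypothetical and chosen purely for illustration.

```python
import copy

# Documented default for settings.electionTimeoutMillis (10 seconds).
DEFAULT_ELECTION_TIMEOUT_MS = 10_000


def with_election_timeout(config: dict, timeout_ms: int) -> dict:
    """Return a copy of a replica set config with a new election timeout.

    The version is bumped because the raw replSetReconfig command expects
    a version higher than the one currently in use.
    """
    if timeout_ms <= 0:
        raise ValueError("timeout_ms must be positive")
    new_config = copy.deepcopy(config)  # never mutate the caller's config
    new_config.setdefault("settings", {})["electionTimeoutMillis"] = timeout_ms
    new_config["version"] = config["version"] + 1
    return new_config


# Hypothetical 3-node config, shaped like the document rs.conf() returns.
example_config = {
    "_id": "rs0",
    "version": 1,
    "members": [
        {"_id": 0, "host": "nodeA:27017"},
        {"_id": 1, "host": "nodeB:27017"},
        {"_id": 2, "host": "nodeC:27017"},
    ],
    "settings": {"electionTimeoutMillis": DEFAULT_ELECTION_TIMEOUT_MS},
}

# Assumption: 30 seconds is just an example value, not a recommendation.
tolerant_config = with_election_timeout(example_config, 30_000)
print(tolerant_config["settings"]["electionTimeoutMillis"])

# Applying it would need a live primary, e.g. with pymongo (not run here):
#   cfg = client.admin.command("replSetGetConfig")["config"]
#   client.admin.command("replSetReconfig", with_election_timeout(cfg, 30_000))
```

The trade-off is that a genuinely failed primary is also detected later, so the value has to be balanced against how long the application can go without a primary.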

chris · April 9, 2020, 6:04pm

If you have a preference for which node becomes primary, use priority in the replSet config.
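
As a rough illustration of the priority suggestion, here is a minimal Python sketch that sets members[n].priority on a replica set config held as a plain dict. The helper name with_priorities, the set name rs0, and the host names are hypothetical; the only facts assumed about MongoDB are that each member document has a "host" and an optional "priority" field, and that priority is a number between 0 and 1000.

```python
import copy


def with_priorities(config: dict, priorities: dict) -> dict:
    """Return a copy of a replica set config with updated member priorities.

    `priorities` maps a member's host string to its new priority.
    """
    known_hosts = {member["host"] for member in config["members"]}
    unknown = set(priorities) - known_hosts
    if unknown:
        raise ValueError(f"hosts not in config: {sorted(unknown)}")

    new_config = copy.deepcopy(config)  # leave the original untouched
    for member in new_config["members"]:
        if member["host"] in priorities:
            priority = priorities[member["host"]]
            if not 0 <= priority <= 1000:
                raise ValueError("priority must be between 0 and 1000")
            member["priority"] = priority
    # A reconfig needs a higher version than the current config.
    new_config["version"] = config["version"] + 1
    return new_config


# Hypothetical 3-node config; every member starts at the default priority 1.
example_config = {
    "_id": "rs0",
    "version": 1,
    "members": [
        {"_id": 0, "host": "nodeA:27017", "priority": 1},
        {"_id": 1, "host": "nodeB:27017", "priority": 1},
        {"_id": 2, "host": "nodeC:27017", "priority": 1},
    ],
}

# Prefer node A as primary by giving it the highest priority.
preferred = with_priorities(example_config, {"nodeA:27017": 2})
print([(m["host"], m["priority"]) for m in preferred["members"]])

# Applying it would need a live primary, e.g. with pymongo (not run here):
#   client.admin.command("replSetReconfig", preferred)
```

Priority expresses a preference rather than a guarantee: a lower-priority member can still win an election, after which the higher-priority member may call a new election to take over.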

Gowtham_Raj_Elangova · April 14, 2020, 12:12pm

Even if I use priority, there are cases where a lower-priority member becomes primary for some time (let's say 2 minutes), after which an election happens again and the member with the higher priority takes over.