Removing and re-adding a replica set member

Hi all,

I’m new to MongoDB and I have a question about replica sets. At the moment, I have 2 sites and 4 MongoDB databases in a replica set: one primary and one secondary in site A, and one secondary and one observer in site B.

For reasons, I have to break the comms between the sites and make the secondary in site B act as a primary so that I can write data to it. I will have to rejoin the sites and re-establish the replica set later.

My question is: if I remove a member (the secondary in site B) from the replica set, break the comms between site A and B, write data to that secondary database in site B, and then re-add the member to the replica set so that it replicates from the primary database in site A, would there be any data corruption?

Thanks in advance

Welcome to the community @Oscar_A!

All members of a replica set share a common write history.

You can remove a member from a replica set and use that in standalone mode or as the seed for another replica set, but that creates a divergent write history. You cannot re-add that member to the original replica set without re-syncing or rolling back the writes (which would undo your goal of writing data to this member).

If you want to merge data back into a single replica set, you will have to back up and restore the relevant data. Merging may be difficult if you have updated the same collections (or documents!) in different deployments.

Regards,
Stennie

Hi Stennie,

Thanks for your reply.

Yes the problem is the primary and one secondary will continue to work in site A, data being written to the primary. Then in site B after the link to site A is broken, data will also be written to the ex-secondary-now-primary database.

However, data written to site B primary is of zero importance because I am doing this only for testing purposes for site B.

So if I did the above and re-add that primary from site B to the original replica set, would it automatically be synced with the original primary from site A?

Thanks

No, as per my earlier comment you’ve now created two replica sets that happen to have the same name but have diverged in history. Any independent writes on site B cannot be automatically reconciled with site A. There can only be a single primary (and timeline of write history) for a given replica set.

If you re-added members from B to replica set A and they still had an oplog entry in common, the members rejoining as a secondary would attempt to rollback to the common point (before history diverged) to make the members consistent. Documents that are rolled back are exported to BSON for manual reconciliation.

If rollback isn’t possible, the members would have to be re-synced.
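To make the common-point idea concrete, here is a toy JavaScript model. It is not MongoDB's actual rollback implementation: it assumes each oplog is a plain array of entries, oldest first, with a made-up unique `id` field. `findCommonPoint` and the sample histories are hypothetical names used only for illustration.

```javascript
// Toy model only: real oplog entries have more fields and the real
// rollback algorithm is more involved. "id" is a made-up unique key.
function findCommonPoint(primaryOplog, memberOplog) {
  const primaryIds = new Set(primaryOplog.map((entry) => entry.id));
  // Scan the rejoining member's oplog from newest to oldest.
  for (let i = memberOplog.length - 1; i >= 0; i--) {
    if (primaryIds.has(memberOplog[i].id)) {
      return {
        commonId: memberOplog[i].id,
        // Everything the member wrote after the common point.
        toRollBack: memberOplog.slice(i + 1).map((entry) => entry.id),
      };
    }
  }
  // No shared entry: rollback is not possible, so a re-sync is needed.
  return null;
}

// Hypothetical histories: w1 and w2 were replicated before the split.
const primaryA = [{ id: "w1" }, { id: "w2" }, { id: "a3" }, { id: "a4" }];
const memberB = [{ id: "w1" }, { id: "w2" }, { id: "b3" }];
console.log(findCommonPoint(primaryA, memberB));
// -> { commonId: 'w2', toRollBack: [ 'b3' ] }

// Oplogs are capped, so old entries on the primary eventually age out.
const truncatedA = [{ id: "a3" }, { id: "a4" }];
console.log(findCommonPoint(truncatedA, memberB));
// -> null
```

The second call illustrates why the outcome depends on how long the members were apart: once the primary's oplog no longer contains any entry the rejoining member also has, only a re-sync remains.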

If you want to merge data, you need to identify the changed data and use mongodump and mongorestore to load the relevant data into a single deployment.

Regards,
Stennie

Thanks Stennie,

" If you re-added members from B to replica set A and they still had an oplog entry in common, the members rejoining as a secondary would attempt to rollback to the common point (before history diverged) to make the members consistent."

Does this process occur independently? Or does it need human interaction?

“Documents that are rolled back are exported to BSON for manual reconciliation.”

I’m confused. Which documents? Are you talking about the documents written to the fake PRIMARY node or the ones in the original PRIMARY node?

“If rollback isn’t possible, the members would have to be re-synced.”

Automatically or via human interaction?

Thanks

@Oscar_A To be clear, I would not recommend the approach you are trying to take (splitting and attempting to merge two versions of a replica set). If you want to preserve all writes, manual effort will be required. Replica sets intentionally only allow a single primary and history of changes: breaking comms between members of a replica set does not change that fundamental requirement.

Instead of merging members with different replica set histories, I would use mongodump and mongorestore to back up and restore the relevant data into a single replica set (assuming you have some way of identifying which data has been modified).

The rollback process (which returns a replica set member to a consistent state) is attempted automatically, but the recovery of conflicting documents (which get exported as BSON files to a rollback directory) requires human intervention.

Rollback is generally not desirable, especially if your application is expecting that data that has been written will not be reverted. For most use cases the goal is to Avoid Replica Set Rollbacks.

A simplified example:

  • Imagine you have updated fields in the same document on both replica set A and B (which originally were a single replica set).
  • You re-add memberB from replica set B to A.
  • Assume the primary remains a member from replica set A. NOTE: depending on how you reconfigured your replica sets and how their histories diverged, replica set B could have a newer replica set version and trigger A to rollback instead.
  • If memberB no longer has an oplog entry in common with primaryA, it will be stuck in RECOVERING state and require manual intervention (re-sync).
  • If memberB still has an oplog entry in common with primaryA, the rollback process will attempt to export all documents that have changed on memberB since that common point. These are the versions of the documents that are written to BSON rollback files.
  • MemberB will fetch the current version of documents that were rolled back from primaryA.

If the rollback process is successful, you then have to figure out what to do with the BSON files in the rollback data directory. You will have the current version of the document in replica set A and the version exported from replica set B to a BSON file, but there may have been conflicting updates in the two different replica sets.

Reconciling the rollback files will be a manual process.
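As a rough illustration of what that manual reconciliation involves, the sketch below compares one rolled-back document with the current version of the same document. It assumes both have already been decoded into plain JavaScript objects (for example after converting a rollback file to JSON with bsondump). `conflictingFields` and the sample documents are hypothetical.

```javascript
// Returns the top-level fields whose values differ between the document
// currently in the replica set and the rolled-back version.
// Values are compared via JSON.stringify, so nested objects with the same
// content but a different key order would be reported as different.
function conflictingFields(currentDoc, rolledBackDoc) {
  const keys = new Set([
    ...Object.keys(currentDoc),
    ...Object.keys(rolledBackDoc),
  ]);
  const conflicts = {};
  for (const key of keys) {
    const current = JSON.stringify(currentDoc[key]);
    const rolledBack = JSON.stringify(rolledBackDoc[key]);
    if (current !== rolledBack) {
      conflicts[key] = {
        current: currentDoc[key],
        rolledBack: rolledBackDoc[key],
      };
    }
  }
  return conflicts;
}

// Made-up example: the same document was changed on both sides.
const currentOnA = { _id: 1, status: "shipped", qty: 5 };
const exportedFromB = { _id: 1, status: "cancelled", qty: 5, note: "test" };
console.log(conflictingFields(currentOnA, exportedFromB));
```

A report like this only shows where the two versions disagree. Deciding which value should win is still a per-document judgment call.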

I’m referring to the BSON files exported in the rollback directory.

Re-sync requires human interaction. You need to decide which approach you want to use to Resync a member of a replica set. Since re-sync also involves removing the existing data, you would not want this to happen automatically.

Regards,
Stennie

Stennie, again thanks for your reply.

I want to take this approach only to test something. Yes I want to re-add the fake Primary node into the original Replica Set and I don’t care about the data being written to it. I don’t have to retain what was written to the fake Primary node.

I have also been doing some testing in a Google Cloud environment. I have set up 4 VMs, each with 1 MongoDB instance running. I have set up a Replica Set with them, using VM1 as the Primary by making it priority 4 and the others 0.

I isolated VM3 from the network, simulating an unavailable node. My plan was to somehow make it a standalone Primary and write data to it as part of testing. When testing is done, I was going to add it back to the original Replica Set and see if it syncs from the original primary node (or manually re-sync it, it doesn’t matter), discarding the data written while it was a Primary. I don’t care what happens to the data written to it when it was out of the Replica Set.

Well, I was not successful in doing so because I don’t know how to manually make it Primary or writable, assuming there is a way to do this. I’m stuck.

Are you writing ephemeral data (like a cache)? It is otherwise unusual to want write availability without caring about saving the writes.

If you’re fine discarding all writes, the safest path would be to:

  • Restart the isolated member as a standalone (with the replSetName commented out in the config file).
  • Write your data. Any writes in standalone mode will not be written to the oplog, so this violates consistency with the original replica set and the member should not be directly re-added to the original replica set.
  • Before rejoining the replica set, move (or remove) the contents of the dbPath and re-sync this secondary.
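Before writing anything in the second step, it is worth confirming that the member really did come up as a standalone. A minimal sketch, assuming a MongoDB version that supports the `hello` command (older shells expose similar fields through `db.isMaster()`, with `ismaster` in place of `isWritablePrimary`). `describeRole` is a hypothetical helper, and the sample responses are hand-written and heavily trimmed.

```javascript
// Classifies a node from the fields of a hello response.
// A standalone reports no setName; replica set members do.
function describeRole(hello) {
  if (!hello.setName) {
    return hello.isWritablePrimary ? "standalone" : "unknown";
  }
  if (hello.isWritablePrimary) {
    return "primary of " + hello.setName;
  }
  if (hello.secondary) {
    return "secondary of " + hello.setName;
  }
  // For example a member in RECOVERING state, or an arbiter.
  return "other member of " + hello.setName;
}

// Hand-written sample responses (real ones contain many more fields).
const standaloneReply = { isWritablePrimary: true };
const secondaryReply = {
  setName: "REPSET01",
  isWritablePrimary: false,
  secondary: true,
};
console.log(describeRole(standaloneReply)); // -> standalone
console.log(describeRole(secondaryReply)); // -> secondary of REPSET01

// In mongosh, connected to the member (not run here):
//   describeRole(db.hello())
```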

I would be very cautious messing about with trying to create two versions of the same replica set with different primaries as this may result in unexpected rollbacks or consistency problems.

Regards,
Stennie

Hi Stennie,

I am very cautious about unexpected rollbacks and/or consistency issues too, which is why I have to do this test in the first place. Due to poor infrastructure planning by the other team in the production environment, this kind of scenario has to happen for a test I am going to run. Site 2 needs to be able to continue without the availability of Site 1, but because the MongoDB databases in Site 1 and Site 2 make up a single Replica Set, I need to prove that Site 2 can live on without Site 1, and the only way to do this, at this stage, is to force the Site 2 Secondary to become Primary.

Then of course it needs to be proven that when Site 1 is available again, the Replica Set with both Site 1 and 2 machines will continue on without any adverse effects.

“Are you writing ephemeral data (like a cache)? It is otherwise unusual to want write availability without caring about saving the writes.”

No I need to be able to write into a database in the fake Primary. This is to test a feasibility of something…

" Restart the isolated member as a standalone (with the replSetName commented out in the config file)."

In my test environment, I checked in /etc/mongod.conf and there is nothing in #replication section, in Primary and Secondary node config files. I don’t understand because replica set is established and I have been writing data to it.

"Write your data. Any writes in standalone mode will not be written to the oplog, so this violates consistency with the original replica set and the member should not be directly re-added to the original replica set."

I need to be able to write to the existing database in the fake Primary. So if I understand you correctly, I would have to first change the dbPath before running mongod and copy the database into that different dbPath and run the standalone from it?

Can you elaborate on the scenario you are trying to test? I’m not clear on the goal, since you mention you are OK with discarding any writes to Site 2 (which seems contradictory to maintaining write availability).

In a disaster scenario where Site 1 is fully unavailable, you have the manual administrative option of force reconfiguring Site 2 to accept writes.

However, this is not intended to allow you to create two versions of the same replica set with different primaries and continuing writing to both.
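For reference, a force reconfiguration boils down to taking the current config, keeping only the members that are still reachable, and passing the result to `rs.reconfig()` with `force: true`. The helper below sketches only the first part as a pure function. `keepSurvivors` and the host names are hypothetical; the mongosh calls are shown as comments because they need a live deployment.

```javascript
// Returns a copy of the replica set config that contains only the
// members whose host is listed in survivingHosts. The input is not modified.
function keepSurvivors(cfg, survivingHosts) {
  const keep = new Set(survivingHosts);
  return {
    ...cfg,
    members: cfg.members.filter((member) => keep.has(member.host)),
  };
}

// Made-up config with two members per site.
const currentCfg = {
  _id: "REPSET01",
  members: [
    { _id: 0, host: "vm1.example.net:27017" },
    { _id: 1, host: "vm2.example.net:27017" },
    { _id: 2, host: "vm3.example.net:27017" },
    { _id: 3, host: "vm4.example.net:27017" },
  ],
};
const site2Only = keepSurvivors(currentCfg, [
  "vm3.example.net:27017",
  "vm4.example.net:27017",
]);
console.log(site2Only.members.map((member) => member.host));

// In mongosh on a surviving member (not run here):
//   const cfg = keepSurvivors(rs.conf(), ["vm3.example.net:27017", "vm4.example.net:27017"]);
//   rs.reconfig(cfg, { force: true });
```

Note that a member with priority 0 can never be elected primary, so in a setup where every member except the original primary has priority 0, the surviving members would also need a non-zero priority in the forced config.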

If you want to enable automatic failover from Site 1 to Site 2, your replica set config should have:

  • higher priority for members in Site 1 so they will be elected if available
  • a voting member in a third data centre to allow Site 2 to form a voting majority and elect a primary if all eligible members in Site 1 are unavailable.

See: Replica sets distributed across two or more data centres.
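A small sketch of why the third-site voter matters. The config below has the general shape accepted by `rs.initiate()`, with higher priority in Site 1 and an arbiter in a third location; all host names are made up. `canElectPrimary` is a hypothetical helper that models only the two basic rules (a strict majority of votes, and at least one reachable member that is eligible to become primary), not the full election protocol.

```javascript
const twoSitesPlusArbiter = {
  _id: "REPSET01",
  members: [
    { _id: 0, host: "site1-a.example.net:27017", priority: 2 },
    { _id: 1, host: "site1-b.example.net:27017", priority: 2 },
    { _id: 2, host: "site2-a.example.net:27017", priority: 1 },
    { _id: 3, host: "site2-b.example.net:27017", priority: 1 },
    { _id: 4, host: "site3-arb.example.net:27017", arbiterOnly: true },
  ],
};

// Members default to 1 vote and priority 1 when the fields are omitted.
function votesOf(member) {
  return member.votes === undefined ? 1 : member.votes;
}

function canBePrimary(member) {
  const priority = member.priority === undefined ? 1 : member.priority;
  return !member.arbiterOnly && priority > 0;
}

// True if the reachable members hold a strict majority of all votes
// and at least one of them is eligible to become primary.
function canElectPrimary(cfg, reachableHosts) {
  const reachable = cfg.members.filter((m) => reachableHosts.includes(m.host));
  const totalVotes = cfg.members.reduce((sum, m) => sum + votesOf(m), 0);
  const reachableVotes = reachable.reduce((sum, m) => sum + votesOf(m), 0);
  return reachableVotes > totalVotes / 2 && reachable.some(canBePrimary);
}

const site2 = ["site2-a.example.net:27017", "site2-b.example.net:27017"];
// Site 1 down, arbiter reachable: 3 of 5 votes.
console.log(canElectPrimary(twoSitesPlusArbiter, [...site2, "site3-arb.example.net:27017"])); // -> true
// Site 1 down and arbiter unreachable: 2 of 5 votes.
console.log(canElectPrimary(twoSitesPlusArbiter, site2)); // -> false
```

With only four voting members split two and two, neither site can reach a majority on its own, which is why an isolated Site 2 member stays a secondary.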

If you allow continued writes to Site 1 and Site 2, you will have a fundamental challenge on how to reconcile writes when both sites are available (with options as described earlier in this thread).

Regards,
Stennie


Even if you manage to force reconfigure your B site, is it worth the mess you then have to sort out when site A comes back?

I would write this up as a TL;DR to the powers that be.

We shouldn’t do this, it is a really bad idea. We should build this properly and add a third site for automatic HA/Recovery.

If we are forced to do this there is a high probability of prolonged downtime and/or data loss.

There is enough documentation and white papers to back this up.

I have just run a test in my Google Cloud environment.

4 VMs, running one MongoDB instance each. I have a replica set of these 4 MongoDB instances. VM1 is currently the PRIMARY node, and VM2, 3 and 4 are SECONDARY. The replica set REPSET01 has one collection called RepCollection, to which I wrote test documents. The documents are replicated across all nodes.

I isolated VM3 by adding some iptables rules on VM3 itself to drop all traffic from VM1, 2 and 4. Then I wrote data to RepCollection via the VM1 PRIMARY> node. I checked RepCollection on VM2 and VM4, and the data was replicated instantly.

Then on VM3 I restarted mongod without the --replSet option. I connected to it and selected the RepCollection (I believe it became standalone as the prompt became > instead of SECONDARY>). I queried the contents with db.RepCollection.find().pretty() and saw that all the documents replicated before it lost comms to the replica set were there, but not the data written to RepCollection after I isolated it, which makes sense.

I inserted a document into RepCollection on the isolated VM3, checked it, and it is there. Then I stopped the mongod service. Then I stopped the iptables service, wiping the temporary rules, so VM3 regained comms with the other VMs and thus the replica set. I started mongod with mongod --replSet "REPSET01" -f /etc/mongod.conf

I connected to it and saw it became SECONDARY> again. Then I selected the RepCollection and queried the contents. As I expected, the data written to RepCollection via VM1 after VM3 was isolated is now replicated to the VM3 node, BUT to my surprise, VM3’s RepCollection ALSO retained the document written to it while it was isolated. So as far as I can understand, doing this test definitely creates discrepancies. Is there a way to manually force the VM3 node to be the same as every other node in the Replica Set?

@Stennie

To elaborate, I have Site 1 and Site 2. Site 1 has 2 VMs running one MongoDB instance each. Site 2 has 2 VMs running one MongoDB instance each. VM1, VM2 (Site 1) and VM3, VM4 (Site 2) make up the Replica Set. Site 1 is running an application suite, let’s call it QueryApp1. QueryApp1 is live and it is reading and writing data to the PRIMARY node of the Replica Set, which is on VM1. Site 2 also has an identical application suite, let’s call it QueryApp2, which is NOT live at the moment. Site 2’s QueryApp2 has not undergone a complete test and I need to do it now.

For reasons beyond my own imagination, someone has decided to go live with QueryApp1, using the replica set. Now I have to test QueryApp2 and I cannot write anything to the PRIMARY, but QueryApp2 still needs a PRIMARY to read and write data to as part of the testing. What I am trying to do is a full functionality test on QueryApp2 without touching the actual PRIMARY.

So because the read/write activity on QueryApp2 is only for testing, the written data can be discarded when testing is completed which is why I mentioned it does not need to be kept.

I understand that what I have tried to do, with asking all these stupid questions to you guys, is NOT advisable, but I still need to find a way to test QueryApp2. Is there an alternative way to achieve this?

@Stennie and @Stennie

I manually resynced the VM3 by following the mongodb manual: stop mongodb instance on VM3, delete everything in the dbPath and start mongodb instance with replSet option.

This not only re-added the VM3 as a SECONDARY member into the Replica Set but also synced consistent data across all nodes.

This may or may not work in my environment due to the sizes of the databases - it may take a while. Your thoughts?