How do I migrate a large db from zlib to zstd?

My testing shows that zstd will give me better compression as well as faster database access times, so I’d like to upgrade to 4.2+ and migrate to the new compression algorithm.

The server is an unsharded 3-member replica set running v3.6.8 CE. The database is ~4.5TB on disk, about half of which is indexes; the uncompressed data size is ~16TB. The hardware probably doesn’t matter too much, but the primary is a 96-core EPYC server with 512GB of memory and NVMe drives on software RAID.

I understand the incremental upgrade process from v3.6 to v4.2+, but what is the process to change the database compression while minimizing downtime? The database is far too large for a simple dump/restore. Resyncing a replica from scratch is extremely slow and the oplog isn’t nearly big enough: the current oplog size of 150GB only covers ~7 hours, while a full initial sync seems like it would take days if not weeks.
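For reference, the ~7 hour figure is just what the standard shell helper reports, along these lines (the host name here is a placeholder):

$ mongo --quiet --host <primary-host> --eval "rs.printReplicationInfo()"
# prints the configured oplog size and the time span ("log length start to end") it currently covers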

I’m guessing that this is going to involve some combination of speeding up the replica sync and increasing the oplog. I would appreciate any thoughts or suggestions.

I don’t think this should matter on v3.6; see the quote below. If it does, that should worry you, as this is the exact same process you’d need to follow if a member completely dies and needs to be sync’d from scratch.

Changed in version 3.4: Initial sync pulls newly added oplog records during the data copy. Ensure that the target member has enough disk space in the local database to temporarily store these oplog records for the duration of this data copy stage.

If the oplog does indeed limit you, then yes, increase it.
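With WiredTiger on 3.6+ the oplog can be resized online with replSetResizeOplog; a minimal sketch (size is in MB, the value here is only illustrative, and it has to be run against each member in turn since the setting is per node):

$ mongo --host <member-host> --eval "db.adminCommand({replSetResizeOplog: 1, size: 250000})"
# ~244GB; repeat on every member you want resized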

Edit:

A resync from scratch would be the approach. If you go to 4.4 you can specify initialSyncSourceReadPreference, which may improve initial sync performance if load on the primary is impacting your current throughput.
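It is a startup parameter, so you would pass it when starting the member that is resyncing, something like the following (secondaryPreferred is just one of the allowed values):

$ mongod --config /etc/mongod.conf --setParameter initialSyncSourceReadPreference=secondaryPreferred
# can also go under the setParameter section of the config file; it is only read at startup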

Probably time to set up sharding, and/or clean up some indexes.

Hi Chris - thanks for taking the time to look at this question.

A quick clarification: I’m not too concerned about the undersized oplog in my daily operations because I use LVM snapshots and I can easily bring up a new member well within the current oplog window. Snapshot + rsync is easily an order of magnitude (if not two) faster than an initial sync. I know there are other reasons to have a right-sized oplog, but this is working for now.
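The seeding itself is nothing fancy, roughly this shape (the volume group, size and paths here are placeholders, not my real ones):

$ lvcreate --snapshot --name mongo-snap --size 200G /dev/vg_data/mongo    # point-in-time snapshot of the dbPath volume
$ mount -o ro /dev/vg_data/mongo-snap /mnt/mongo-snap
$ rsync -a /mnt/mongo-snap/ new-member:/var/lib/mongodb/                  # copy the data files, then start mongod on the new member
$ umount /mnt/mongo-snap && lvremove -f /dev/vg_data/mongo-snap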

I’d prefer not to shard yet, and the indexes are what they are; I’ve had to make tradeoffs for application efficiency, as any DB admin has, I suppose.

Can you elaborate on how an initial sync gets me to zstd from zlib? It is my understanding that the initial sync is going to use the same creation scheme as the source, but maybe there is a way to change that?

If a new initial sync is the right way to get to zstd, then how can I speed it up (and I mean drastically)? Setting the read preference isn’t going to make that much difference and at the current rate I’d need a HUGE oplog, maybe even bigger than my database.

Edit: I just realized that perhaps I need to create another database and copy the old zlib collections into new zstd ones. That will require a fair bit of downtime, though.
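If I go that route, I assume I’d force zstd on the new collections with the per-collection storageEngine override and then copy documents over; a rough sketch (database and collection names are made up, and a naive document-at-a-time loop like this would need batching and quiesced writes in practice):

$ mongo --host <primary-host> --eval '
  var src = db.getSiblingDB("olddb").events;              // existing zlib collection (hypothetical name)
  var dst = db.getSiblingDB("newdb");
  dst.createCollection("events", {
    storageEngine: { wiredTiger: { configString: "block_compressor=zstd" } }
  });
  src.find().forEach(function (doc) { dst.events.insertOne(doc); });   // indexes would still need to be rebuilt separately
'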

Each node’s storage engine determines its own compression, so the instance you are re-initializing should have the desired compressor configured before the sync starts.

Stop the node you want to update, remove its data files, add your desired compressor configuration, and start the node back up; it will then perform an initial sync.
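Concretely, the compressor setting is the blockCompressor option in that node’s mongod.conf (or the equivalent --wiredTigerCollectionBlockCompressor command-line option), for example:

storage:
  wiredTiger:
    collectionConfig:
      blockCompressor: zstd

Since the data files were removed first, every collection rebuilt during the initial sync picks up the new compressor, regardless of what the sync source uses.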

With a set of sample data you can observe the difference on the third node, which has the zstd block compressor enabled.

All of these started as brand new replica set members; the sizes below were taken after changing the 3rd node’s block compressor and re-syncing it.

$ for l in a b c ; do mongo --quiet --host mongo-0-${l} --eval "db.getMongo().setSecondaryOk(); db.adminCommand('listDatabases').totalSize" ; done
159854592
159002624
107610112

The manual says the oplog copy starts when the initial sync starts, so I don’t think that will be an issue.
But this is a lot of data to scan from compressed blocks and transmit. I don’t have any tricks to speed this up.
