Resumable Initial Sync in MongoDB 4.4

Nuno Costa • 5 min read • Published Dec 16, 2021 • Updated May 16, 2022

Introduction

Hello, everyone. My name is Nuno and I have been working with MongoDB databases for almost eight years now as a sysadmin and as a Technical Services Engineer.
One of the most common challenges in MongoDB environments is when a replica set member requires a resync and the Initial Sync process is interrupted for some reason.
Interruptions such as network partitions between the sync source and the node doing the initial sync cause the process to fail, forcing it to restart from scratch to ensure database consistency.
This was particularly problematic with large datasets: at terabyte scale, an initial sync can take up to several days.
You may have already noticed that I am talking in the past tense as this is no longer a problem you need to face. I am very happy to share with you one of the latest enhancements introduced by MongoDB in v4.4: Resumable Initial Sync.
Resumable Initial Sync now enables nodes doing initial sync to survive events like transient network errors or a sync source restart when fetching data from the sync source node.

Resumable Initial Sync

Recovering replica set members with Initial Sync on large datasets takes a long time, which exposes the process to two common challenges:
  • Falling off the oplog
  • Transient network failures
MongoDB became more resilient to these types of failures with MongoDB v3.4 by adding the ability to pull newly added oplog records during the data copy phase, and more recently with MongoDB v4.4 and the ability to resume the initial sync where it left off.

Behavioral Description

The initial sync process will restart the interrupted or failed command and keep retrying until the command succeeds, a non-resumable error occurs, or the period specified by the parameter initialSyncTransientErrorRetryPeriodSeconds passes (default: 24 hours). These restarts are constrained to use the same sync source and do not tolerate rollbacks on the sync source. That is, if the sync source experiences a rollback, the entire initial sync attempt will fail.
Resumable errors include retriable errors, for which ErrorCodes::isRetriableError returns true; these include all network errors as well as some other transient errors.
The errors ErrorCodes::NamespaceNotFound, ErrorCodes::OperationFailed, ErrorCodes::CursorNotFound, and ErrorCodes::QueryPlanKilled mean that the collection may have been dropped, renamed, or modified in a way that caused the cursor to be killed. These errors will cause ErrorCodes::InitialSyncFailure and will be treated the same as transient retriable errors (except for not killing the cursor), mark ErrorCodes::isRetriableError as true, and will allow the initial sync to resume where it left off.
On ErrorCodes::NamespaceNotFound, it will skip this entire collection and return success. Even if the collection has been renamed, simply resuming the query is sufficient since we are querying by UUID; the name change will be handled during oplog application.
All other errors are non-resumable.
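The retry behavior described above can be pictured with a small sketch. This is only an illustration, not MongoDB's implementation: the error code names come from the text, while classify(), runWithResume(), the isNetworkError flag, and the injected clock are hypothetical names introduced here. It also assumes the retry period is measured from the first failure of an outage.

```javascript
// Illustrative sketch only, not MongoDB's implementation.
// Error code names come from the text; classify(), runWithResume(),
// the isNetworkError flag, and the injected clock are hypothetical.

const RESUMABLE_CODES = new Set([
  "OperationFailed",
  "CursorNotFound",
  "QueryPlanKilled",
]);

// Decide what to do with an error raised while cloning a collection.
function classify(err) {
  if (err.code === "NamespaceNotFound") return "skip"; // collection is gone: skip it
  if (err.isNetworkError || RESUMABLE_CODES.has(err.code)) return "resume";
  return "fail"; // all other errors are non-resumable
}

// Retry `operation` until it succeeds, hits a non-resumable error,
// or the outage has lasted at least `retryPeriodSeconds`.
// `clock` returns the current time in seconds (injected so the loop is testable).
// Assumption: the period is counted from the first failure.
function runWithResume(operation, retryPeriodSeconds, clock) {
  let outageStart = null;
  for (;;) {
    try {
      return { status: "ok", value: operation() };
    } catch (err) {
      const action = classify(err);
      if (action === "skip") return { status: "skipped" };
      if (action === "fail") return { status: "failed", error: err };
      if (outageStart === null) outageStart = clock();
      if (clock() - outageStart >= retryPeriodSeconds) {
        return { status: "failed", error: err };
      }
      // A real implementation would wait before retrying.
    }
  }
}

// Usage: a clone step that fails twice with a network error, then succeeds.
let attempts = 0;
const flakyClone = () => {
  attempts += 1;
  if (attempts < 3) {
    const err = new Error("connection reset");
    err.isNetworkError = true;
    throw err;
  }
  return "cloned";
};
let fakeNow = 0;
const result = runWithResume(flakyClone, 86400, () => fakeNow++);
console.log(result.status, attempts); // prints: ok 3
```

The clock is passed in as a function so the time-out path can be exercised without waiting for real time to pass.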

Configuring Custom Retry Period

The default retry period is 24 hours (86,400 seconds). A database administrator can increase this period by setting the initialSyncTransientErrorRetryPeriodSeconds parameter to a larger value.
Note: The 24-hour value is the default period estimated for a database administrator to detect any ongoing failure and be able to act on restarting the sync source node.
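As a minimal sketch, the parameter change could be expressed as a setParameter command document. The helper name retryPeriodCommand() is hypothetical, and the sketch assumes the parameter can be changed at runtime with setParameter on the member performing the initial sync.

```javascript
// Sketch: build the setParameter command document for a longer retry period.
// retryPeriodCommand() is a hypothetical helper; the parameter name comes from the text.
function retryPeriodCommand(hours) {
  if (!Number.isInteger(hours) || hours <= 0) {
    throw new RangeError("hours must be a positive integer");
  }
  return {
    setParameter: 1,
    initialSyncTransientErrorRetryPeriodSeconds: hours * 3600,
  };
}

const command = retryPeriodCommand(48);
console.log(command.initialSyncTransientErrorRetryPeriodSeconds); // 172800

// In mongosh, connected to the member that performs the initial sync
// (assumes the parameter is settable at runtime):
//   db.adminCommand(retryPeriodCommand(48))
```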

Upgrade/Downgrade Requirements and Behaviors

The full resumable behavior will always be available between 4.4 nodes regardless of FCV (Feature Compatibility Version). Between 4.2 and 4.4 nodes, the initial sync will not be resumable during the query phase of the CollectionCloner (where we are actually reading data from collections), nor will it be resumable after a collection rename, regardless of which node is 4.4. Resuming after transient failures in other commands will be possible when the syncing node is 4.4 and the sync source is 4.2.
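These rules can be condensed into a small lookup. The helper resumeSupport() and its field names are hypothetical and only cover versions 4.2 and 4.4. The case of a 4.2 syncing node is not spelled out in the text, so treating it as not resumable is an inference.

```javascript
// Hypothetical helper that encodes the compatibility rules described above.
// Only "4.2" and "4.4" are handled; a 4.2 syncing node is assumed (inferred)
// to have no resumable behavior at all.
function resumeSupport(syncingNode, syncSource) {
  const known = ["4.2", "4.4"];
  if (!known.includes(syncingNode) || !known.includes(syncSource)) {
    throw new RangeError("only versions 4.2 and 4.4 are covered");
  }
  const both44 = syncingNode === "4.4" && syncSource === "4.4";
  return {
    collectionQueryPhase: both44,           // CollectionCloner query phase
    afterCollectionRename: both44,          // resuming after a rename
    otherCommands: syncingNode === "4.4",   // transient failures in other commands
  };
}

// A 4.4 node syncing from a 4.2 source resumes only the "other commands" case.
console.log(resumeSupport("4.4", "4.2"));
```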

Diagnosis/Debuggability

During initial sync, the sync source node can become unavailable (due to either a network failure or a process restart), and the initial sync will still be able to resume and complete.
The logs are expected to show messages for the following stages:
  • Initial Sync attempt successfully started
  • Messages caused by network failures (or sync source node restart)
  • Initial Sync is resumed after being interrupted
  • Data cloners resume
  • Data cloning phase completes successfully, and the oplog cloning phase starts
  • Initial Sync completes successfully and statistics are provided
The new InitialSync statistics from replSetGetStatus.initialSyncStatus can be useful to review the initial sync progress status.
Starting in MongoDB 4.2.1, replSetGetStatus.initialSyncStatus metrics are only available when run on a member during its initial sync (i.e., STARTUP2 state).
For each initial sync attempt in replSetGetStatus.initialSyncStatus.initialSyncAttempts, the metrics are:
  • totalTimeUnreachableMillis - The total time in milliseconds that the member has been unavailable during the current initial sync.
  • operationsRetried - Total number of all operation retry attempts.
  • rollBackId - The sync source's rollback identifier at the start of the initial sync attempt.
An example of this output is:
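As a rough illustration of how these per-attempt fields could be read, the sketch below summarizes them from a hand-written sample document. The values in sampleStatus are invented for illustration and are not real server output, a real initialSyncStatus document contains many more fields, and summarizeAttempts() is a hypothetical helper.

```javascript
// Sketch: summarize the per-attempt metrics listed above.
// sampleStatus is a hand-written stand-in with invented values, NOT real server output.
const sampleStatus = {
  initialSyncAttempts: [
    { totalTimeUnreachableMillis: 12000, operationsRetried: 7, rollBackId: 1 },
  ],
};

// Hypothetical helper: returns one summary row per initial sync attempt.
function summarizeAttempts(initialSyncStatus) {
  const list = (initialSyncStatus && initialSyncStatus.initialSyncAttempts) || [];
  return list.map((attempt, index) => ({
    attempt: index + 1,
    unreachableSeconds: attempt.totalTimeUnreachableMillis / 1000,
    operationsRetried: attempt.operationsRetried,
    rollBackId: attempt.rollBackId,
  }));
}

console.log(summarizeAttempts(sampleStatus));

// In mongosh, on a member that is in STARTUP2, the input would come from:
//   db.adminCommand({ replSetGetStatus: 1 }).initialSyncStatus
// Field types in the shell may differ from the plain numbers used here.
```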

Wrap Up

Upgrade your MongoDB database to v4.4 and take advantage of the new Resumable Initial Sync feature. Initial syncs in your deployment will now survive transient network errors or a sync source restart.
If you have questions, please head to our developer community website where the MongoDB engineers and the MongoDB community will help you build your next big idea with MongoDB.
