How to make mongorestore restore all the data?

Howdy!
I’m copying the contents of an old mongodb 3.2.11 running on a Google Compute Engine (GCE) VM to a fresh installation of mongodb 4.4 on a new GCE VM.

Creating a new VM lets us revisit VM parameters, test the server before switching over, and leave behind unknown state on the old VM.

The Mongo docs don’t promise that an archive dumped from one mongo release can be restored into a newer mongo release. They do say to use the release of mongodump that goes with the source mongodb and the release of mongorestore that goes with the destination mongodb.
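
For what it's worth, a quick sanity check that the right tool versions are in play on each VM (just the standard --version flags):

    mongodump --version     # run on the source VM; should be the tools that go with the 3.2.x server
    mongorestore --version  # run on the destination VM; should be the tools installed alongside 4.4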

What I did so far:

  • Created the new GCE VM with Debian 10 and installed mongo per
    https://docs.mongodb.com/manual/tutorial/install-mongodb-on-debian/
  • Ran mongodump on the source VM:
    time mongodump --gzip --archive=mongo-$(date +"%Y-%m-%d").archive.gz
    It didn’t like the --repair and --oplog options so I skipped those.
    This created a 9.0GB file in close to 5 hours.
  • Copied the archive to the destination VM (a quick integrity check on the copied file is sketched right after this list).
  • Ran mongorestore on the destination VM:
    time mongorestore --objcheck --drop --maintainInsertionOrder --gzip --archive=mongo-2021-04-13.archive.gz
    This took only 39 seconds and didn’t restore all the contents, judging by the “show dbs” sizes:
    admin                       0.000GB
    config                      0.000GB
    fireworks                   0.000GB
    jerry                       0.001GB
    local                       0.000GB
    # more...
    simulations                 0.024GB
    
  • No doubt the db sizes could vary with fragmentation and such, but the source VM shows simulations at 17.497GB.
  • mongorestore’s --preserveUUID option caused some warning messages, so I retried without that.
  • Leaving out --maintainInsertionOrder or --objcheck or --nsInclude="simulations.*" didn’t make a noticeable difference. The show dbs sizes varied from 24 to 29 MB across runs, but maybe that’s just due to fragmentation.
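
One sanity check on the copied archive before restoring (using the filename from the dump step above):

    gzip -t mongo-2021-04-13.archive.gz && echo "gzip stream OK"
    md5sum mongo-2021-04-13.archive.gz   # compare against the same checksum computed on the source VM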

Notes:

  • We’re not using mongo users, authentication, or replication. No need to copy admin data.
  • It’s fine to have downtime during this conversion.
  • Our pymongo driver is up to date.
  • The new server works fine for FireWorks workflows. It just doesn’t have all the simulations data.
  • I’m a software developer with little MongoDB experience needing to do sysadmin duty on this.
  • Searching the web, Stack Overflow, and this forum didn’t turn up an answer, only an encouraging “You should not worry about it. You will not encounter any problems,” plus a bash script that runs the data through each intermediate major release of mongo, one Docker image per release. Using Docker is a great idea here, but I’m trying to avoid trudging through all those intermediate releases.

Q1. How to make mongorestore restore all of the simulations DB?
Q2. How to verify that it did, at least to the level of document counts and such?

Thanks so much!

A2. The mongo shell command db.stats() gives clear stats on a db.
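
For document-count-level checks, something along these lines on both the source and destination servers also works (assuming the legacy mongo shell and a default localhost connection):

    # overall db stats, including object counts and sizes
    mongo --quiet simulations --eval 'printjson(db.stats())'
    # per-collection document counts, easy to diff between the two servers
    mongo --quiet simulations --eval 'db.getCollectionNames().forEach(function (c) { print(c + ": " + db.getCollection(c).count()) })'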

Rerunning mongorestore with more verbosity (-vvvvv) didn’t log anything new, but I finally noticed “Killed” at the end of the mongorestore output, just before the time stats! So mongorestore was killed for running out of memory. :crazy_face:
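
If anyone wants to confirm that the “Killed” really was the kernel OOM killer (the exact message wording varies by kernel version):

    sudo dmesg -T | grep -Ei 'out of memory|killed process'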

A1. Run mongorestore from a VM with enough memory or swap enabled. (GCE VMs have swap disabled by default.)
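
For example, one minimal way to add temporary swap on a Debian GCE VM (the 32G size is just an illustration; size it for your restore):

    sudo fallocate -l 32G /swapfile
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile
    # ... run mongorestore, then clean up:
    sudo swapoff /swapfile && sudo rm /swapfile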

mongorestore reports it’s up to 81GB on simulations.history while show dbs shows 12.661GB for the entire simulations db. Apparently those stats aren’t directly comparable: mongorestore reports uncompressed BSON bytes restored, while show dbs reflects compressed on-disk storage.
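
db.stats() shows both numbers side by side; roughly, dataSize is the uncompressed BSON size (closer to what mongorestore reports) while storageSize is the compressed on-disk size (closer to what show dbs reports):

    mongo --quiet simulations --eval 'var s = db.stats(); print("dataSize:    " + s.dataSize); print("storageSize: " + s.storageSize)'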

mongorestore (version 100.3.1) required a humongo 128GB of RAM to restore the simulations.history collection.

I managed that by temporarily resizing the Compute Engine VM (gcloud sketch below). Memory usage while running mongorestore --nsInclude="simulations.history" [etc.]:

This is while restoring one collection with storageSize: 18753490944.
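
For the record, the temporary resize itself can be scripted with gcloud (the instance name and machine type below are placeholders; the VM has to be stopped first, and --zone may be needed):

    gcloud compute instances stop my-mongo-vm
    gcloud compute instances set-machine-type my-mongo-vm --machine-type=e2-highmem-16
    gcloud compute instances start my-mongo-vm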

In comparison, mongodump (Go version go1.7.4) dumped and gzipped that 9GB archive on the same VM as the running mongo 3.2.11 server, a VM with only 1.75GB RAM and no swap space.

Maybe “steady state” use: