Oplog dump issues

Hello,

I have scheduled oplog incremental backups, but they fail too often with the errors below. I have tried increasing the WiredTiger cache and the oplog size as well, but it didn't make any difference.

```
2020-07-19T09:06:08.711+0000 I STORAGE [WT RecordStoreThread: local.oplog.rs] WiredTiger record store oplog truncation finished in: 282ms
2020-07-19T09:06:08.726+0000 E QUERY [conn73957] Plan executor error during find command: DEAD, stats: { stage: "COLLSCAN", filter: { $and: [ { ts: { $lte: Timestamp(1595149556, 11466) } }, { ts: { $gt: Timestamp(1595148019, 493) } } ] }, nReturned: 0, executionTimeMillisEstimate: 9010, works: 1222090, advanced: 0, needTime: 1222089, needYield: 0, saveState: 9553, restoreState: 9553, isEOF: 0, invalidates: 0, direction: "forward", docsExamined: 1222088 }
2020-07-19T09:06:08.728+0000 I COMMAND [conn73957] command local.oplog.rs command: find { find: "oplog.rs", filter: { ts: { $gt: Timestamp(1595148019, 493), $lte: Timestamp(1595149556, 11466) } }, skip: 0, $readPreference: { mode: "secondaryPreferred" }, $db: "local" } planSummary: COLLSCAN numYields:9553 reslen:522 locks:{ Global: { acquireCount: { r: 19108 }, acquireWaitCount: { r: 129 }, timeAcquiringMicros: { r: 2253882 } }, Database: { acquireCount: { r: 9554 } }, oplog: { acquireCount: { r: 9554 } } } protocol:op_query 11479ms
```

Also, I have noticed that whenever the "STORAGE [WT RecordStoreThread: local.oplog.rs] WiredTiger record store oplog truncation finished in" thread runs just before my scheduled oplog dump, the dump fails.

```
2020-08-01T04:39:09.679+0000 Failed: error writing data for collection local.oplog.rs to disk: error reading collection: Executor error during find command :: caused by :: errmsg: "CollectionScan died due to position in capped collection being deleted. Last seen record id: RecordId(6855387362038385666)"
```

Note: The timestamps differ because I picked random logs from different dates, but the error is the same every time.
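For context, an incremental oplog dump of this kind is typically driven by a saved checkpoint timestamp, roughly like the sketch below. The host, paths, and checkpoint file are illustrative assumptions, not the actual script; the `mongodump` command is printed rather than executed.

```shell
#!/usr/bin/env bash
# Sketch of an incremental oplog dump driven by a saved checkpoint
# timestamp. Host, paths and the checkpoint file name are assumptions.

CHECKPOINT_FILE=/var/backups/oplog.last_ts     # stores "seconds,increment"
LAST_TS=$(cat "$CHECKPOINT_FILE" 2>/dev/null || echo "0,0")
SECS=${LAST_TS%,*}
INC=${LAST_TS#*,}

# Only dump entries newer than the checkpoint; this produces exactly the
# { ts: { $gt: ... } } find visible in the server log above.
QUERY="{ \"ts\": { \"\$gt\": { \"\$timestamp\": { \"t\": $SECS, \"i\": $INC } } } }"

MONGO_HOST=${MONGO_HOST:-replica-member:27017}

# Printed here rather than executed.
echo mongodump --host "$MONGO_HOST" --db local --collection oplog.rs \
     --query "$QUERY" --out "/var/backups/oplog-$(date +%Y%m%d)"
```

Because this find must collection-scan the capped oplog from its head, a slow scan can collide with the background truncation seen in the log.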

Could anyone please help?

Hello @Divyanshu_Soni

It would be helpful for others to read and respond to your question if you could apply the proper code formatting to your post.

• Log lines can be formatted better by prefixing them with > (aka quoting). This will wrap long lines and provide some visual distinction from code blocks. The editing toolbar has a shortcut for toggling line quoting on the currently selected text. Note: you can also use the (```) formatting for log snippets, but quoting generally works better.

• Add spacing to break up long blocks of text and improve readability

• Add triple backticks (```) before and after a snippet of code. This syntax (aka code fencing in GitHub-flavoured Markdown) will automatically detect language formatting and is generally an improvement over a straight copy & paste.

• Review the Getting Started guide from @Jamie which has some further great tips and information.

Cheers,
Michael


Hi @Michael, below is the error from the mongod logs.

```
2020-07-14T21:24:14.035+0000 I STORAGE [WT RecordStoreThread: local.oplog.rs] WiredTiger record store oplog truncation finished in: 330ms
2020-07-14T21:24:14.076+0000 E QUERY [conn45653] Plan executor error during find command: DEAD, stats: { stage: "COLLSCAN", filter: { $and: [ { ts: { $lte: Timestamp(1594761831, 5704) } }, { ts: { $gt: Timestamp(1594759104, 1447) } } ] }, nReturned: 0, executionTimeMillisEstimate: 10900, works: 3811994, advanced: 0, needTime: 3811993, needYield: 0, saveState: 29782, restoreState: 29782, isEOF: 0, invalidates: 0, direction: "forward", docsExamined: 3811992 }
```

And this is from the script that I am running:

```
oplog-prd-mon-bzm-shard-hd03.c.blazemeter-dr.internal-20200713.log-Enter password:
oplog-prd-mon-bzm-shard-hd03.c.blazemeter-dr.internal-20200713.log-2020-07-14T21:23:51.869+0000 writing local.oplog.rs to stdout
oplog-prd-mon-bzm-shard-hd03.c.blazemeter-dr.internal-20200713.log-2020-07-14T21:23:54.498+0000 local.oplog.rs 0
oplog-prd-mon-bzm-shard-hd03.c.blazemeter-dr.internal-20200713.log-2020-07-14T21:23:57.498+0000 local.oplog.rs 0
oplog-prd-mon-bzm-shard-hd03.c.blazemeter-dr.internal-20200713.log-2020-07-14T21:24:00.498+0000 local.oplog.rs 0
oplog-prd-mon-bzm-shard-hd03.c.blazemeter-dr.internal-20200713.log-2020-07-14T21:24:03.498+0000 local.oplog.rs 0
oplog-prd-mon-bzm-shard-hd03.c.blazemeter-dr.internal-20200713.log-2020-07-14T21:24:06.498+0000 local.oplog.rs 0
oplog-prd-mon-bzm-shard-hd03.c.blazemeter-dr.internal-20200713.log-2020-07-14T21:24:09.498+0000 local.oplog.rs 0
oplog-prd-mon-bzm-shard-hd03.c.blazemeter-dr.internal-20200713.log-2020-07-14T21:24:12.498+0000 local.oplog.rs 0
oplog-prd-mon-bzm-shard-hd03.c.blazemeter-dr.internal-20200713.log-2020-07-14T21:24:14.093+0000 local.oplog.rs 0
oplog-prd-mon-bzm-shard-hd03.c.blazemeter-dr.internal-20200713.log:2020-07-14T21:24:14.093+0000 Failed: error writing data for collection local.oplog.rs to disk: error reading collection: Executor error during find command :: caused by :: errmsg: "CollectionScan died due to position in capped collection being deleted. Last seen record id: RecordId(6848267264039321613)"
```

Also, I have a 1TB oplog and a good amount of writes against the cluster. The oplog window is also large, so there should be no possibility of the documents being truncated while the dump is running.
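One sanity check that could be added before each run is to verify that the checkpoint is still inside the oplog window, i.e. strictly newer than the oldest retained entry. This is only a sketch: the timestamps are illustrative, and the mongo invocation in the comment is indicative of how the oldest entry could be fetched, not a tested command.

```shell
# Sketch: confirm the checkpoint is still covered by the oplog window
# before dumping. The oldest entry's ts could be fetched with something
# along the lines of (indicative only):
#   mongo --quiet local --eval \
#     'db.oplog.rs.find({}, {ts: 1}).sort({$natural: 1}).limit(1)'

checkpoint_in_window() {
  # $1 = oldest oplog entry (epoch seconds), $2 = checkpoint (epoch seconds)
  [ "$2" -gt "$1" ]
}

# Values here are illustrative.
if checkpoint_in_window 1595140000 1595148019; then
  echo "checkpoint still covered by the oplog window"
else
  echo "checkpoint already truncated; a full resync of the backup is needed" >&2
fi
```

Note that even with a large window, the error in the logs is about the scan's *current position* being truncated mid-scan, so a slow collection scan can still race with the truncation thread.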

I have defined a 70GB WiredTiger cache. Can you please help me avoid this issue?

Apart from this, the mongodump takes a long time to finish.

RAM: 128GB

CPU usage / iostat during the dump:

```
avg-cpu:  %user  %nice %system %iowait %steal  %idle
           4.69   0.00    1.14    0.64   0.00  93.53

Device:  rrqm/s wrqm/s    r/s    w/s    rkB/s   wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdc        0.00   0.00   0.02   0.10     3.44   16.32   336.89     0.00  1.77    4.26    1.39  0.84  0.01
sda        0.00   0.04   0.26   0.87     9.21    7.61    29.95     0.00  1.45    2.27    1.20  0.55  0.06
dm-1       0.00   0.00   0.02   0.07     3.43   16.32   474.55     0.00  2.50    4.27    2.10  1.18  0.01
sdb        0.00   0.70 202.89 421.49 12929.15 8300.63    68.00     0.27  0.44    0.11    0.60  0.21 12.85
dm-0       0.00   0.00 202.91 401.28 12929.15 8300.63    70.28     0.29  0.48    0.12    0.66  0.21 12.87
```

Please let me know if you need more details.

Regards,
Divyanshu Soni

Hello @Divyanshu_Soni,

thanks for the update. mongodump is not the most efficient or scalable backup approach to use, as it requires all data to be read and dumped via the mongod process. A mongorestore has to recreate all data files and rebuild indexes, so will also take longer to restore than a backup approach such as filesystem snapshots.

If your data to be backed up is significantly larger than RAM, the backup and restore time will increase with the growth in your data set.

If you are backing up a sharded cluster, there are more moving parts to coordinate and mongodump is not a viable backup approach if you are also using sharded transactions in MongoDB 4.2+.
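As a rough illustration of the snapshot approach: the sketch below assumes the dbPath sits on an LVM volume (names like `vg0/mongodata` are placeholders) and that it runs against a secondary; commands are printed rather than executed, so adapt before using anything like this for real.

```shell
# Rough sketch of a filesystem snapshot backup on a secondary.
# Volume names are assumptions; commands are echoed, not executed.
run() { echo "+ $*"; }

# 1. Flush and lock writes (needed when the journal and data files live
#    on separate volumes; otherwise the snapshot alone is consistent).
run mongo --eval 'db.fsyncLock()'

# 2. Snapshot the data volume.
run lvcreate --size 10G --snapshot --name mongo-backup /dev/vg0/mongodata

# 3. Unlock writes again.
run mongo --eval 'db.fsyncUnlock()'
```

Unlike mongodump, this captures the data files as-is, so restore is a copy plus a normal mongod startup instead of a full rebuild of collections and indexes.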

Concerning the concrete error, I'd like to pass this on to the DBAs around here.

Cheers,
Michael