Index creation - slow read speed on scan collection

Francesco_Rivola · June 14, 2021, 7:46am

Hi, we are creating an index using the rolling build index pattern (https://docs.mongodb.com/manual/tutorial/build-indexes-on-replica-sets/) on a very large collection (aprox 450 M documents).

Index is a partial index where all documents at the beginning of the collection does not have the contentId field:

db.getCollection('activities').createIndex({
_userId: 1,
_userActivityTypeId: 1,
contentId: 1,
_entityId: 1,
_createdAt: 1
},
{
    partialFilterExpression: {
        contentId: { $exists: true }
    },
    name: 'activities__userId__userActivityTypeId__contentId__entityId__createdAt'
});

The creation is very slow, estimation time to complete is around 100 hours where our oplog time window is currently only 23 hours.

To improve the process we did the following tuning but without luck:

increased maxIndexBuildMemoryUsageMegabytes from 200 to 20000. However we did not see increase on memory usage.
turn off TTL Monitor

Looking at the I/O metrics we found that read speed is almost stable at 1.5 MB/s.

CPU is not bounded, memory is not under pressure, IOPS are low (largely under the limit of the VM/Disk)

During normal operation as secondary node we can observe the read speed go beyond 1.5 MB/s, and also reading an index file manually provide us the speed of 47.6 MB/s:

$ dd if=index-87--9072640376711127209.wt of=/dev/null

7854200+0 records in
7854200+0 records out
4021350400 bytes (4.0 GB, 3.7 GiB) copied, 84.4896 s, 47.6 MB/s

As commented above the index creation process is just scanning the collection and not write anything, as at the beginning all the documents do not have the contentId field. We can observe that the index file size stay always at 4096 bytes.

We have been looking at the doc to try to find any parameter that can improve the read speed of the scanning collection but without luck.

Does anybody know any way to speed up the collection scan speed? Or if this limitation is documented somewhere?

Further Info:

MongoDB 4.2.12
Topology: ReplicaSet 3 data bearing nodes
VM: Ubuntu 18.04.5 LTS (GNU/Linux 5.4.0-1046-azure x86_64)
CPU: 4 core
RAM: 32 GiB

Thank you so much,
Francesco

Pavel_Duchovny · June 16, 2021, 7:46am

Hi @Francesco_Rivola ,

Version 4.2 offers hybrid index builds which should enable you to avoid the rolling builds index , so rolling build might nit be necessary:

The new builds are not locking the workload as the old ones.

Moreover, in 4.4 we are building those indexes in all 3 nodes in parallel , is upgrading a possibility?

Otherwise rather than increasing resources and increasing oplog and waiting I don’t have much more ideas.

Best regards,
Pavel

Francesco_Rivola · June 17, 2021, 10:53am

Hi @Pavel_Duchovny ,

First of all, thank you so much for your reply.

Yes, we are aware of the new hybrid index in 4.2. However we were going with rolling builds index because in the past we experienced issues with secondaries suffering during index creation and not being able to keep up in sync with the primary.

Yes, we are planning to upgrade to 4.4. In fact, we already upgraded but had to revert to 4.2 due to a bug in MongoDB 4.4.4 (it should be fixed now in 4.4.6). We are definitely interested in parallel index creation and hidden index features.

Do you know if the 1.5 MB/sec read speed scanning collection during index creation is due to some technical limitation really not tied to the server resources? I guess in any case there is no parameter/setting to tune that part other that what you suggest.

Thank you so much.
Best regards,
Francesco

Pavel_Duchovny · June 18, 2021, 5:11am

Hi @Francesco_Rivola ,

To analyse a performance of this type we need full logs and diagnostics.

This is best covered by a support subscription.

Please consider contacting our sales in that direction as it will be the best approach.

Thanks
Pavel