Massive difference between max number of ops/sec with PyMongo and MongoDB Java driver

Ahmed_Dhanani · April 13, 2020, 6:55pm

Hi all. I have been looking into performance for an application that I am developing with Flask, served via Gunicorn. MongoDB is my primary database, and I connect to it from Python with PyMongo. Lately I felt that something was slowing down calls to the DB, and to the application server as well. To investigate, I compared PyMongo with another driver (Java) using two scripts. My Python script looks like this:

import time
import pymongo
import multiprocessing.pool

if __name__ == '__main__':
    client = pymongo.MongoClient('mongodb://username:pass@DB_URL2,DB_URL3,DB_URL4,DB_URL5/test-r-demo?replicaSet=rs0&retryWrites=true&readPreference=secondary')
    collection = client['test-r-demo']['fpi_user']
    TOTAL_OPS = 500000
    C_THREADS = 50

    def work(collection):
        collection.find_one({'phone_number': '03052506670'}, {'id': 1})

    with multiprocessing.pool.ThreadPool(C_THREADS) as p:
        # One argument per operation; every task shares the same collection handle.
        tasks = [collection] * TOTAL_OPS

        start_time = time.time()
        p.map(work, tasks)
        end_time = time.time()
        end_time = time.time()
        print('Total {} operations, with {} threads, took {}s'.format(TOTAL_OPS, C_THREADS, round(end_time - start_time, 3)))

While running the above script, I logged on to the DB machines and ran mongostat to observe the query ops/sec. It barely crosses the 1200 mark on each DB machine.
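
The mongostat figure can be cross-checked from the client side. Below is a minimal sketch, assuming a hypothetical helper `measure_throughput` (not part of PyMongo) that runs any zero-argument callable on a thread pool and reports operations per second. The no-op stand-in would be swapped for the `find_one` call from the script above.

```python
import time
from multiprocessing.pool import ThreadPool


def measure_throughput(op, total_ops, n_threads):
    """Run op() total_ops times on n_threads threads; return (elapsed_s, ops_per_sec).

    Hypothetical helper for illustration, not part of PyMongo.
    """
    with ThreadPool(n_threads) as pool:
        start = time.perf_counter()
        # map() blocks until every call has finished, so the timer covers all the work.
        pool.map(lambda _: op(), range(total_ops))
        elapsed = time.perf_counter() - start
    return elapsed, total_ops / elapsed


def noop():
    """Stand-in for a database call such as collection.find_one(...)."""
    return None


elapsed, ops_per_sec = measure_throughput(noop, total_ops=10_000, n_threads=8)
print('{} ops in {:.3f}s -> {:.0f} ops/sec'.format(10_000, elapsed, ops_per_sec))
```

With the real workload, `op` would be something like `lambda: collection.find_one({'phone_number': '03052506670'}, {'id': 1})`.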

However, the results are totally different when using the Java MongoDB driver:

StressTest.java

import com.mongodb.*;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
//import com.mongodb.MongoClientURI;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;
import org.example.stresstest.StressTestThread;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import static com.mongodb.client.model.Filters.eq;

public class StressTest {
    public static void main(String[] args) {


        int NUM_OPS = 500000;
        int NUM_THREADS = 50;
        /*
        initializing mongo client
        */
        MongoClient client = MongoClients.create("mongodb://username:pass@DB_URL2,DB_URL3,DB_URL4,DB_URL5/test-r-demo?replicaSet=rs0&retryWrites=true&readPreference=secondary");

        /*
        initialize database
        */
        MongoDatabase database = client.getDatabase("test-r-demo");

        /*
        initialize collection
        */
        MongoCollection<Document> collection = database.getCollection("fpi_user");

        /*
        Initialize thread pool executor with fixed threads
        */
        ThreadPoolExecutor threadPool = (ThreadPoolExecutor)Executors.newFixedThreadPool(NUM_THREADS);

        /*
        Loop for retrieving data from database using collection
        for this we are creating threads and assign task to each thread
        */
        long startTime = System.currentTimeMillis();
        for (long i=0; i<NUM_OPS; i++) {
            Runnable runnable = new StressTestThread(collection);
            threadPool.execute(runnable);
        }

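        threadPool.shutdown();
        try {
            // shutdown() returns immediately; wait for the queued tasks so the timing covers the actual work.
            threadPool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        double totalSeconds = (System.currentTimeMillis() - startTime) / 1000.0;
        // Average throughput is operations divided by elapsed time, not elapsed time divided by thread count.
        String out = String.format("Completed %d operations in %.3f seconds with an average of %.0f ops/sec.", NUM_OPS, totalSeconds, NUM_OPS / totalSeconds);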
        System.out.println(out);

    }
}

StressTestThread.java

package org.example.stresstest;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Projections.*;

public class StressTestThread implements Runnable {

    private MongoCollection<Document> collection;

    public StressTestThread(MongoCollection<Document> collection){
        this.collection = collection;
    }

    @Override
    public void run() {
        Document myDoc = collection.find(eq("phone_number", "03052506670")).projection(fields(include("id"), excludeId())).first();
        //System.out.println(myDoc.get("id"));
    }
}

Surprisingly, with MongoDB’s Java driver the max ops/sec easily reaches 13k. What could be the potential problem?
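
One check that might narrow this down, offered as an assumption rather than a confirmed diagnosis: in standard (GIL) builds of CPython only one thread executes Python bytecode at a time, so a threaded client can be limited by its own CPU use rather than by the server. The sketch below compares process CPU time with wall-clock time for a threaded run. `cpu_to_wall_ratio` and the `busy` stand-in are hypothetical names for illustration. A ratio near 1.0 would suggest the client is saturating roughly one core, while a ratio well below 1.0 would suggest it is mostly waiting on the network or the server.

```python
import time
from multiprocessing.pool import ThreadPool


def cpu_to_wall_ratio(op, total_ops, n_threads):
    """Return process CPU seconds divided by wall-clock seconds for a threaded run.

    Hypothetical diagnostic helper, not part of PyMongo.
    """
    with ThreadPool(n_threads) as pool:
        wall_start = time.perf_counter()
        cpu_start = time.process_time()  # CPU time summed over all threads in this process
        pool.map(lambda _: op(), range(total_ops))
        cpu_used = time.process_time() - cpu_start
        wall_used = time.perf_counter() - wall_start
    return cpu_used / wall_used


def busy():
    """CPU-bound stand-in; replace with the real collection.find_one(...) call."""
    sum(range(2_000))


ratio = cpu_to_wall_ratio(busy, total_ops=20_000, n_threads=8)
print('CPU/wall ratio: {:.2f}'.format(ratio))
```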

Shane · April 14, 2020, 5:15pm

I noticed that the Java benchmark uses excludeId() in the query projection but the Python benchmark does not. Excluding _id allows the server to perform the covered query optimization, which may explain some of the difference. Please make this change to exclude the _id field:

    def work(collection):
        collection.find_one({'phone_number': '03052506670'}, {'id': 1, '_id': 0})
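
Whether the query is actually covered can be checked from an explain document. A covered query needs an index that contains every field in the filter and the projection (here `phone_number` and `id`), and it shows up as an index scan with zero documents examined and no FETCH stage in the winning plan. Below is a minimal sketch, assuming the explain output has the usual `queryPlanner.winningPlan` and `executionStats.totalDocsExamined` fields. `plan_stages`, `looks_covered`, and the sample document (including its index name) are hypothetical and hand-written for illustration, not real server output.

```python
def plan_stages(node):
    """Collect every 'stage' name found anywhere inside a (nested) plan document."""
    stages = []
    if isinstance(node, dict):
        if 'stage' in node:
            stages.append(node['stage'])
        for value in node.values():
            stages.extend(plan_stages(value))
    elif isinstance(node, list):
        for item in node:
            stages.extend(plan_stages(item))
    return stages


def looks_covered(explain_doc):
    """Heuristic: index scan present, no FETCH stage, and no documents examined."""
    stages = plan_stages(explain_doc['queryPlanner']['winningPlan'])
    docs_examined = explain_doc['executionStats']['totalDocsExamined']
    return 'IXSCAN' in stages and 'FETCH' not in stages and docs_examined == 0


# Hand-written sample shaped like explain output (an assumption, not real server output).
sample = {
    'queryPlanner': {
        'winningPlan': {
            'stage': 'PROJECTION_COVERED',
            'inputStage': {'stage': 'IXSCAN', 'indexName': 'phone_number_1_id_1'},
        }
    },
    'executionStats': {'totalDocsExamined': 0},
}
print(looks_covered(sample))
```

Against a live server, the document would come from `collection.find({'phone_number': '03052506670'}, {'id': 1, '_id': 0}).explain()`, assuming the response includes `executionStats`.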

Ahmed_Dhanani · April 14, 2020, 8:35pm

I also tried removing the excludeId() projection from the Java snippet to see whether it degrades or otherwise changes the figures, but it stays at ~13K ops/sec.

Ahmed_Dhanani · April 14, 2020, 8:35pm

Hi Shane. Thanks for looking into it. I tried excluding the _id field in the projection, but that didn’t help. It still doesn’t cross 1200 ops/sec.