
How to throttle mongo-spark-connector

My app server creates queries and inserts data into MongoDB based on live user actions. This is important and should take precedence over the reads Spark performs for data analysis, which run concurrently. At present we get timeouts when trying to do live action-based queries during Spark read tasks.

How do I throttle down the load mongo-spark-connector puts on Mongo so that my live input can continue to be inserted while Spark is reading from Mongo?

UPDATE: Maybe a clue to controlling the load from Spark is what the load is related to: the number of partitions, the number of cores for the job, something in the Spark config or job config?
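For what it's worth, the knobs Spark itself exposes for capping a job's parallelism are standard Spark properties, not connector-specific ones. A sketch of a submit command that bounds how many cores can hit the database at once (the numbers are illustrative, and `spark.cores.max` applies to standalone/Mesos clusters; on YARN you would cap `spark.executor.instances` instead):

```shell
# Cap total cores (standalone/Mesos) and per-executor cores so at most
# a bounded number of tasks can open cursors against MongoDB concurrently.
# Values here are placeholders, not recommendations.
spark-submit \
  --conf spark.cores.max=8 \
  --conf spark.executor.cores=2 \
  --conf spark.dynamicAllocation.enabled=false \
  your_job.py
```

Since each read partition is processed by one task, fewer concurrently running cores generally means fewer simultaneous cursors on the server.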

There are a few things to consider. First, make sure your Spark job's connections specify a read preference of secondary or secondaryPreferred. This will take the read burden off of the primary. If you are also writing and still have issues, you may want to add additional mongos routers and use log files to further troubleshoot where the bottleneck is.
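With the mongo-spark-connector (v2.x option names shown; assumed here, check the version you run), the read preference can be set as a Spark configuration property rather than in code:

```shell
# Route Spark's reads to secondaries so the primary serves live traffic.
--conf spark.mongodb.input.readPreference.name=secondaryPreferred
```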

We understand how to scale Mongo but that is not our problem. The problem we face is that Spark can be HUGE in terms of the number of cores used. We should not have to scale Mongo to support this since we only use Spark for a few hours per week to generate ML models. The rest of the time mongo performs quite well for data input, which comes in continuously. What we need to do is scale the mongo-spark-connector so it doesn’t overload the mongo nodes we have already scaled to fit our live load (+ some margin). In short we need to scale the load put on mongo from Spark, not scale Mongo to handle Spark load, which (without some way to throttle mongo-spark-connector) is FAR in excess of what is normally needed by the system.

For example when we write from Spark we can “repartition” the Spark Dataset and this indirectly throttles the connections made for the write.

But for reads there is no Dataset to repartition until after the Spark read happens, so the load of the read itself can't be throttled that way.
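One place the read-side partitioning can be influenced is the connector's partitioner options: with the sample partitioner, a larger `partitionSizeMB` yields fewer (larger) partitions, and with capped cores, fewer concurrent cursors. A sketch assuming mongo-spark-connector 2.x and PySpark; the URIs, database/collection names, and numbers are illustrative placeholders, not tuned values:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ml-model-build")
    # Read from secondaries so the primary stays free for live inserts.
    .config("spark.mongodb.input.readPreference.name", "secondaryPreferred")
    # Bigger partitions -> fewer partitions -> fewer concurrent read cursors.
    .config("spark.mongodb.input.partitioner", "MongoSamplePartitioner")
    .config("spark.mongodb.input.partitionerOptions.partitionSizeMB", "256")
    .getOrCreate()
)

# Read side: partition count is decided by the partitioner config above.
df = (
    spark.read.format("com.mongodb.spark.sql.DefaultSource")
    .option("uri", "mongodb://host:27017/db.events")
    .load()
)

# Write side: repartition caps the number of simultaneous writer tasks,
# which is the indirect throttle mentioned above.
(
    df.repartition(8)
    .write.format("com.mongodb.spark.sql.DefaultSource")
    .option("uri", "mongodb://host:27017/db.features")
    .mode("append")
    .save()
)
```

This doesn't rate-limit individual cursors, but combined with a cores cap it bounds how many connections the job holds open against the cluster at any moment.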

Hope this helps explain the issue and many thanks for your attention. We love Mongo and hope that solving this will allow others to use it in a similar way.