Upgrade driver versions unstable

Arik_Sasson · March 16, 2021, 12:56pm

Hi,
We are having jobs that use the mongo spark connector: https://mvnrepository.com/artifact/org.mongodb.spark/mongo-spark-connector_2.12/3.0.0
A month ago we pinned the connector to a specific driver version, 4.0.5. After a few days of successful runs, the jobs started failing, and the only way we got the process to run again was to upgrade to a newer driver version, 4.2.0.

Again, after a few days of successful runs, the same process, configured with the 3.0.0 connector and driver 4.2.0, failed, and the only fix that worked was upgrading to a newer driver version, 4.2.2.

It seems that whenever a new driver version is released, the existing older version starts failing, and we can’t figure out why.
Is this something other users have seen?
The following is our latest configuration, which has been stable so far, but we can’t be sure it won’t fail once a newer driver version is released:
--packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.0,org.mongodb:mongodb-driver-sync:4.2.2
Can you please assist?

Ross_Lawley · March 16, 2021, 1:20pm

Hi @Arik_Sasson,

The 3.0.0 Spark connector sets the sync Java driver version in its pom to 4.0.5 and is only tested with that combination.
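As a rough illustration of the tested combination, the sketch below builds a `spark-submit` command line that lists only the connector in `--packages`, leaving the driver version to the connector's pom. It relies on `--packages` resolving transitive dependencies, as the Spark documentation describes. The helper names `packages_arg` and `spark_submit_command` and the script name `job.py` are hypothetical and not part of Spark or the connector.

```python
import re

# Maven coordinate in the "group:artifact:version" form that --packages expects.
_COORD = re.compile(r"^[\w.\-]+:[\w.\-]+:[\w.\-]+$")


def packages_arg(coordinates):
    """Validate Maven coordinates and join them into a --packages value."""
    for coord in coordinates:
        if not _COORD.match(coord):
            raise ValueError(f"not a group:artifact:version coordinate: {coord!r}")
    return ",".join(coordinates)


def spark_submit_command(script, coordinates):
    """Return the argv list for spark-submit (hypothetical helper)."""
    return ["spark-submit", "--packages", packages_arg(coordinates), script]


# Connector only: no explicit mongodb-driver-sync override, so the driver
# version is whatever the connector's pom declares.
tested = spark_submit_command(
    "job.py",  # hypothetical script name
    ["org.mongodb.spark:mongo-spark-connector_2.12:3.0.0"],
)
print(" ".join(tested))
```

Adding an explicit `org.mongodb:mongodb-driver-sync:<version>` coordinate to the list reproduces the untested combinations discussed in this thread.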

I think you may need to provide more information regarding the errors you are seeing, to help understand the cause.

Ross

Almog_Gelber · March 16, 2021, 3:14pm

Hey,

We got two types of errors:

When using mongo driver version 4.2.0 we got:

py4j.protocol.Py4JJavaError: An error occurred while calling o119.load.
: java.lang.NoClassDefFoundError: com/mongodb/client/model/WriteModel
at com.mongodb.spark.sql.DefaultSource.constructRelation(DefaultSource.scala:89)
at com.mongodb.spark.sql.DefaultSource.createRelation(DefaultSource.scala:61)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:342)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:221)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: com.mongodb.client.model.WriteModel
at java.net.URLClassLoader$1.run(URLClassLoader.java:371)
at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
… 19 more
Caused by: java.util.zip.ZipException: invalid LOC header (bad signature)
at java.util.zip.ZipFile.read(Native Method)
at java.util.zip.ZipFile.access$1400(ZipFile.java:60)
at java.util.zip.ZipFile$ZipFileInputStream.read(ZipFile.java:734)
at java.util.zip.ZipFile$ZipFileInflaterInputStream.fill(ZipFile.java:434)
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
at sun.misc.Resource.getBytes(Resource.java:124)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:463)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
… 24 more

When using version 4.0.5 we got partitioner errors (with all kinds of partitioners), for example this one:

Partitioning using the ‘MongoShardedPartitioner’ failed.

Please check the stacktrace to determine the cause of the failure or check the Partitioner API documentation.
Note: Not all partitioners are suitable for all toplogies and not all partitioners support views.%n

Ross_Lawley · March 16, 2021, 3:32pm

Hi @Almog_Gelber,

Firstly, it’s unclear why a partitioner that previously ran would suddenly fail. Without further information, it’s impossible to determine the cause.

As for NoClassDefFoundError: com/mongodb/client/model/WriteModel, it indicates that there is an issue with the class path and the required WriteModel class is not available. Have the Spark executors and all the Spark driver nodes been updated?
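One concrete thing to check on the class path: the first stack trace in this thread ends in `java.util.zip.ZipException: invalid LOC header (bad signature)`, which usually means the JVM tried to load a class from a damaged or partially written JAR. The sketch below scans a directory for JARs that fail a ZIP integrity check. It assumes the JARs fetched by `--packages` are in `~/.ivy2/jars` (Spark's default Ivy location unless `spark.jars.ivy` is set). The function name `find_corrupt_jars` is made up for illustration.

```python
import zipfile
import zlib
from pathlib import Path


def find_corrupt_jars(jar_dir):
    """Return (jar_name, reason) pairs for JARs that fail a ZIP integrity check."""
    corrupt = []
    for jar in sorted(Path(jar_dir).expanduser().glob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                # testzip() reads every member and returns the first bad one, if any.
                bad_member = zf.testzip()
        except (zipfile.BadZipFile, zlib.error, OSError) as exc:
            corrupt.append((jar.name, str(exc)))
            continue
        if bad_member is not None:
            corrupt.append((jar.name, f"bad entry: {bad_member}"))
    return corrupt


if __name__ == "__main__":
    # Assumed default location; adjust if spark.jars.ivy points elsewhere.
    for name, reason in find_corrupt_jars("~/.ivy2/jars"):
        print(f"{name}: {reason}")
```

If this reports a damaged JAR, it might also explain why switching driver versions appeared to help, since a new version forces a fresh download. That is a guess, not something confirmed in this thread.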

“Again, after few days of successful running, the process that configures the same with 3.0.0 connector and driver 4.2.0 fail and the only solution that succeeded is upgrading to a newer driver version 4.2.2”

This also suggests that the real cause of the error has not been determined.

If the Spark job takes multiple days to run and it failed in the initial instance, then just updating the driver wouldn’t necessarily be expected to fix the issue (unless there was a driver bug).

So I think the next step would be to determine the root cause of failure and proceed from there.

Ross

Arik_Sasson · March 17, 2021, 7:57am

Hi Ross,
Thanks for your quick response.
Actually, if we had known the root cause, we would have tried to fix it ourselves.
The problem is that we don’t have a lead on the root cause, and the flow described in the last reply is all the information we have.
We have not made any changes, yet the process failed twice, and the only thing we noticed that might be related to the failures is the release of a new driver version.

If you can point us toward the root cause, or if you have seen other MongoDB users run into the same situation, that would be very helpful.
Thanks,

Ross_Lawley · March 17, 2021, 11:08am

Hi @Arik_Sasson,

There isn’t enough information here to understand the cause of the failure. Ideally, providing a minimal reproducible example would help as I could replicate the issue to understand the cause.

Failing that, more information about the Spark job is required to understand what the error actually is:

What version of Spark?
What language is the Spark job running?
What version of MongoDB?
What configuration for Spark?
What does the Spark job do?
How long does the Spark job take to run?
“After few days of successful runnings, the jobs fail.” Does this mean the jobs take multiple days or are run multiple times?
What errors occur when the jobs fail?
Can you provide Spark logs?
Are you using RDDs or Datasets / DataFrames?
If using Datasets / DataFrames, are you providing the schema?

Once I understand more about the error, I can help look at ways to mitigate the error.

All the best,

Ross

Almog_Gelber · March 17, 2021, 1:50pm

spark 3.0.1
pyspark
not sure (@Arik_Sasson can you tell?)
spark config - default. dynamic allocation with 200 max executors.
the job reads from a mongo collection and stores the data in s3
about 10-20 minutes
it runs each day to take the updated data, after a few successful runs, it suddenly fails
added the errors /logs in my previous response
using data frames
yes, schema is provided

Ross_Lawley · March 17, 2021, 2:05pm

Hi,

So to clarify, the only error you are seeing is a NoClassDefFoundError:

py4j.protocol.Py4JJavaError: An error occurred while calling o119.load.
: java.lang.NoClassDefFoundError: com/mongodb/client/model/WriteModel

And this error occurs even though running the job previously worked and nothing else has changed?

One of the Spark workers (the nodes that run the executors) is not configured correctly. As far as I can tell, either it has a different version of the Mongo Spark Connector installed or a partial installation of the Mongo Spark Connector (without the Mongo Java driver classes).
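To test the "partial installation" idea on a given node, one option is to look for the missing class directly in the JARs on that node. The sketch below lists which JARs in a directory contain a given class. The function name `jars_providing` is hypothetical, and the directory to scan (for example the Ivy JAR directory used by `--packages`) is an assumption about your setup.

```python
import zipfile
from pathlib import Path


def jars_providing(class_name, jar_dir):
    """Return the names of JARs in jar_dir that contain the given class."""
    # com.mongodb.client.model.WriteModel -> com/mongodb/client/model/WriteModel.class
    entry = class_name.replace(".", "/") + ".class"
    providers = []
    for jar in sorted(Path(jar_dir).expanduser().glob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                if entry in zf.namelist():
                    providers.append(jar.name)
        except zipfile.BadZipFile:
            # Unreadable archives cannot provide the class; skip them here.
            continue
    return providers


if __name__ == "__main__":
    # Assumed location of the JARs fetched by --packages.
    print(jars_providing("com.mongodb.client.model.WriteModel", "~/.ivy2/jars"))
```

An empty result would mean the class is missing on that node, while more than one provider would point to conflicting driver JARs on the class path.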

Ross

Arik_Sasson · March 17, 2021, 3:46pm

The Mongo version is: 4.0.10

Almog_Gelber · March 17, 2021, 3:59pm

@Ross_Lawley
Thanks for the quick response.

We are running in client mode using a single master node (and many workers, of course).
We install the connector using --packages in the spark-submit command.
Why do you think there could be an issue with the installation, and why would it suddenly happen after many successful runs?

Thanks,
Almog

Arik_Sasson · March 21, 2021, 8:13am

Hi @Ross_Lawley
Is there any other information we could supply that would help point to the root cause?

Ross_Lawley · March 22, 2021, 10:27am

Hi @Arik_Sasson,

I would check the class path on each of the worker nodes and ensure that it is as expected. Also, it would be worth double-checking that each worker node is running the correct version of Spark.
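One way to make that comparison less manual is to fingerprint the JARs on each node and compare the results. The sketch below is a minimal version of that idea. The function names `jar_manifest` and `manifest_differences` are hypothetical, and how the per-node manifests are collected (SSH, a config management tool, and so on) is left out and depends on your cluster.

```python
import hashlib
from pathlib import Path


def jar_manifest(jar_dir):
    """Map each JAR file name in jar_dir to the SHA-256 of its contents."""
    manifest = {}
    for jar in sorted(Path(jar_dir).expanduser().glob("*.jar")):
        manifest[jar.name] = hashlib.sha256(jar.read_bytes()).hexdigest()
    return manifest


def manifest_differences(manifests):
    """Given {node: {jar_name: sha256}}, return the JARs that are not identical everywhere.

    The result maps jar_name -> {node: sha256 or None}, where None means
    the JAR is absent on that node.
    """
    all_jars = set()
    for manifest in manifests.values():
        all_jars.update(manifest)
    differences = {}
    for jar in sorted(all_jars):
        per_node = {node: manifest.get(jar) for node, manifest in manifests.items()}
        if len(set(per_node.values())) > 1:
            differences[jar] = per_node
    return differences
```

Running `jar_manifest` on every worker and passing the collected results to `manifest_differences` should return an empty dict when all nodes hold identical JARs. Any entry it does return names a JAR that is missing or different on at least one node.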

Ross