DBException thrown :: caused by :: 9001 socket exception

Hi there,
I am in the middle of a complex upgrade for a customer, essentially updating 3 clusters from 3.0.12 to a minimum of 3.4.

I have hit an issue with the logs which others have hit, but having tried all the suggestions I have no working solution… any help really appreciated.

On a small Dev cluster running 3.0.12 nodes in a Primary/Secondary/Arbiter configuration, unsharded, I have added a 3.2.22 data node as a secondary, and it is pushing out a lot of these, with very few connections/activity happening:

==================================================================
2020-02-12T20:42:12.953Z W -        [conn606] DBException thrown :: caused by :: 9001 socket exception [CLOSED] for 172.31.63.194:49488
2020-02-12T20:42:12.958Z I -        [conn606] 
 0x155c5e2 0x155c40d 0x14d8a50 0x150e006 0x150e71b 0x150e731 0x150e78d 0x14ff75d 0x150242e 0x7fdc06e0a6db 0x7fdc06b3388f
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"115C5E2","s":"_ZN5mongo15printStackTraceERSo"},{"b":"400000","o":"115C40D","s":"_ZN5mongo15printStackTraceEv"},{"b":"400000","o":"10D8A50","s":"_ZN5mongo11DBException13traceIfNeededERKS0_"},{"b":"400000","o":"110E006","s":"_ZN5mongo6Socket15handleRecvErrorEii"},{"b":"400000","o":"110E71B","s":"_ZN5mongo6Socket5_recvEPci"},{"b":"400000","o":"110E731","s":"_ZN5mongo6Socket11unsafe_recvEPci"},{"b":"400000","o":"110E78D","s":"_ZN5mongo6Socket4recvEPci"},{"b":"400000","o":"10FF75D","s":"_ZN5mongo13MessagingPort4recvERNS_7MessageE"},{"b":"400000","o":"110242E","s":"_ZN5mongo17PortMessageServer17handleIncomingMsgEPv"},{"b":"7FDC06E03000","o":"76DB"},{"b":"7FDC06A12000","o":"12188F","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.2.22", "gitVersion" : "105acca0d443f9a47c1a5bd608fd7133840a58dd", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "4.15.0-1057-aws", "version" : "#59-Ubuntu SMP Wed Dec 4 10:02:00 UTC 2019", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "C2070FF92CF0E7C7AF25D84027F691037262CEA2" }, { "b" : "7FFD040E5000", "path" : "linux-vdso.so.1", "elfType" : 3, "buildId" : "D05895E5E385880D40A2B0A20CF7D8C9B06423D6" }, { "b" : "7FDC07E27000", "path" : "/usr/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "0D054641049B9747C05D030262295DFDFDD3055D" }, { "b" : "7FDC079E4000", "path" : "/usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "9C228817BA6E0730F4FCCFAC6E033BD1E0C5620A" }, { "b" : "7FDC077DC000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "9826FBDF57ED7D6965131074CB3C08B1009C1CD8" }, { "b" : "7FDC075D8000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "25AD56E902E23B490A9CCDB08A9744D89CB95BCC" }, { "b" : "7FDC0723A000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "A33761AB8FB485311B3C85BF4253099D7CABE653" }, { "b" : "7FDC07022000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "41BDC55C07D5E5B1D8AB38E2C19B1F535855E084" }, { "b" : "7FDC06E03000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "28C6AADE70B2D40D1F0F3D0A1A0CAD1AB816448F" }, { "b" : "7FDC06A12000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "B417C0BA7CC5CF06D1D1BED6652CEDB9253C60D0" }, { "b" : "7FDC0808F000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "64DF1B961228382FE18684249ED800AB1DCEAAD4" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x32) [0x155c5e2]
 mongod(_ZN5mongo15printStackTraceEv+0xDD) [0x155c40d]
 mongod(_ZN5mongo11DBException13traceIfNeededERKS0_+0x140) [0x14d8a50]
 mongod(_ZN5mongo6Socket15handleRecvErrorEii+0xEE6) [0x150e006]
 mongod(_ZN5mongo6Socket5_recvEPci+0x5B) [0x150e71b]
 mongod(_ZN5mongo6Socket11unsafe_recvEPci+0x11) [0x150e731]
 mongod(_ZN5mongo6Socket4recvEPci+0x3D) [0x150e78d]
 mongod(_ZN5mongo13MessagingPort4recvERNS_7MessageE+0x9D) [0x14ff75d]
 mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x2EE) [0x150242e]
 libpthread.so.0(+0x76DB) [0x7fdc06e0a6db]
 libc.so.6(clone+0x3F) [0x7fdc06b3388f]
==============================================================================

I have:

  1. ulimit set to 64000
  2. AWS EC2 instance … running Ubuntu 18.04
  3. Tried setting sysctl -w net.ipv4.tcp_keepalive_time=120 - No difference
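For reference, the keepalive change was applied roughly like this (the persistence step via /etc/sysctl.conf is just standard practice, not anything MongoDB-specific):

    # check the current value
    sysctl net.ipv4.tcp_keepalive_time

    # set it for the running kernel (lost on reboot)
    sudo sysctl -w net.ipv4.tcp_keepalive_time=120

    # persist it across reboots
    echo "net.ipv4.tcp_keepalive_time = 120" | sudo tee -a /etc/sysctl.conf
    sudo sysctl -p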

Any other suggestions gratefully received. I am at a loss as to what is causing this… is it a version incompatibility?

Many thx in advance

The upgrade procedure is to update the binary in place, not to add a new node. Have you tried that?

There is also likely a related log entry on one or all of the existing nodes. Take a look.

Also… nope, there aren’t related log entries on the other existing nodes… which admittedly is strange, but it leads me to think it may be a version conflict.

Sadly this is not a possibility… due to the way the live applications are written. I would have to call for a site outage to do that. So I am building the cluster to have new nodes alongside the old nodes, and getting the connection parameters (in many places) changed to point to the new nodes.

I should also say that the current nodes are hugely overprovisioned and need to go… which precludes upgrading in place.

One of the benefits of a replica set is that you can do these operations on the secondaries in a rolling fashion, so no site outage. I’ve done version upgrades and maintenance in this fashion many times.

The primary still has to be stepped down at some point for the upgrade anyway.
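Roughly, the rolling pattern is this (host names and the systemd service name are illustrative; adjust for your install):

    # on each secondary, one at a time
    sudo systemctl stop mongod            # take the node down cleanly
    # ...swap in the new mongod binary / package here...
    sudo systemctl start mongod

    # wait for the member to come back as SECONDARY before moving on
    mongo --eval 'rs.status().members.forEach(function (m) { print(m.name + " " + m.stateStr); })'

    # last of all, step the primary down and upgrade it the same way
    # (the shell may report a dropped connection when the primary steps down - that is expected)
    mongo --host <current-primary>:27017 --eval 'rs.stepDown()'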

I guess you could add right-sized 3.0 nodes and try to upgrade them.

I understand that… sorry… I have had a 50-node cluster before. This is the first time I have seen this. I am getting this error on one of the new (right-sized) nodes… and it’s baffling me.

The idea was to add the right-sized nodes with a new version… then change the app access IPs… then take out the old nodes.
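For the record, the add-new/remove-old part is essentially this from the mongo shell on the primary (host names are placeholders):

    rs.add("new-node-1:27017")      // bring a right-sized node in as a new secondary
    rs.status()                     // wait for it to finish initial sync and show SECONDARY
    // ...repoint the application connection strings at the new nodes...
    rs.remove("old-node-1:27017")   // then drop an old node from the set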

I guess I am going to have to try putting the old version onto the new node… see if that gets rid of it, then upgrade the new nodes once the app connects to them properly.

Thanks for the input though… appreciated.

NP. At least adding a 3.0 node will rule out the version incompatibility.

Also, an update here for posterity :slight_smile:

Ok… so for me… this is the scenario… and my take on it.

What we have is (and don’t laugh please… it’s not my setup):

  • 3 nodes running 3.0.14, inside Docker containers (I know… why would you do that).
  • 2 nodes on 3.2.22 (as part of my upgrade process).

We were getting that error ^ in the logs on the 3.2.22 nodes, but only when communicating with the 3.0.14 nodes. We did not get it between the 3.2.22 nodes.

Also, the 3.0.14 nodes did not report the error when communicating with the 3.2.22 nodes. They just reported connection opens and closes.

I have deduced from this that it is a version issue with the heartbeat. The errors happen every 30 seconds, like clockwork. Data gets through from primary to secondaries OK, so it is not affecting operations.

I tried installing the older version on the new nodes, but hit an issue as I have stopped automatic upgrades. I suspect that would have fixed it… but then, for me, I would have hit it again as part of a rolling update.
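In case anyone wants to try the same on Ubuntu, installing a specific pinned version of the mongodb-org packages looks something like this (assuming the matching apt repo is configured; exact version strings may differ):

    # release the hold if the packages were previously held
    sudo apt-mark unhold mongodb-org mongodb-org-server mongodb-org-shell mongodb-org-mongos mongodb-org-tools

    # install the explicit version
    sudo apt-get install -y mongodb-org=3.0.14 mongodb-org-server=3.0.14 \
        mongodb-org-shell=3.0.14 mongodb-org-mongos=3.0.14 mongodb-org-tools=3.0.14

    # hold them again so unattended upgrades don't move the version
    sudo apt-mark hold mongodb-org mongodb-org-server mongodb-org-shell mongodb-org-mongos mongodb-org-tools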

Consequently I am leaving it, as the old nodes will be disappearing soon.

It’s a version issue between 3.0.14 and 3.2.22… unless it crops up again.

I hope this helps someone, as it may not be limited to these versions… and also I am not 100% sure Docker isn’t playing a part in this either… but that’s going too.

A final(ish) update. I have managed to kill that message, in a bad way I think.

If you set systemLog.traceAllExceptions: false in the config file, it goes away. I believe this will also stop any other DBException backtraces as well… so it’s not a good solution, but it is the only one I have found.
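Concretely, that is this fragment in mongod.conf (merge it into your existing systemLog section and restart the node). I believe the equivalent runtime switch is the traceExceptions server parameter, but I have only verified the config-file route:

    systemLog:
      traceAllExceptions: false

    # runtime equivalent (unverified by me), from the mongo shell:
    # db.adminCommand({ setParameter: 1, traceExceptions: false })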

I am hoping this is fixed in a later version, or at least made ‘ignorable’, as stopping all DBExceptions from being logged seems like a bad move.

The 9001 socket error is purely a report of a closed connection, after which a new one starts. So it is not really an ‘exception’… but it is reported as such.

Maybe the MongoDB engineers could answer that… or I will report back here if it goes away in later versions.