Any connection to mongoS hangs

Hi!

setup:

  • 3 mongos
  • 3 config server
  • 1 shard with 3 data nodes (I’m looking at sharding one collection during January, hence this setup)
  • 4.2.8 version

This setup was working fine for some time but suddenly any connection to any of the mongoS hangs (local and remote).

mongo --host localhost --port 27017 --username root --password
connecting to: mongodb://localhost:27017/?compressors=disabled&gssapiServiceName=mongodb

Below are some logs on mongoS on restart:

2021-01-04T12:18:29.355+0000 I  CONTROL  [main] ***** SERVER RESTARTED *****
(...)
2021-01-04T12:18:29.434+0000 I  SHARDING [mongosMain] mongos version v4.2.8-8
2021-01-04T12:18:29.434+0000 I  CONTROL  [mongosMain] db version v4.2.8-8
2021-01-04T12:18:29.434+0000 I  CONTROL  [mongosMain] git version: 389dde50b8368b026e41abeeedc4498c24e27fd6
2021-01-04T12:18:29.434+0000 I  CONTROL  [mongosMain] OpenSSL version: OpenSSL 1.1.1  11 Sep 2018
2021-01-04T12:18:29.434+0000 I  CONTROL  [mongosMain] allocator: tcmalloc
2021-01-04T12:18:29.434+0000 I  CONTROL  [mongosMain] modules: none
2021-01-04T12:18:29.434+0000 I  CONTROL  [mongosMain] build environment:
2021-01-04T12:18:29.434+0000 I  CONTROL  [mongosMain]     distarch: x86_64
2021-01-04T12:18:29.434+0000 I  CONTROL  [mongosMain]     target_arch: x86_64
2021-01-04T12:18:29.434+0000 I  CONTROL  [mongosMain] options: { config: "xxx", net: { bindIp: "0.0.0.0", maxIncomingConnections: 3000, port: 27017, processManagement: { fork: true, pidFilePath: "xxx" }, security: { clusterAuthMode: "keyFile", keyFile: "xxx" }, sharding: { configDB: "rscfg/config1:27019,config2:27019,config3:27019" }, systemLog: { destination: "file", logAppend: true, logRotate: "reopen", path: "/log/mongodb/mongos.log", quiet: true, verbosity: 0 } }
2021-01-04T12:18:29.435+0000 I  NETWORK  [mongosMain] Starting new replica set monitor for rscfg/config1:27019,config2:27019,config3:27019
2021-01-04T12:18:29.435+0000 I  CONNPOOL [ReplicaSetMonitor-TaskExecutor] Connecting to config3:27019
2021-01-04T12:18:29.435+0000 I  CONNPOOL [ReplicaSetMonitor-TaskExecutor] Connecting to config1:27019
2021-01-04T12:18:29.435+0000 I  CONNPOOL [ReplicaSetMonitor-TaskExecutor] Connecting to config2:27019
2021-01-04T12:18:29.436+0000 I  SHARDING [thread1] creating distributed lock ping thread for process mongos1:27017:1609762709:-6643752048932861528 (sleeping for 30000ms)
2021-01-04T12:18:29.450+0000 I  NETWORK  [ReplicaSetMonitor-TaskExecutor] Confirmed replica set for rscfg is rscfg/config1:27019,config2:27019,config3:27019
2021-01-04T12:18:29.450+0000 I  SHARDING [Sharding-Fixed-0] Updating sharding state with confirmed set rscfg/config1:27019,config2:27019,config3:27019
2021-01-04T12:18:29.494+0000 I  SHARDING [ShardRegistry] Received reply from config server node (unknown) indicating config server optime term has increased, previous optime { ts: Timestamp(0, 0), t: -1 }, now { ts: Timestamp(1609762708, 1), t: 1 }
2021-01-04T12:18:29.495+0000 I  NETWORK  [shard-registry-reload] Starting new replica set monitor for rs01/node1:27018,node2:27018,node3:27018
2021-01-04T12:18:29.517+0000 W  SHARDING [replSetDistLockPinger] pinging failed for distributed lock pinger :: caused by :: LockStateChangeFailed: findAndModify query predicate didn't match any lock document
2021-01-04T12:18:29.612+0000 I  CONNPOOL [ReplicaSetMonitor-TaskExecutor] Connecting to node2:27018
2021-01-04T12:18:29.612+0000 I  CONNPOOL [ReplicaSetMonitor-TaskExecutor] Connecting to node1:27018
2021-01-04T12:18:29.612+0000 I  CONNPOOL [ReplicaSetMonitor-TaskExecutor] Connecting to node3:27018
2021-01-04T12:18:29.692+0000 I  NETWORK  [ReplicaSetMonitor-TaskExecutor] Confirmed replica set for rs01 is rs01/node1:27018,node2:27018,node3:27018
2021-01-04T12:18:29.692+0000 I  SHARDING [UpdateReplicaSetOnConfigServer] Updating sharding state with confirmed set rs01/node1:27018,node2:27018,node3:27018
2021-01-04T12:18:31.495+0000 I  FTDC     [mongosMain] Initializing full-time diagnostic data capture with directory '/log/mongodb/mongos.diagnostic.data'
2021-01-04T12:18:31.497+0000 I  SH_REFR  [ConfigServerCatalogCacheLoader-0] Refresh for database config from version {} to version { uuid: UUID("b08a780c-0939-41ef-81de-8aee9e754776"), lastMod: 0 } took 0 ms
2021-01-04T12:18:31.497+0000 I  NETWORK  [listener] Listening on /tmp/mongodb-27017.sock
2021-01-04T12:18:31.497+0000 I  NETWORK  [listener] Listening on 0.0.0.0
2021-01-04T12:18:31.497+0000 I  NETWORK  [listener] waiting for connections on port 27017 ssl
2021-01-04T12:18:31.509+0000 I  SH_REFR  [ConfigServerCatalogCacheLoader-0] Refresh for collection config.system.sessions to version

Connecting directly to the data nodes works and couldn’t find any logs that could explain this behaviour in the config and data nodes.

Could you please help me understand what can be wrong here?

Thank you.

Seems this behaviour is described as a bug here: https://jira.mongodb.org/browse/SERVER-53540

Hi @Pedro_Albuquerque,

The symptoms you are describing sound similar to SERVER-53540 and SERVER-53337, but both are still open investigations that may have different causes (and have not been confirmed as bugs, yet). Proper investigation of your environment will require more details including diagnostic data.

I checked in with colleagues investigating the existing issues and confirmed you should open a new SERVER issue if you are able to share further details.

Regards,
Stennie