Connection to primary is extremely slow or times out

I just upgraded from 4.0.18 to 4.2.8 on a 3-member replica set running RHEL 7. Connections to each secondary complete quickly, as expected (subsecond). When I connect to the primary, the connection takes well over 10 minutes. Applications have a timeout limit, so they are definitely getting timeout errors. When I fail over the primary to a new server, the secondary that just became the primary now has the connection delay, while the old primary that just became a secondary connects quickly. The shell connection command is executed directly on the database server. Does anyone have any clues?


Hi @JamesT, welcome to the community.

This is a peculiar situation you're describing, and I'm not sure I've ever seen one like it. A typical connection attempt by any driver or the mongo shell would have given up long before the 10-minute mark.

Could you provide more details:

Connections to each secondary complete quickly, as expected (subsecond). When I connect to the primary, the connection takes well over 10 minutes.

How did you determine the timing? Could you post the logs from both the server and the client when your app tried to connect to the server?

The shell connection command is executed directly on the database server.

Could you elaborate on what you mean? Could you post the exact command you tried?

Please also post the output of rs.status() and rs.conf() during these extra-long connection attempts, so that your replica set topology and state can be determined.
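
If it's easier, you can capture both non-interactively and sanitize the files before posting. A rough sketch (the hostname and credentials below are placeholders, not your actual values):

mongo --host yourHost:27017 --username yourUser --password 'xxx' --authenticationDatabase admin --quiet --eval "printjson(rs.status())" > rs_status.json
mongo --host yourHost:27017 --username yourUser --password 'xxx' --authenticationDatabase admin --quiet --eval "printjson(rs.conf())" > rs_conf.json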

It would also be helpful if you could tell us how you installed MongoDB.

Best regards,
Kevin


I downgraded all set members to 4.0.18, all activity returned to normal, and the time to connect dropped back to subsecond. I have many other environments that were upgraded in the same manner, including Ops Manager sets, and none of them experienced this problem. This one, however, is my most active, although v4.0.18 could handle the traffic.

I cannot post the logs for security reasons.
MongoDB was a manual RPM install via yum.

My longest connection attempt was from Compass at roughly 18 minutes; the entry below is from mongod.log and shows the saslStart command taking 1086333 ms (about 18 minutes).

2020-08-10T15:23:42.596-0400 I COMMAND [conn2686] command $external.$cmd appName: "MongoDB Compass" command: saslStart { saslStart: 1, mechanism: "PLAIN", payload: "xxx", autoAuthorize: 1, $db: "$external" } numYields:0 reslen:219 locks:{} protocol:op_query 1086333ms

When I run the shell command, I try it both logged directly into the server via a PuTTY session and from a Windows laptop command window (that one has the 4.2.6 shell). Below is one of the connection attempts that hangs. I'm using LDAP for authentication, but LDAP doesn't appear to be a contributor here, since connecting to a secondary also has to authenticate and is still fast.

mongo --host rsSet1/server8:27017,server9:27017,server10:27017 --authenticationMechanism 'PLAIN' --authenticationDatabase '$external' --ssl --username userA --password passwordA
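
For reference, connecting to a single member directly (dropping the replica set name, so no topology discovery happens) can be done like this, using the same hosts and credentials as above:

mongo --host server8:27017 --authenticationMechanism 'PLAIN' --authenticationDatabase '$external' --ssl --username userA --password passwordA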

I don't have any rs.status() output from the time of the issue, as I was running it in the shell and didn't save it to output files, but it looked pretty much as it does on any normal day: all replica set members were in a normal state and communicating successfully with each other, heartbeats and pings were good, and there were no infoMessages. What would you be looking for within it? Just curious for future reference.

rs.conf()
{
        "_id" : "rsSet1",
        "version" : 3,
        "protocolVersion" : NumberLong(1),
        "writeConcernMajorityJournalDefault" : true,
        "members" : [
                {
                        "_id" : 0,
                        "host" : "server8:27017",
                        "arbiterOnly" : false,
                        "buildIndexes" : true,
                        "hidden" : false,
                        "priority" : 1,
                        "tags" : {

                        },
                        "slaveDelay" : NumberLong(0),
                        "votes" : 1
                },
                {
                        "_id" : 1,
                        "host" : "server9:27017",
                        "arbiterOnly" : false,
                        "buildIndexes" : true,
                        "hidden" : false,
                        "priority" : 1,
                        "tags" : {

                        },
                        "slaveDelay" : NumberLong(0),
                        "votes" : 1
                },
                {
                        "_id" : 2,
                        "host" : "server10:27017",
                        "arbiterOnly" : false,
                        "buildIndexes" : true,
                        "hidden" : false,
                        "priority" : 1,
                        "tags" : {

                        },
                        "slaveDelay" : NumberLong(0),
                        "votes" : 1
                }
        ],
        "settings" : {
                "chainingAllowed" : true,
                "heartbeatIntervalMillis" : 2000,
                "heartbeatTimeoutSecs" : 10,
                "electionTimeoutMillis" : 10000,
                "catchUpTimeoutMillis" : -1,
                "catchUpTakeoverDelayMillis" : 30000,
                "getLastErrorModes" : {

                },
                "getLastErrorDefaults" : {
                        "w" : 1,
                        "wtimeout" : 0
                },
                "replicaSetId" : ObjectId("5bb60307b0f85f3386121cdf")
        }
}

Hi @JamesT

It appears that the connection process may be stalling while trying to get a reply from the LDAP server. MongoDB currently uses libldap and defers the whole LDAP auth process to that library, so either something is amiss with how the LDAP setup interacts with the primary mongod, or something else is going on there. It is curious, though, that the older MongoDB version doesn't seem to experience this. Note that MongoDB 4.4.0 was just released and might be worth trying as well.
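
One quick way to check this would be to time a simple LDAP bind from the primary's host, completely outside of MongoDB. A rough sketch with ldapsearch (the LDAP URI, bind DN, and base DN below are placeholders, not taken from your setup):

time ldapsearch -x -H ldap://your-ldap-server -D "uid=userA,ou=people,dc=example,dc=com" -w 'passwordA' -b "dc=example,dc=com" -s base "(objectClass=*)"

If that bind is also slow from the primary's host but fast from the secondaries' hosts, the delay is likely between that host and the LDAP server rather than inside mongod. MongoDB Enterprise also ships the mongoldap utility for validating the LDAP configuration, which might help narrow this down.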

Having said that, LDAP connectivity is an Enterprise-only feature, so if you keep having this issue I would recommend opening a support case, especially if this is a production environment. Investigating it may require a deeper dive into your specific setup.

Best regards,
Kevin