MongoDB 4.4 upgrade: setFeatureCompatibilityVersion: "4.4" is stuck due to ConflictingOperationInProgress

Hi,

I'm a long-time MongoDB user, and this is the first time I'm stuck on a weird issue…

I have a cluster with 5 shards, as follows:

        {  "_id" : "reports-z1-0",  "host" : "reports-z1-0/mongodb-shard-reports-0-0.mongodb-shard-reports-0.default:27018,mongodb-shard-reports-0-1.mongodb-shard-reports-0.default:27018,mongodb-shard-reports-0-2.mongodb-shard-reports-0.default:27018",  "state" : 1,  "tags" : [ "z1" ] }
        {  "_id" : "reports-z1-1",  "host" : "reports-z1-1/mongodb-shard-reports-1-0.mongodb-shard-reports-1.default:27018,mongodb-shard-reports-1-1.mongodb-shard-reports-1.default:27018,mongodb-shard-reports-1-2.mongodb-shard-reports-1.default:27018",  "state" : 1,  "tags" : [ "z1" ] }
        {  "_id" : "shard-z0-0",  "host" : "shard-z0-0/mongodb-shard-data-0-0.mongodb-shard-data-0.default:27018,mongodb-shard-data-0-1.mongodb-shard-data-0.default:27018,mongodb-shard-data-0-2.mongodb-shard-data-0.default:27018",  "state" : 1,  "tags" : [ "z0" ] }
        {  "_id" : "shard-z0-1",  "host" : "shard-z0-1/mongodb-shard-data-1-0.mongodb-shard-data-1.default:27018,mongodb-shard-data-1-1.mongodb-shard-data-1.default:27018,mongodb-shard-data-1-2.mongodb-shard-data-1.default:27018",  "state" : 1,  "tags" : [ "z0" ] }
        {  "_id" : "shard-z0-2",  "host" : "shard-z0-2/mongodb-shard-data-2-0.mongodb-shard-data-2.default:27018,mongodb-shard-data-2-1.mongodb-shard-data-2.default:27018,mongodb-shard-data-2-2.mongodb-shard-data-2.default:27018",  "state" : 1,  "tags" : [ "z0" ] }

I upgraded from 4.2 to the latest release, 4.4.2.

When I set the feature compatibility version with

db.adminCommand( { setFeatureCompatibilityVersion: "4.4" } )

the command returns:

{
	"operationTime" : Timestamp(1607534165, 15),
	"ok" : 0,
	"errmsg" : "No chunks were found for the collection",
	"code" : 117,
	"codeName" : "ConflictingOperationInProgress",
	"$gleStats" : {
		"lastOpTime" : {
			"ts" : Timestamp(1607534165, 15),
			"t" : NumberLong(79)
		},
		"electionId" : ObjectId("7fffffff000000000000004f")
	},
	"lastCommittedOpTime" : Timestamp(1607534165, 15),
	"$configServerState" : {
		"opTime" : {
			"ts" : Timestamp(1607534165, 4),
			"t" : NumberLong(73)
		}
	},
	"$clusterTime" : {
		"clusterTime" : Timestamp(1607534165, 15),
		"signature" : {
			"hash" : BinData(0,"nQDtxjtK93fj0wnyN8Wy19Phb9U="),
			"keyId" : NumberLong("6860912798809980930")
		}
	}
}

I checked all nodes by hand, and some of them show:

The server generated these startup warnings when booting:
        2020-12-09T17:13:20.379+00:00: A featureCompatibilityVersion upgrade did not complete. To fix this, use the setFeatureCompatibilityVersion command to resume upgrade to 4.4
        2020-12-09T17:13:20.379+00:00:         currentfeatureCompatibilityVersion: upgrading to 4.4

and

db.adminCommand( { getParameter: 1, featureCompatibilityVersion: 1 } )
{
	"featureCompatibilityVersion" : {
		"version" : "4.2",
		"targetVersion" : "4.4"
	},
	"ok" : 1,
	"$gleStats" : {
		"lastOpTime" : Timestamp(0, 0),
		"electionId" : ObjectId("7fffffff0000000000000049")
	},
	"lastCommittedOpTime" : Timestamp(1607534584, 1),
	"$clusterTime" : {
		"clusterTime" : Timestamp(1607534584, 1),
		"signature" : {
			"hash" : BinData(0,"MfD/ygGU2rnKml3T/d91iImtIdk="),
			"keyId" : NumberLong("6860912798809980930")
		}
	},
	"operationTime" : Timestamp(1607534584, 1)
}

The config server and shard-z0-0 are stuck in that state.

shard-z0-1 and shard-z0-2 still show the FCV set to 4.2.

reports-z1-0 and reports-z1-1 show the correct FCV of 4.4.
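
To make the mixed state easier to read at a glance, here is a rough sketch that groups the replica sets by FCV status. The sub-documents are transcribed from my description above (only one raw getParameter output is pasted in this post), and the names "config", classifyFcv and summarize are just my own labels for this illustration. In practice each sub-document would come from running the getParameter command shown earlier on each replica set.

```javascript
// "featureCompatibilityVersion" sub-documents per replica set, transcribed
// from the description in this post. "config" is my own label for the
// config server replica set.
const fcvByReplicaSet = {
  "config": { version: "4.2", targetVersion: "4.4" },
  "shard-z0-0": { version: "4.2", targetVersion: "4.4" },
  "shard-z0-1": { version: "4.2" },
  "shard-z0-2": { version: "4.2" },
  "reports-z1-0": { version: "4.4" },
  "reports-z1-1": { version: "4.4" },
};

// Hypothetical helper: label one sub-document relative to the wanted FCV.
// A "targetVersion" field means a transition has started but not finished.
function classifyFcv(fcv, wanted) {
  if (fcv.targetVersion !== undefined) {
    return "transition to " + fcv.targetVersion + " in progress";
  }
  return fcv.version === wanted ? "done" : "not started";
}

// Hypothetical helper: group replica set names by their label.
function summarize(states, wanted) {
  const summary = {};
  for (const [name, fcv] of Object.entries(states)) {
    const label = classifyFcv(fcv, wanted);
    (summary[label] = summary[label] || []).push(name);
  }
  return summary;
}

console.log(summarize(fcvByReplicaSet, "4.4"));
```

So out of six replica sets, two are mid-transition, two have not started, and two are done.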

I tried restarting everything, but it doesn't help.

Because of that, all chunk splits are failing with:

"ConflictingOperationInProgress: Chunks cannot be split while a feature compatibility version upgrade or downgrade is in progress"

I also spotted this message in the logs on the data nodes:

"ctx":"initandlisten","msg":"A featureCompatibilityVersion upgrade did not complete. To fix this, use the setFeatureCompatibilityVersion command to resume upgrade to 4.4","attr":{"currentfeatureCompatibilityVersion":"upgrading to 4.4"},"tags":["startupWarnings"]}

How can I recover from that state?

Thanks

Adding some logs from a data node:

{"t":{"$date":"2020-12-09T20:59:13.160+00:00"},"s":"I",  "c":"SH_REFR",  "id":24103,   "ctx":"ConfigServerCatalogCacheLoader-14","msg":"Error refreshing cached collection","attr":{"namespace":"config.system.sessions","durationMillis":1,"error":"ConflictingOperationInProgress: No chunks were found for the collection"}}
{"t":{"$date":"2020-12-09T20:59:13.162+00:00"},"s":"I",  "c":"SH_REFR",  "id":24103,   "ctx":"ConfigServerCatalogCacheLoader-14","msg":"Error refreshing cached collection","attr":{"namespace":"config.system.sessions","durationMillis":1,"error":"ConflictingOperationInProgress: No chunks were found for the collection"}}
{"t":{"$date":"2020-12-09T20:59:13.164+00:00"},"s":"I",  "c":"SH_REFR",  "id":24103,   "ctx":"ConfigServerCatalogCacheLoader-14","msg":"Error refreshing cached collection","attr":{"namespace":"config.system.sessions","durationMillis":1,"error":"ConflictingOperationInProgress: No chunks were found for the collection"}}
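
The same entry repeats many times, so here is a small sketch (plain JavaScript, runnable with Node.js) that counts the refresh errors per namespace and error message. The function name countRefreshErrors and the grouping key are my own choices for illustration; the sample lines are the three log entries above, and it assumes one JSON document per log line.

```javascript
// The three log lines pasted above, one JSON document per line.
const sampleLines = [
  '{"t":{"$date":"2020-12-09T20:59:13.160+00:00"},"s":"I",  "c":"SH_REFR",  "id":24103,   "ctx":"ConfigServerCatalogCacheLoader-14","msg":"Error refreshing cached collection","attr":{"namespace":"config.system.sessions","durationMillis":1,"error":"ConflictingOperationInProgress: No chunks were found for the collection"}}',
  '{"t":{"$date":"2020-12-09T20:59:13.162+00:00"},"s":"I",  "c":"SH_REFR",  "id":24103,   "ctx":"ConfigServerCatalogCacheLoader-14","msg":"Error refreshing cached collection","attr":{"namespace":"config.system.sessions","durationMillis":1,"error":"ConflictingOperationInProgress: No chunks were found for the collection"}}',
  '{"t":{"$date":"2020-12-09T20:59:13.164+00:00"},"s":"I",  "c":"SH_REFR",  "id":24103,   "ctx":"ConfigServerCatalogCacheLoader-14","msg":"Error refreshing cached collection","attr":{"namespace":"config.system.sessions","durationMillis":1,"error":"ConflictingOperationInProgress: No chunks were found for the collection"}}',
];

// Hypothetical helper: count SH_REFR entries that carry an error,
// grouped by "namespace | error". Lines that are not valid JSON are skipped.
function countRefreshErrors(lines) {
  const counts = {};
  for (const line of lines) {
    let entry;
    try {
      entry = JSON.parse(line);
    } catch (e) {
      continue; // not a structured log line
    }
    if (entry.c !== "SH_REFR" || !entry.attr || !entry.attr.error) continue;
    const key = entry.attr.namespace + " | " + entry.attr.error;
    counts[key] = (counts[key] || 0) + 1;
  }
  return counts;
}

console.log(countRefreshErrors(sampleLines));
```

In the excerpt above, every failure is for config.system.sessions with the same "No chunks were found for the collection" error.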