Slow query causing 100% CPU spike

Hello there, I would highly appreciate the community's advice.
I'm running MongoDB Atlas version 3.6.21 and using an open-source package called agenda that uses MongoDB as its backend. Once I upgraded the package (v2.0.2 to v4.0.1), the CPU spiked to 100%. Using the Atlas profiler, I identified a possible 'bad' query causing the issue, but I can't figure out what's wrong.

The 'bad' query is a findAndModify that runs the following:

db.getCollection('agendaJobs').findAndModify({
    "query": {
        $or: [
            {
                name: 'webhook',
                disabled: { $ne: true },
                lockedAt: { $eq: null },
                nextRunAt: { $lte: new Date() },
            },
            {
                name: 'webhook',
                disabled: { $ne: true },
                lockedAt: { $lte: new Date() },
            },
        ],
    },
    "sort": { "nextRunAt": 1, "priority": -1 },
    "update": { $set: { "lockedAt": new Date() } }
})

From the profiler, these are the most suspicious fields I see:

  "millis": 471830,
  "protocol": "op_msg",
  "locks": {
    "Global": {
      "acquireCount": {
        "r": 146620,
        "w": 146618
      }
    }
  },
  "keysExamined": 405712,
  "docsExamined": 2,
  "nMatched": 1,
  "nModified": 1,
  "keysInserted": 1,
  "keysDeleted": 1,
  "writeConflicts": 14,
  "numYields": 146600,
  "planSummary": [
    {
      "IXSCAN": {
        "name": 1,
        "nextRunAt": 1,
        "priority": -1,
        "lockedAt": 1,
        "disabled": 1
      }
    },
    {
      "IXSCAN": {
        "name": 1,
        "nextRunAt": 1,
        "priority": -1,
        "lockedAt": 1,
        "disabled": 1
      }
    }
  ],

Is there anything I'm missing? Maybe I should create an additional index? How would you suggest continuing the investigation?

I would highly appreciate your support and opinion!

Tal

Hi @Talik,

Welcome to MongoDB community.

It's not easy to say definitively whether this query alone is the root cause of the entire CPU spike.

Looking at the explain plan, I see two problems:

  1. The number of keys examined is huge compared to the number of documents updated and returned.
  2. The query spends a long time waiting on the global lock.

With the way the query is currently written, I don't think there is much index tuning to be done.

But could you possibly rewrite it the following way:

db.getCollection('agendaJobs').findAndModify({
    "query": {
        name: 'webhook',
        disabled: false,
        $or: [
            {
                lockedAt: { $eq: null },
                nextRunAt: { $lte: new Date() },
            },
            {
                lockedAt: { $lte: new Date() },
            },
        ],
    },
    "sort": { "nextRunAt": 1, "priority": -1 },
    "update": { $set: { "lockedAt": new Date() } }
})

Then you could possibly index the following way:

db.agendaJobs.createIndex({ name: 1, nextRunAt: 1, priority: -1, lockedAt: 1 }, { partialFilterExpression: { disabled: false } })

Maybe this will allow more efficient key scanning and avoid the index intersection, which is very CPU intensive.
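To verify that the rewritten query actually picks the new partial index, you could inspect the winning plan and the keysExamined/nReturned ratio with explain. This is just a sketch mirroring the rewritten query above:

```javascript
// Check which index the planner chooses for the rewritten filter.
// Look for a single IXSCAN (no AND_SORTED/OR index intersection) and a
// keysExamined value close to nReturned in executionStats.
db.agendaJobs.find({
    name: 'webhook',
    disabled: false,
    $or: [
        { lockedAt: { $eq: null }, nextRunAt: { $lte: new Date() } },
        { lockedAt: { $lte: new Date() } }
    ]
}).sort({ nextRunAt: 1, priority: -1 }).explain("executionStats")
```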

Thanks,
Pavel

@Pavel_Duchovny thank you for your quick reply!
I've tried your suggestion, but I experience the same CPU issue as before.
This is the query I run:

db.getCollection('agendaJobs').findOneAndUpdate({
    "query": {
        $and: [{
            name: jobName,
            disabled: { $ne: true }
        }, {
            $or: [{
                lockedAt: { $eq: null },
                nextRunAt: { $lte: this._nextScanAt }
            }, {
                lockedAt: { $lte: lockDeadline }
            }]
        }]
    },
    "sort": { "nextRunAt": 1, "priority": -1 },
    "update": { $set: { "lockedAt": new Date() } }
})

The sort and update are the same as before.
I forgot to mention it, but I already have the following index in place:

    {
        "v" : 2,
        "key" : {
            "name" : 1,
            "nextRunAt" : 1,
            "priority" : -1,
            "lockedAt" : 1,
            "disabled" : 1
        },
        "name" : "findAndLockNextJobIndex",
        "ns" : "tick-tock-prod.agendaJobs"
 },

I'm a bit concerned about the writeConflicts field. Could it be that the find part of findOneAndUpdate runs efficiently, but the update part then tries to modify the same document concurrently? The very high acquireCount looks suspicious as well.

Just to note that I'm running two Node processes that read from the same DB.
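For reference, this is how I've been watching for the two processes colliding on the collection while the spike happens. A minimal sketch using db.currentOp; the 5-second threshold is an arbitrary choice for this investigation:

```javascript
// List long-running operations on the jobs collection. If both Node
// processes show up here with the same findAndModify at the same time,
// that supports the write-conflict theory.
db.currentOp({
    "ns": "tick-tock-prod.agendaJobs",
    "secs_running": { $gte: 5 }
})
```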

Hi @Talik,

This is not exactly what I recommended.

I recommended querying disabled: false rather than disabled: { $ne: true }. Is it because documents that are not disabled do not have this field?

How high are the write conflicts? In the previous example it was 14…

If you are concurrently hitting the same docs, you may increase locking and CPU overhead.
But your main problem is the way the query is written: the index intersection does not perform well.

I would consider changing the data model to allow more selective queries and updates (use $ne and $or as little as possible).

Thanks
Pavel

@Pavel_Duchovny sorry for the delay in response :pray:
I have been trying multiple approaches, but unfortunately nothing has worked yet.
I'm using a library called agenda that manages the DB interaction and queries, so I can't implement your suggestion of disabled: false instead of disabled: { $ne: true }: documents that are not disabled do not have this field.

The latest version of agenda generates the following query:

db.getCollection('agendaJobs').findOneAndUpdate({
    "query": {
        $and: [{
            name: jobName,
            disabled: { $ne: true }
        }, {
            $or: [{
                lockedAt: { $eq: null },
                nextRunAt: { $lte: this._nextScanAt }
            }, {
                lockedAt: { $lte: lockDeadline }
            }]
        }]
    },
    "sort": { "nextRunAt": 1, "priority": -1 },
    "update": { $set: { "lockedAt": new Date() } }
})

Before, it was different:

{
      $or: [{
        name: jobName,
        lockedAt: null,
        nextRunAt: {$lte: new Date()},
        disabled: {$ne: true}
      }, {
        name: jobName,
        lockedAt: {$exists: false},
        nextRunAt: {$lte: new Date()},
        disabled: {$ne: true}
      }, {
        name: jobName,
        lockedAt: {$lte: lockDeadline},
        disabled: {$ne: true}
      }]
    };

I'm thinking maybe we should reorganize the fields in the index

from

{
    "name" : 1,
    "nextRunAt" : 1,
    "priority" : -1,
    "lockedAt" : 1,
    "disabled" : 1
}

to

{
    name: 1,
    disabled: 1,
    lockedAt: 1,
    nextRunAt: 1,
    priority: -1
}

WDYT?
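Concretely, I would build the reordered index alongside the existing one and only drop the old one after confirming the plan improves. A sketch; the "_v2" index name is my own choice:

```javascript
// Build the reordered index first, so reads are never left without an index.
db.agendaJobs.createIndex(
    { name: 1, disabled: 1, lockedAt: 1, nextRunAt: 1, priority: -1 },
    { name: "findAndLockNextJobIndex_v2" }
)

// Once explain() shows the new index winning, drop the old one.
db.agendaJobs.dropIndex("findAndLockNextJobIndex")
```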

Regarding your other question, I've noticed that the bad performance starts when writeConflicts is higher than 5. When it was 25, for example, the query took more than 15s to finish.
Sharing the execution stats here; I hope it will be useful.

 "planSummary": [
    {
      "IXSCAN": {
        "name": 1,
        "nextRunAt": 1,
        "priority": -1,
        "lockedAt": 1,
        "disabled": 1
      }
    },
    {
      "IXSCAN": {
        "name": 1,
        "nextRunAt": 1,
        "priority": -1,
        "lockedAt": 1,
        "disabled": 1
      }
    }
  ],
  "keysExamined": 408017,
  "docsExamined": 2,
  "nMatched": 1,
  "nModified": 1,
  "keysInserted": 1,
  "keysDeleted": 1,
  "writeConflicts": 25,
  "numYields": 84871,
  "reslen": 880,
  "locks": {
    "Global": {
      "acquireCount": {
        "r": 84900,
        "w": 84900
      }
    },
    "Database": {
      "acquireCount": {
        "w": 84900
      }
    },
    "Collection": {
      "acquireCount": {
        "w": 84898
      }
    },
    "oplog": {
      "acquireCount": {
        "w": 2
      }
    }
  },
  "protocol": "op_msg",
  "millis": 153063

I highly appreciate your help! Thank you!

Hi @Talik,

If you cannot rewrite the queries, my ability to help is very limited.

The proposed index might work.

Write conflicts can cause CPU exhaustion, as the engine must retry each conflicting operation and keep all the in-flight variations in memory.

Avoiding the conflicts is done by rewriting the queries or the data model so that operations do not repeatedly contend for a lock on the same document.
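On the application side, one way to reduce how often your two Node processes fight over the same due job is to stagger and throttle agenda's polling. A sketch only; processEvery, maxConcurrency, and defaultLockLifetime are agenda's documented constructor options, but the values and the WORKER_ID environment variable are illustrative assumptions:

```javascript
const Agenda = require('agenda');

// Give each process a slightly different polling interval so the two
// workers do not wake up at the same moment and race for the same job.
const agenda = new Agenda({
    db: { address: 'mongodb://<your-cluster>/tick-tock-prod', collection: 'agendaJobs' },
    processEvery: process.env.WORKER_ID === '1' ? '10 seconds' : '13 seconds',
    maxConcurrency: 10,                  // cap concurrent jobs per process
    defaultLockLifetime: 10 * 60 * 1000  // 10 minutes (agenda's default)
});
```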

Thanks
Pavel