Is it faster to stream the data and use deleteOne or use deleteMany

I have a collection with 100 million documents. I want to delete every document where the guildID isn’t in an array (the array has 200k elements). I’m using mongoose.

Would it be faster to:

  1. Stream all the data from the collection (using mongoose). Once I recieve each document I check if the guildID is in the array. If it’s not in the array I use deleteOne().

  2. Use deleteMany with $nin. I’m currently using this method and it’s taking 10 minutes to delete 1 million documents.

I would guess that deleteMany is much faster.

Both scenarios assume the deletion of the 10 million documents that matches the delete criteria. This is to simplify the understanding.

Scenario 1 - stream

  • you send one query to retrieve all docs
  • the server fetches 100M docs from permanent storage
  • the server sends 100M docs over to client
  • the client sends 10M delete commands to the server
  • the server performs 10M delete
  • you could improve this by streaming only _id and guildID

Scenario 2 - deleteMany

  • you send one command to deleteMany with an array of 200k elements (ouch)
  • the server also fetches the 100M docs but mark them as deleted right away
  • the main problem is that I do not think the index on guildID, if any, can be used because the query is $nin

That’s a very interesting problem. Please do follow up on your findings.

My question is what is the source of the array of 200k elements? Is it another collection in the same database?

I would try to find a way to use $in rather than $nin by getting the guildID that needs to be deleted rather than using the guildID that needs to be kept. Specially if an index on guildID exists.