Iterating through collection with hasNext() and next() stops at document ~565

Alban_Gerome · April 13, 2020, 12:28pm

Good morning,

I am new to MongoDB and new here, too. I have a collection of over 25 million documents. Each document is a simple JSON object with a numeric id, a URL and an integer that stores the status of that document. I was able to use a cursor and the forEach method but I want to test the URL and update the status if it’s an existing URL. It seems that I can’t stop a forEach loop before the whole collection has been processed. A while loop inside the forEach loop also gave me a “cursor not found” error at document #1000. Therefore I am using the hasNext() and next() cursor methods instead.

I created this function below:

// docCount is a counter declared outside this function.
function iterate(cursor, db, maxDocCount, callBack) {
  cursor.hasNext((error, result) => {
    if (error) throw error;
    if (result && docCount < maxDocCount) {
      cursor.next((error, url2) => {
        if (error) throw error;
        // Just displaying the URLs for now. Eventually this will test the URL,
        // update the status, and only iterate again once the document has been
        // updated (real URL) or skipped (non-existing URL).
        console.log(rowIndex, url2);
        docCount++;
        // Recurse with the same arguments so maxDocCount and callBack stay aligned.
        iterate(cursor, db, maxDocCount, callBack);
      });
    } else {
      callBack();
    }
  });
}

The callback function is this:

() => {
  db.close();
  console.log("done in " + ((new Date() - start) / 1000) + " sec(s)");
}

The script works fine at first but then it stops at around document #565. Sometimes it’s a few records earlier, sometimes a few later. The callback function never runs; the script just stops and returns to the command prompt. I have checked the MongoDB log file: it acknowledges the job but says nothing about why it stopped. I am on Windows 10 using Node in the command line.

I intend to use the request package to test the URLs. If the request returns a status code and no error, then I want to update the status of that MongoDB document from 0 to 1. If it’s not a valid URL, I want to skip it. Then I increment the docCount variable and process the next document.

Am I doing something fundamentally wrong with my code? Is there a better way?

Thanks,

Alban

Alban_Gerome · April 16, 2020, 1:04pm

I was able to get past that issue in the end by splitting the data set into “pages” of 10 documents. 10 is an arbitrary number, but by using limit() and skip() I was able to loop through the 10 documents on the current page, skip to the next page, and repeat until the forEach loop reaches the last document. I was very impressed with the speed, too. Now I need my script to check the URLs, but that’s another challenge.
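The paging approach described above could look roughly like the sketch below. It assumes `collection` is a Collection object from the official `mongodb` Node.js driver; `processInPages` and `handlePage` are hypothetical names introduced only for illustration, and sorting by `_id` is an added assumption to keep the page order stable.

```javascript
// Sketch only: `collection` is assumed to be a Collection from the official
// `mongodb` Node.js driver. `handlePage` is a hypothetical async callback
// that receives the documents of one page.
async function processInPages(collection, pageSize, handlePage) {
  let page = 0;
  for (;;) {
    const docs = await collection
      .find({})
      .sort({ _id: 1 })           // assumption: stable order so pages do not overlap
      .skip(page * pageSize)      // jump over the pages already handled
      .limit(pageSize)            // read at most one page
      .toArray();
    if (docs.length === 0) break; // no documents left
    await handlePage(docs, page); // wait for this page before fetching the next
    page++;
  }
  return page;                    // number of pages processed
}
```

Because each page is a separate short query, no cursor stays open while the documents are being handled. One caveat: skip() still has to walk past the skipped documents on the server, so with millions of documents a range filter such as `{ _id: { $gt: lastId } }` may scale better than a growing skip offset.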

Prasad_Saya · April 17, 2020, 5:22am

“I was able to use a cursor and the forEach method but I want to test the URL and update the status if it’s an existing URL”

In general, to update documents in a collection based on a condition, you use one of the update methods. All of these methods take a query filter that specifies the condition.

See Update Documents.
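As a rough illustration of an update with a query filter, here is a minimal sketch for the case in the question. It assumes `collection` is a Collection from the official `mongodb` Node.js driver and that the documents have `_id`, `url` and `status` fields (the field names are guesses based on the description). `markReachable` and `checkUrl` are hypothetical names; `checkUrl` stands in for whatever HTTP test is used.

```javascript
// Sketch only: `collection` is assumed to be a Collection from the official
// `mongodb` Node.js driver. `checkUrl` is a hypothetical async function that
// resolves to true when the URL responds. Field names are assumptions.
async function markReachable(collection, doc, checkUrl) {
  const ok = await checkUrl(doc.url);
  if (!ok) return false; // skip URLs that do not respond

  // The filter selects the document; $set changes only the status field.
  const result = await collection.updateOne(
    { _id: doc._id, status: 0 },
    { $set: { status: 1 } }
  );
  return result.modifiedCount === 1;
}
```

When the condition can be expressed entirely as a query, updateMany with the same kind of filter avoids iterating in application code. Here the URL check has to run outside the database, so each document is updated individually once its check completes.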