Aggregation Pipeline: Applying Benford's Law to COVID-19 Data

John Page, Maxime Beugnet • 16 min read • Published Jan 07, 2022 • Updated Jan 26, 2023
MongoDB • Aggregation Framework

Introduction

In this blog post, I will show you how I built an aggregation pipeline to apply Benford's law to the COVID-19 data set that we have made available in the following cluster:
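For reference, the read-only connection string for this cluster (please double-check it in the blog post linked just below before relying on it) should be:

```
mongodb+srv://readonly:readonly@covid-19.hip2i.mongodb.net/covid19
```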
If you want to know more about this cluster and how we transformed the CSV files from Johns Hopkins University's repository into clean MongoDB documents, check out this blog post.
Finally, based on this pipeline, I was able to produce a dashboard in MongoDB Charts, including one chart that applies Benford's law to the worldwide daily cases of COVID-19.
Disclaimer: This article focuses on the aggregation pipeline and the stages I used to produce the data behind these charts, not so much on the results themselves, which can be interpreted in many different ways. One of the many issues here is the lack of data. The pandemic didn't start at the same time in all the countries, so many countries don't have enough data to make the percentages accurate. But feel free to interpret these results the way you want...

Prerequisites

This blog post assumes that you already know the main principles of the aggregation pipeline and you are already familiar with the most common stages.
If you want to follow along, feel free to use the cluster mentioned above or take a copy using mongodump or mongoexport, but the main takeaway from this blog post is the techniques I used to produce the output I wanted.
Also, I can't recommend enough that you use the aggregation pipeline builder in MongoDB Atlas or Compass to build your pipelines and play with the ones you will see in this blog post.
All the code is available in this repository.

What is Benford's Law?

Before we go any further, let me tell you a bit more about Benford's law. What does Wikipedia say?
Benford's law [...] is an observation about the frequency distribution of leading digits in many real-life sets of numerical data. The law states that in many naturally occurring collections of numbers, the leading digit is likely to be small. In sets that obey the law, the number 1 appears as the leading significant digit about 30% of the time, while 9 appears as the leading significant digit less than 5% of the time. If the digits were distributed uniformly, they would each occur about 11.1% of the time. Benford's law also makes predictions about the distribution of second digits, third digits, digit combinations, and so on.
Here is the frequency distribution of the first digits that we can expect for a data set that respects Benford's law:
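Under Benford's law, the probability that the leading digit is d is P(d) = log10(1 + 1/d), which gives these expected frequencies:

| Leading digit | Expected frequency |
| --- | --- |
| 1 | 30.1% |
| 2 | 17.6% |
| 3 | 12.5% |
| 4 | 9.7% |
| 5 | 7.9% |
| 6 | 6.7% |
| 7 | 5.8% |
| 8 | 5.1% |
| 9 | 4.6% |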
A little further down in Wikipedia's article, in the "Applications" section, you can also read the following:
Accounting fraud detection
In 1972, Hal Varian suggested that the law could be used to detect possible fraud in lists of socio-economic data submitted in support of public planning decisions. Based on the plausible assumption that people who fabricate figures tend to distribute their digits fairly uniformly, a simple comparison of first-digit frequency distribution from the data with the expected distribution according to Benford's law ought to show up any anomalous results.
Simply put, if your data set follows Benford's law, then it's theoretically possible to detect fraudulent data when a particular subset of the data doesn't follow the law.
In our situation, based on the observation of the first chart above, it looks like the worldwide daily confirmed cases of COVID-19 are following Benford's law. But is it true for each country?
If I want to answer this question (I don't), I will have to build a relatively complex aggregation pipeline (I do 😄).

The Data Set

I will only focus on a single collection in this blog post: covid19.countries_summary.
As its name suggests, it's a collection that I built (also using an aggregation pipeline) that contains a daily document for each country in the data set.
Here is an example:
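A sketch of such a document (the values are illustrative, and the real documents carry a few more fields, such as the cumulative counts):

```js
{
  "_id": ObjectId("60e31a0bd3c3ae7da2a84b31"),
  "country": "Japan",
  "date": ISODate("2021-04-29T00:00:00Z"),
  "confirmed_daily": 5914,
  "deaths_daily": 63
}
```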
As you can see, for each day and country, I have daily counts of the COVID-19 confirmed cases and deaths.

The Aggregation Pipeline

Let's apply Benford's law to these two series of numbers.

The Final Documents

Before we start applying stages (transformations) to our documents, let's define the shape of the final documents, which will make them easy to plot in MongoDB Charts.
This clearly defines where we start (the document in the previous section) and where we are going:
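Here is a sketch of that target shape (the 22.3 will show up again later in this post; the other values are illustrative):

```js
{
  "country": "US",
  "benford": [
    { "digit": 1, "confirmed": 22.3, "deaths": 36.1 },
    { "digit": 2, "confirmed": 21.1, "deaths": 14.4 },
    // ... one entry per digit ...
    { "digit": 9, "confirmed": 3.6, "deaths": 4.6 }
  ]
}
```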
Setting the final objective keeps us focused on the target while doing our successive transformations.

The Pipeline in English

Now that we have a starting and an ending point, let's try to write our pipeline in English first:
  1. For each country, gather the first digits of each daily count into one array for the confirmed cases and another for the deaths.
  2. Clean the arrays (remove zeros and negative numbers—see note below).
  3. Calculate the size of these arrays.
  4. Remove countries with empty arrays (countries without cases or deaths).
  5. Calculate the percentages of 1s, 2s, ..., 9s in each array.
  6. Add a fake country "BenfordTheory" with the theoretical values of 1s, 2s, etc. we are supposed to find.
  7. Final projection to get the document in the final shape I want.
Note: The daily fields that I provide in this collection covid19.countries_summary are computed from the cumulative counts that Johns Hopkins University (JHU) provides. Simply put: today's count, for each country, is today's cumulative count minus yesterday's cumulative count. In theory, I should have zeros (no deaths or no cases that day), but never negative numbers. But sometimes, JHU applies corrections to the counts without applying them retroactively (as these counts were official at some point in time, I guess). So, negative values exist, and I chose to ignore them in this pipeline.
Now that we have a plan, let's execute it. Each of the points in the above list is an aggregation pipeline stage, and now we "just" have to translate them.

Stage 1: Arrays of Leading Digits

First, I need to be able to extract the first character of $confirmed_daily, which is an integer.
MongoDB provides a $substrCP operator which we can use, once we transform this integer into a string. That is easy to do with the $toString operator.
Then, I apply this transformation to each document, group them by country ($group), and collect the results into an array using $push.
Here is the first stage:
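Sketched here with $substrCP, which takes the first code point of the string:

```js
{
  "$group": {
    "_id": "$country",
    // first digit of each daily confirmed count
    "confirmed": {
      "$push": { "$substrCP": [ { "$toString": "$confirmed_daily" }, 0, 1 ] }
    },
    // first digit of each daily deaths count
    "deaths": {
      "$push": { "$substrCP": [ { "$toString": "$deaths_daily" }, 0, 1 ] }
    }
  }
}
```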
Here is the shape of my documents at this point if I apply this transformation:
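Roughly this (digits illustrative):

```js
{
  "_id": "Japan",
  "confirmed": [ "1", "3", "5", "2", "-", "0", "1" /* ... */ ],
  "deaths": [ "1", "0", "2", "1" /* ... */ ]
}
```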

Stage 2: Clean the Arrays

As mentioned above, my arrays might contain zeros and "-", which is the leading character of a negative number. I decided to ignore these for my little mathematical experimentation.
If I now translate "clean the arrays" into something more "computer-friendly," what I actually want to do is "filter the arrays." We can leverage the $filter operator and overwrite our existing arrays with their filtered versions without zeros and dashes by using the $addFields stage.
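A sketch of that stage:

```js
{
  "$addFields": {
    "confirmed": {
      "$filter": {
        "input": "$confirmed",
        "as": "digit",
        // keep everything except "0" (zero count) and "-" (negative count)
        "cond": { "$and": [ { "$ne": [ "$$digit", "0" ] }, { "$ne": [ "$$digit", "-" ] } ] }
      }
    },
    "deaths": {
      "$filter": {
        "input": "$deaths",
        "as": "digit",
        "cond": { "$and": [ { "$ne": [ "$$digit", "0" ] }, { "$ne": [ "$$digit", "-" ] } ] }
      }
    }
  }
}
```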
At this point, our documents in the pipeline have the same shape as previously.

Stage 3: Array Sizes

The final goal here is to calculate the percentages of 1s, 2s, ..., 9s in these two arrays, respectively. To compute this, I will need the size of the arrays to apply the rule of three.
This stage is easy as $size does exactly that.
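Something like this (the size field names are the ones I use for the rest of the pipeline):

```js
{
  "$addFields": {
    "confirmed_size": { "$size": "$confirmed" },
    "deaths_size": { "$size": "$deaths" }
  }
}
```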
To be completely honest, I could compute this on the fly later, when I actually need it. But I'll need it multiple times later on, and this stage is inexpensive and eases my mind so... Let's KISS.
Here is the shape of our documents at this point:
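Along these lines (sizes illustrative):

```js
{
  "_id": "Japan",
  "confirmed": [ "1", "3", "5" /* ... */ ],
  "confirmed_size": 452,
  "deaths": [ "1", "2" /* ... */ ],
  "deaths_size": 398
}
```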
As you can see for Japan, our arrays are relatively long, so we could expect our percentages to be somewhat accurate.
It's far from being true for all the countries...

Stage 4: Eliminate Countries with Empty Arrays

I'm not good enough at math to decide which array size is significant enough to be statistically accurate, but I'm good enough to know that my rule of three will need to divide by the size of the array.
As dividing by zero is bad for health, I need to remove empty arrays. A sound statistician would probably also remove the small arrays... but not me 😅.
This stage is a trivial $match:
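Sketched with the size fields from Stage 3:

```js
{
  "$match": {
    "confirmed_size": { "$gt": 0 },
    "deaths_size": { "$gt": 0 }
  }
}
```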

Stage 5: Percentages of Digits

We are finally at the central stage of our pipeline. I need to apply a rule of three to calculate the percentage of 1s in an array:
  • Find how many 1s are in the array.
  • Multiply by 100.
  • Divide by the size of the array.
  • Round the final percentage to one decimal place. (I don't need more precision for my charts.)
Then, I need to repeat this operation for each digit and each array.
To find how many times a digit appears in the array, I can reuse techniques we learned earlier:
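For the 1s in the confirmed array, that count is:

```js
{
  "$size": {
    "$filter": {
      "input": "$confirmed",
      "as": "digit",
      "cond": { "$eq": [ "$$digit", "1" ] }
    }
  }
}
```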
I'm creating a new array which contains only the 1s with $filter and I calculate its size with $size.
Now I can $multiply this value (let's name it X) by 100, $divide by the size of the confirmed array, and $round the final result to one decimal.
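Put together, the percentage of 1s in the confirmed array looks like this sketch:

```js
{
  "$round": [
    {
      "$divide": [
        { "$multiply": [ 100, X ] }, // X = the $size + $filter expression above
        "$confirmed_size"
      ]
    },
    1
  ]
}
```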
As a reminder, here is the final document we want:
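It's the same sketch as earlier, trimmed to the relevant part:

```js
{
  "country": "US",
  "benford": [
    { "digit": 1, "confirmed": 22.3, "deaths": 36.1 }
    // ... digits 2 to 9 ...
  ]
}
```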
The value we just calculated above corresponds to the 22.3 that we have in this document.
At this point, we just need to repeat this operation nine times for each digit of the confirmed array and nine other times for the deaths array and assign the results accordingly in the new benford array of documents.
Here is what it looks like in the end:
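Truncated to the first digit to spare you the 18 repetitions (the full version is in the repository):

```js
{
  "$addFields": {
    "benford": [
      {
        "digit": 1,
        "confirmed": {
          "$round": [
            {
              "$divide": [
                {
                  "$multiply": [
                    100,
                    { "$size": { "$filter": { "input": "$confirmed", "as": "digit", "cond": { "$eq": [ "$$digit", "1" ] } } } }
                  ]
                },
                "$confirmed_size"
              ]
            },
            1
          ]
        },
        "deaths": { /* same expression with $deaths and $deaths_size */ }
      }
      // ... repeat for digits 2 to 9 ...
    ]
  }
}
```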
At this point in our pipeline, our documents look like this:
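Roughly (values illustrative):

```js
{
  "_id": "US",
  "confirmed": [ "2", "1" /* ... */ ],
  "confirmed_size": 435,
  "deaths": [ "3", "1" /* ... */ ],
  "deaths_size": 424,
  "benford": [
    { "digit": 1, "confirmed": 22.3, "deaths": 36.1 }
    // ... digits 2 to 9 ...
  ]
}
```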
Note: At this point, we don't need the arrays anymore. The target document is almost there.

Stage 6: Introduce Fake Country BenfordTheory

In my final charts, I wanted to display Benford's theoretical values alongside the actual values from the different countries, to easily spot which ones are potentially producing fake data (modulo the statistical noise and many other reasons).
Just to give you an idea, it looks like, globally, all the countries are producing legit data but some arrays are small and produce "statistical accidents."
To insert this "perfect" document, I need to introduce into my pipeline a fake, perfect country that has the perfect percentages. I decided to name it "BenfordTheory."
But (because there is always a "but"), as far as I know, there is no stage that can just let me insert a new document like this into my pipeline.
So close...
Luckily for me, I found a workaround to this problem with the $unionWith stage, new in MongoDB 4.4. All I have to do is insert my made-up document into a collection, and then I can "insert" all the documents from that collection into my pipeline at this stage.
I inserted my fake document into the new collection randomly named benford. Note that I made this document look like the documents at this current stage in my pipeline. I didn't care to insert the two arrays because I'm about to discard them anyway.
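Sketched in mongosh, with the theoretical percentages P(d) = log10(1 + 1/d) rounded to one decimal:

```js
db.benford.insertOne({
  "_id": "BenfordTheory",
  "benford": [
    { "digit": 1, "confirmed": 30.1, "deaths": 30.1 },
    { "digit": 2, "confirmed": 17.6, "deaths": 17.6 },
    { "digit": 3, "confirmed": 12.5, "deaths": 12.5 },
    { "digit": 4, "confirmed": 9.7, "deaths": 9.7 },
    { "digit": 5, "confirmed": 7.9, "deaths": 7.9 },
    { "digit": 6, "confirmed": 6.7, "deaths": 6.7 },
    { "digit": 7, "confirmed": 5.8, "deaths": 5.8 },
    { "digit": 8, "confirmed": 5.1, "deaths": 5.1 },
    { "digit": 9, "confirmed": 4.6, "deaths": 4.6 }
  ]
})
```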
With this new collection in place, all I need to do is $unionWith it.
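Which is a one-liner:

```js
{ "$unionWith": { "coll": "benford" } }
```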

Stage 7: Final Projection

At this point, our documents look almost like the target document that we set at the beginning of this blog post. Two differences, though:
  • The country name is in the _id key, not in a country key.
  • The two arrays are still here.
We can fix this with a simple $project stage.
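A sketch of it:

```js
{
  "$project": {
    "country": "$_id",
    "_id": 0,
    "benford": 1
  }
}
```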
Note that I choose which fields appear in the final document by inclusion here. _id is an exception and needs to be explicitly excluded. As the two arrays aren't explicitly included, they are excluded by default, like any other field that would be there. See the $project considerations in the MongoDB documentation.
Here is our final result:
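For the US, for example (again, only the 22.3 is quoted elsewhere in this post; the rest is illustrative):

```js
{
  "country": "US",
  "benford": [
    { "digit": 1, "confirmed": 22.3, "deaths": 36.1 },
    { "digit": 2, "confirmed": 21.1, "deaths": 14.4 },
    // ... digits 3 to 8 ...
    { "digit": 9, "confirmed": 3.6, "deaths": 4.6 }
  ]
}
```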
And please remember that some countries still have very small arrays in the pipeline, which produce wobbly percentages, because I didn't bother to filter them out.

The Final Pipeline

My final pipeline is pretty long because I repeat the same block for each digit and each array, 9 × 2 = 18 times in total.
I wrote a factorised version in JavaScript that can be executed in mongosh:
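Here is a sketch of what that factorisation can look like (the helper name percentage() is mine; the real script is in the repository):

```js
// Build the repetitive "percentage of digit d" expression once, in a helper.
function percentage(arrayField, sizeField, digit) {
  return {
    "$round": [
      {
        "$divide": [
          {
            "$multiply": [
              100,
              {
                "$size": {
                  "$filter": {
                    "input": arrayField,
                    "as": "d",
                    "cond": { "$eq": [ "$$d", String(digit) ] }
                  }
                }
              }
            ]
          },
          sizeField
        ]
      },
      1
    ]
  };
}

// One entry per digit, computed for both arrays.
const benford = [1, 2, 3, 4, 5, 6, 7, 8, 9].map((digit) => ({
  "digit": digit,
  "confirmed": percentage("$confirmed", "$confirmed_size", digit),
  "deaths": percentage("$deaths", "$deaths_size", digit)
}));

// Stage 5 then boils down to:
const stage5 = { "$addFields": { "benford": benford } };
```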
If you want to read the entire pipeline, it's available in this GitHub repository.
If you want to see more visually how this pipeline works step by step, import it in MongoDB Compass once you are connected to the cluster (see the URI in the Introduction). Use the New Pipeline From Text option in the covid19.countries_summary collection to import it.

An Even Better Pipeline?

Did you think that this pipeline I just presented was perfect?
Well well... It's definitely getting the job done, but we can make it better in many ways. I already mentioned in this blog post that we could remove Stage 3, for example, if we wanted to. It might not be as optimal, but it would be shorter.
Also, there is still Stage 5, in which I literally copy and paste the same piece of code 18 times... and Stage 6, where I have to use a workaround to insert a document in my pipeline.
Another solution could be to rewrite this pipeline with a $facet stage and execute two sub-pipelines in parallel to compute the results we want for the confirmed array and the deaths array. But this solution is actually about two times slower.
However, my colleague John Page came up with this pipeline, which is just better than mine because it applies more or less the same algorithm but doesn't repeat itself. The code is a lot cleaner, and I just love it, so I thought I would also share it with you.
John very smartly uses a $map operator to iterate over the nine digits, which makes the code a lot simpler to maintain.
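I won't reproduce John's exact pipeline here, but the idea looks roughly like this sketch:

```js
{
  "$addFields": {
    "benford": {
      "$map": {
        "input": [ 1, 2, 3, 4, 5, 6, 7, 8, 9 ],
        "as": "digit",
        "in": {
          "digit": "$$digit",
          "confirmed": {
            "$round": [
              {
                "$divide": [
                  {
                    "$multiply": [
                      100,
                      {
                        "$size": {
                          "$filter": {
                            "input": "$confirmed",
                            "as": "d",
                            "cond": { "$eq": [ "$$d", { "$toString": "$$digit" } ] }
                          }
                        }
                      }
                    ]
                  },
                  { "$size": "$confirmed" }
                ]
              },
              1
            ]
          }
          // deaths computed the same way
        }
      }
    }
  }
}
```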

Wrap-Up

In this blog post, I did my best to share with you the process of creating a relatively complex aggregation pipeline, along with a few tricks to transform your documents as efficiently as possible.
We talked about, and used in a real pipeline, the following aggregation pipeline stages and operators: $group, $push, $addFields, $filter, $size, $match, $multiply, $divide, $round, $unionWith, $project, $toString, $substrCP, and $map.
If you are a statistician and you can make sense of these results, please post a message on the Community Forum and ping me!
Also, let me know if you can find out if some countries are clearly generating fake data.
If you have questions, please head to our developer community website where the MongoDB engineers and the MongoDB community will help you build your next big idea with MongoDB.
