MongoDB Aggregation Framework versus Pandas using PyMongo?

Hello,

I have a set of raw MongoDB collections, and for reporting purposes I now need to create a set of aggregated collections that contain several computed/calculated values (measures) and groupings (dimensions), i.e. a computed pattern with a structure similar to a star schema:

[
  {
    "measure": 123,
    "dim_1": { "dim1.col1": "xyz" },
    "dim_2": { "dim2.col1": "abc" }
  },
  {
    "measure": 234,
    "dim_1": { "dim1.col1": "yyz" },
    "dim_2": { "dim2.col1": "def" }
  }
]

Please suggest whether the MongoDB Aggregation Framework is the right choice here, or whether I should use the Pandas library for all the transformations and computations and then simply load/insert the results into a MongoDB collection using PyMongo. Which is the more efficient method? A rough sketch of the Pandas route I have in mind is below.
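For context, the Pandas approach would look roughly like this; the connection string, collection names, and field names are placeholders, not the real schema:

import pandas as pd
from pymongo import MongoClient

# Placeholder connection, database, collection and field names for illustration only
client = MongoClient("mongodb://localhost:27017")
db = client["reporting"]

# Pull the raw documents into the client and aggregate locally with Pandas
raw = pd.DataFrame(list(db["raw_events"].find({}, {"_id": 0})))
summary = (
    raw.groupby(["dim1_col1", "dim2_col1"], as_index=False)["value"]
       .sum()
       .rename(columns={"value": "measure"})
)

# Reshape into the nested star-schema-style documents and write them back
docs = [
    {
        # cast to a plain Python number so PyMongo can encode it
        "measure": float(row["measure"]),
        "dim_1": {"col1": row["dim1_col1"]},
        "dim_2": {"col1": row["dim2_col1"]},
    }
    for _, row in summary.iterrows()
]
db["aggregated_report"].insert_many(docs)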

Appreciate your inputs.

Thanks!

Using Pandas will require you to copy all the data to the client and process it locally. For a large dataset this can be a significant overhead. With the aggregation framework, the processing is done inside the database cluster with no network transfer or local storage overheads. You can also easily stream your results into another collection using $out.
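To make that concrete, here is a minimal sketch of what such a server-side pipeline could look like from PyMongo. The collection names, dimension fields, and the summed value field are assumptions, not your actual schema:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["reporting"]

# Group on the dimension fields, compute the measure, and write the result
# to a new collection entirely on the server with $out.
pipeline = [
    {"$group": {
        "_id": {"dim1_col1": "$dim1_col1", "dim2_col1": "$dim2_col1"},
        "measure": {"$sum": "$value"},
    }},
    {"$project": {
        "_id": 0,
        "measure": 1,
        "dim_1": {"col1": "$_id.dim1_col1"},
        "dim_2": {"col1": "$_id.dim2_col1"},
    }},
    {"$out": "aggregated_report"},
]
db["raw_events"].aggregate(pipeline)

Nothing is returned to the client here; the grouped documents land directly in the aggregated_report collection.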


Thanks. The aggregation framework also seems to have certain result-size and memory limitations. However, is it correct that when allowDiskUse is set to True, these restrictions are taken care of implicitly? And since we would be pushing the heavy processing down onto the MongoDB server side, what is the minimum recommended DB capacity w.r.t. RAM, storage, etc.?
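For reference, this is how I plan to pass it; the collection and pipeline below are placeholders:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["reporting"]

# Hypothetical grouping pipeline; allowDiskUse lets memory-heavy stages
# such as $group and $sort spill to temporary files on disk instead of
# failing when they exceed the per-stage in-memory limit.
pipeline = [
    {"$group": {"_id": "$dim1_col1", "measure": {"$sum": "$value"}}},
]
result = list(db["raw_events"].aggregate(pipeline, allowDiskUse=True))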

It's hard to predict the required memory without knowing the dataset size and the queries, and without doing a full working-set calculation. If you are using Atlas, it is relatively easy to try out different configurations and see which memory configuration works.