Percentile with group aggregation

#data

    { "array" : 112233, "hgName" : "host1",  "iops" : 1, "mbps" : 0.001464844, "responseTime" : 0.06, "timeStamp" : NumberLong("1572665400110") }
    { "array" : 112233, "hgName" : "host1",  "iops" : 1, "mbps" : 0.001464844, "responseTime" : 0.06, "timeStamp" : NumberLong("1572666600117") }
    { "array" : 112233, "hgName" : "host2",  "iops" : 1, "mbps" : 0.001464844, "responseTime" : 0.06, "timeStamp" : NumberLong("1572661800115") }
    { "array" : 112233, "hgName" : "host3",  "iops" : 1, "mbps" : 0.001464844, "responseTime" : 0.06, "timeStamp" : NumberLong("1572666900103") }
    { "array" : 112233, "hgName" : "host3",  "iops" : 1, "mbps" : 0.001464844, "responseTime" : 0.06, "timeStamp" : NumberLong("1572667200110") }
    <--Truncated-->

Current Query:

    db.MyCollections.aggregate([
      { 
       $match: {}    
      }, {
        $group: {
          _id: "$hgName",
          average_IOPS: {
            $avg: "$iops"
          },
    	  average_MBps: {
            $avg: "$mbps"
          },
    	  Max_MBPS: {$max: "$mbps"},
    	  Max_IOPS: {$max: "$iops"}
    	}
      }
    ]);
    { "_id" : "host1", "average_IOPS" : 32.26, "average_MBps" : 11.1, "Max_MBPS" : 664.8, "Max_IOPS" : 1674 }
    { "_id" : "host2", "average_IOPS" : 54.44, "average_MBps" : 0.19, "Max_MBPS" : 13.96, "Max_IOPS" : 3339 }
    { "_id" : "host3", "average_IOPS" : 0.909, "average_MBps" : 0.00, "Max_MBPS" : 0.002, "Max_IOPS" : 1 }

I managed to get the aggregation done this level for a timeseries data,

I need to include a "percentile (0.95) calculation for iops and mbps " same as average figures consolidation by host.

i.e percentile_95_iops and percentile_95_mbps along with other results.

Thanks and Regards
Vipin

1 Like

Hi Vipin,

Have you considered storing the time series data in a different schema pattern? Having a document per record is very relational and MongoDB has several great (non-relational) alternatives for more efficiently storing this information. I recommend checking out this blog post:

Storing the information differently removes the need to aggregate at all. If all the information needed by the application is available within a single document, calculating the 95th percentile is a much simpler (and very client-side) operation.

1 Like

Hi Vipin,

Thank you for posting this here. I agree with @Justin that you would benefit from the article he is pointing you at. As to the percentile calculations for mbps and iops…perhaps a more experienced person knows the math tricks needed with that.

I did copy your sample data and put it in a collection on my own MongoDB instance, but then I had to look up resources on how to do percentile calculations and I need to study these. Also, my skill with aggregation queries is rusty. I believe I must create separate aggregation queries: one for mbps and one for iops. The reason for separate aggregation queries is because you have to sort the documents in a sort stage prior to an actual calculation for percentile. So there will be one aggregation query for iops which sorts the values in a sort stage. A totally separate aggregation is needed to sort the values for mbps in its own sort stage.

I could have spent time experimenting with this, but I had some other things to do. I will try to experiment over this week and do my best, but others with more skill and experience than I could offer really good suggestions to you. I don’t know what you will do with the time series data e.g. the time stamp. It will help you to read that article that @Justin pointed you to.

Thanks so much

Bob

1 Like

FYI Vipin, the requested functionality (calculating percentile via aggregation) is being tracked by SERVER-7463.

1 Like

WOW , Mongo Community response is Amazing. Loving it.

@Robert_Cochran,

Many thanks.
Actually the maths part is easy in my view , but am not able to put it in queries after Step#1/2.
Can you guide me with this?
I tried to do it in many ways with group ,

  1. I have to take the IOPS values / group into Array ( like $push )
  2. Sort this in ASCENDING order.
  3. Take the Count of $push items / OR Get count of each Group item . in my case 288 in general.
  4. Get the Array content by Index(x) where X = floor(95% of Total Count )
  5. Display/project along with AVG/MAX counters.

@Justin,

Excellent Piece of info that will help for some optimizations with my data.
in my case, this is to generate a summary of many 1000’s of records.
I have a running version with JSON/Vanilla JS , including percentile but many limitations there.
Thought if all the things I could make use of DB, it would be great.

At alexbevi,

Great to know this. Tx
for many of the time series / Trouble shooting / performance data actually percentile makes a lot of sense than average. So it is a very important one and i will certainly keep a watch on this.

Thanks and Regards
Vipin

My two cents, you can use the computed pattern. Which is simply keep a rolling percentage as new values are added instead of calculating it every time.