Index on the Time Series collection structure

Dhruvesh_Patel · November 20, 2020, 2:13pm

Hi

we are in process to store historical time series data (millions of rows) for Reuters ric. we are following " Scenario 3: Size-based bucketing" as per MongoDB's New Time Series Collections | MongoDB. We are storing 1000 quotes per document .

The structure of upsert query will be as following.

bulk.find({ric:"LCOJ1-V1",utc_date:ISODate("2020-08-13T00:00:00.000000000Z"),nquotes:{$lt:1000}}).upsert().updateOne(
{
	$push:{quotes:{utc_date_time:NumberInt("1597337417"),nano_seconds:NumberInt(784325554),ask_price:-0.98,ask_size:NumberInt(4),bid_price:-1.0,bid_size:NumberInt(4),qualifiers:["[BID_TONE]","[ASK_TONE]"]}},
	$setOnInsert:{exch_utc_offset:NumberInt(1)},
	$min:{quote_first:NumberLong("1597337417784")},
	$max:{quote_last:NumberLong("1597337417784")},
	$inc :{nquotes:NumberInt(1)}
});

bulk.find({ric:"LCOJ1-V1",utc_date:ISODate("2020-08-13T00:00:00.000000000Z"),nquotes:{$lt:1000}}).upsert().updateOne(
{
	$push:{quotes:{utc_date_time:NumberInt("1597337606"),nano_seconds:NumberInt(436207836),ask_price:-0.97,ask_size:NumberInt(4),bid_price:-1.0,bid_size:NumberInt(4)}},
	$setOnInsert:{exch_utc_offset:NumberInt(1)},
	$min:{quote_first:NumberLong("1597337606436")},
	$max:{quote_last:NumberLong("1597337606436")},
	$inc :{nquotes:NumberInt(1)}
});

bulk.find({ric:"LCOJ1-V1",utc_date:ISODate("2020-08-13T00:00:00.000000000Z"),nquotes:{$lt:1000}}).upsert().updateOne(
{
	$push:{quotes:{utc_date_time:NumberInt("1597337635"),nano_seconds:NumberInt(967713742),ask_price:-0.98,ask_size:NumberInt(4),bid_price:-1.0,bid_size:NumberInt(4)}},
	$setOnInsert:{exch_utc_offset:NumberInt(1)},
	$min:{quote_first:NumberLong("1597337635967")},
	$max:{quote_last:NumberLong("1597337635967")},
	$inc :{nquotes:NumberInt(1)}
});

Based on recommendation of article we have created composite index on ({ric:1,utc_date:1,nquotes:1}) which is perfectly fine for “upsert” operation.

But during the query time we are expected to query data using also “ric”,“utc_date”,“first_quote” and “last_quote”. So far we have created composite index on “ric,utc_date,nquotes”.

What would be the suggestion to create index if we are querying data using “ric”,“utc_date”,“first_quote” and “last_quote” ? Since we are processing millions of rows , we need to consider performance and required space for any additional index.

here is example read query

db.getCollection("time_series").find(
    { 
        "$and" : [
            { 
                "ric" : "CLV0-X0"
            }, 
            { 
                "utc_date" : ISODate("2020-08-03T00:00:00.000+0000")
            }, 
            { 
                "quote_first" : { 
                    "$gte" : NumberLong(1596493635301)
                }
            }, 
            { 
                "quote_first" : { 
                    "$lte" : NumberLong(1596499142016)
                }
            }, 
            { 
                "quote_last" : { 
                    "$lte" : NumberLong(1596499197995)
                }
            }
        ]
    }, 
    { 
        "ric" : 1.0, 
        "utc_date" : 1.0, 
        "quote_first" : 1.0, 
        "quote_last" : 1.0,
         "quotes" : 1.0
    }
).sort(
    { 
        "utc_date" : 1.0,"quote_first" : 1.0
    }
);

Thanks for looking into this.

Pavel_Duchovny · November 21, 2020, 5:25am

Hi @Dhruvesh_Patel,

Index field order can be initially determine using Equility Sort Range order. Recommend reading more here:

The equility fields in your query is ric and utc_date while range is qoute_first and qoute_last.

So optimal index could be:

{ ric: 1, utc_date : 1, qoute_first : 1, qoute_last : 1 }

Thanks
Pavel