Hi
we are in the process of storing historical time series data (millions of rows) for Reuters RICs. We are following "Scenario 3: Size-based bucketing" from the article MongoDB's New Time Series Collections | MongoDB, storing 1000 quotes per document.
The structure of the upsert operations is as follows:
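For reference, each bucket document ends up shaped roughly like this (field names taken from the upsert and read queries below; the values are illustrative, not a real dump):

```javascript
// Illustrative shape of one size-based bucket document (field names from
// our upsert/read queries; the values here are made up for the example).
const bucketDoc = {
  ric: "LCOJ1-V1",                            // Reuters instrument code
  utc_date: new Date("2020-08-13T00:00:00Z"), // UTC day of the bucket
  exch_utc_offset: 1,                         // written once via $setOnInsert
  quote_first: 1597337417784,                 // ms timestamp of earliest quote ($min)
  quote_last: 1597337606436,                  // ms timestamp of latest quote ($max)
  nquotes: 2,                                 // bucket is closed once this reaches 1000
  quotes: [                                   // embedded quote array ($push)
    { utc_date_time: 1597337417, nano_seconds: 784325554,
      ask_price: -0.98, ask_size: 4, bid_price: -1.0, bid_size: 4,
      qualifiers: ["[BID_TONE]", "[ASK_TONE]"] },
    { utc_date_time: 1597337606, nano_seconds: 436207836,
      ask_price: -0.97, ask_size: 4, bid_price: -1.0, bid_size: 4 }
  ]
};
```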
// Assuming bulk was initialized earlier, e.g.:
// var bulk = db.getCollection("time_series").initializeOrderedBulkOp();
bulk.find({ric: "LCOJ1-V1", utc_date: ISODate("2020-08-13T00:00:00.000000000Z"), nquotes: {$lt: 1000}}).upsert().updateOne(
{
    $push: {quotes: {utc_date_time: NumberInt("1597337417"), nano_seconds: NumberInt(784325554), ask_price: -0.98, ask_size: NumberInt(4), bid_price: -1.0, bid_size: NumberInt(4), qualifiers: ["[BID_TONE]", "[ASK_TONE]"]}},
    $setOnInsert: {exch_utc_offset: NumberInt(1)},
    $min: {quote_first: NumberLong("1597337417784")},
    $max: {quote_last: NumberLong("1597337417784")},
    $inc: {nquotes: NumberInt(1)}
});
bulk.find({ric: "LCOJ1-V1", utc_date: ISODate("2020-08-13T00:00:00.000000000Z"), nquotes: {$lt: 1000}}).upsert().updateOne(
{
    $push: {quotes: {utc_date_time: NumberInt("1597337606"), nano_seconds: NumberInt(436207836), ask_price: -0.97, ask_size: NumberInt(4), bid_price: -1.0, bid_size: NumberInt(4)}},
    $setOnInsert: {exch_utc_offset: NumberInt(1)},
    $min: {quote_first: NumberLong("1597337606436")},
    $max: {quote_last: NumberLong("1597337606436")},
    $inc: {nquotes: NumberInt(1)}
});
bulk.find({ric: "LCOJ1-V1", utc_date: ISODate("2020-08-13T00:00:00.000000000Z"), nquotes: {$lt: 1000}}).upsert().updateOne(
{
    $push: {quotes: {utc_date_time: NumberInt("1597337635"), nano_seconds: NumberInt(967713742), ask_price: -0.98, ask_size: NumberInt(4), bid_price: -1.0, bid_size: NumberInt(4)}},
    $setOnInsert: {exch_utc_offset: NumberInt(1)},
    $min: {quote_first: NumberLong("1597337635967")},
    $max: {quote_last: NumberLong("1597337635967")},
    $inc: {nquotes: NumberInt(1)}
});
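After queuing a batch of these operations, they are sent to the server in a single round trip (a sketch, assuming a bulk builder created with initializeOrderedBulkOp() as in the queries above):

```javascript
// Flush the queued upserts to the server and inspect the outcome
// (counts of matched, modified, and upserted documents).
var result = bulk.execute();
printjson(result);
```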
Based on the article's recommendation we have created a compound index on {ric: 1, utc_date: 1, nquotes: 1}, which works well for the upsert operation.
At query time, however, we are expected to query the data by "ric", "utc_date", "quote_first" and "quote_last". So far the only compound index we have is on "ric, utc_date, nquotes".
What index would you suggest for queries on "ric", "utc_date", "quote_first" and "quote_last"? Since we are processing millions of rows, we need to weigh query performance against the storage required for any additional index.
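For concreteness, the candidate we are currently considering (not a confirmed choice, just the option we are weighing) puts the equality fields first and the range fields after them:

```javascript
// Candidate index for the read query: equality fields (ric, utc_date)
// first, then the range fields (quote_first, quote_last).
// This is only what we are considering, not a confirmed design.
db.getCollection("time_series").createIndex(
    {ric: 1, utc_date: 1, quote_first: 1, quote_last: 1}
);
```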
Here is an example read query:
db.getCollection("time_series").find(
    {
        ric: "CLV0-X0",
        utc_date: ISODate("2020-08-03T00:00:00.000+0000"),
        quote_first: {$gte: NumberLong(1596493635301), $lte: NumberLong(1596499142016)},
        quote_last: {$lte: NumberLong(1596499197995)}
    },
    {
        ric: 1,
        utc_date: 1,
        quote_first: 1,
        quote_last: 1,
        quotes: 1
    }
).sort(
    {
        utc_date: 1, quote_first: 1
    }
);
Thanks for looking into this.