Sort before group stage

I am wondering if i can use sort before grouping in this type of query because i want the sort stage to use the index on samples.timestamp1

mydb1.mongodbbucketnocpu.aggregate(
  [
      {
          "$match": {
              "samples.timestamp1": {"$gte": datetime.strptime("2010-01-01 00:00:00", "%Y-%m-%d %H:%M:%S"),
                        "$lte": datetime.strptime("2015-01-01 01:55:00", "%Y-%m-%d %H:%M:%S")},

          }
      },
{ "$unwind": "$samples" },

      {
          "$group": {

              "_id": {"$dateToString": {"format": "%Y-%m-%d %H", "date": "$samples.timestamp1"}},
              "max_id13": {
                  "$max": "$samples.id13"
              }
          }
      },
      {
          "$project": {
              "_id": 0,
              "day": "$_id",
              "max_id13": 1
          }
      },
      {"$sort": {"hour": -1}},
     { "$limit": 5}


    ]
)

My data contains about 96k documents that contain 12 subdocuments each.

{
	"_id" : ObjectId("607f185f2a477a621641cded"),
	"nsamples" : 12,
	"samples" : [
		{
			"id1" : 3758,
			"id6" : 2,
			"id7" : -79.09,
			"id8" : 35.97,
			"id9" : 5.5,
			"id10" : 0,
			"id11" : -99999,
			"id12" : 0,
			"id13" : -9999,
			"c14" : "U",
			"id15" : 0,
			"id16" : 99,
			"id17" : 0,
			"id18" : -99,
			"id19" : -9999,
			"id20" : 33,
			"id21" : 0,
			"id22" : -99,
			"id23" : 0,
			"timestamp1" : ISODate("2010-01-01T00:00:00Z"),
			"timestamp2" : ISODate("2009-12-31T19:05:00Z")
		},
		{
			"id1" : 3758,
			"id6" : 2,
        .
        .
        .

When i tried using {"$sort": {"samples.timestamp1": -1}}, before group stage my output was thisSort exceeded memory limit of 104857600 bytes
Is ti possible to use sort before group?What do you think i should do to optimize my query?

Sorting happens in memory so the sort stage in aggregation will try and slurp all your data into memory. Hence the error. Given that group generally substantially reduces the amount of data coming out of the pipeline you may get better mileage doing the sort after the group stage. Run a count after the group to see how many documents emerge. This will give you a sense of whether the sort will fit in memory.

1 Like

Thank you for you explanation!When i do match samples.timestamp1 from 2010 to 2015 using sort before group it says that exceed memory.when i do match samples.timestamp1 from 2010 to 2011 using sort before group it returns the right generated documents.So its okay using sort before group only for 1 year of documents…So do you think i should stick as it is now? Thanks in advance

With repsect
Harris Gekas

sort requires that the documents fit in memory unless your specify allowDiskUse. See the example below. The sorting is limited by memory so my guess is data from 2010 to 2011 fits in memory, data from 2011 to 2015 doesn’t. This makes sense as their are more documents in the second dataset. Try and reduce the data set with match or project if you are having these problems. Failing that turn on allowDiskUse which will be slower.

var results = db.stocks.aggregate(
                                   [
                                     { $project : { cusip: 1, date: 1, price: 1, _id: 0 } },
                                     { $sort : { cusip : 1, date: 1 } }
                                   ],
                                   {
                                     allowDiskUse: true
                                   }
                                 )

Yes i tried that but its way slower than just sorting after the group.So i will leave the query as it is here

mydb1.mongodbbucketnocpu.aggregate(
  [
      {
          "$match": {
              "samples.timestamp1": {"$gte": datetime.strptime("2010-01-01 00:00:00", "%Y-%m-%d %H:%M:%S"),
                        "$lte": datetime.strptime("2015-01-01 01:55:00", "%Y-%m-%d %H:%M:%S")},

          }
      },
{ "$unwind": "$samples" },

      {
          "$group": {

              "_id": {"$dateToString": {"format": "%Y-%m-%d %H", "date": "$samples.timestamp1"}},
              "max_id13": {
                  "$max": "$samples.id13"
              }
          }
      },
      {"$sort": {"hour": -1}},
     { "$limit": 5}
      {
          "$project": {
              "_id": 0,
              "day": "$_id",
              "max_id13": 1
          }
      },



    ]
)

Can you post a sample document?

Why do you want to sort before grouping? Your $match stage already ensures that index is being used (if there is one). You can use explain to see the query plan for the aggregation. By the way, this topic is probably a better fit for “Working with Data” as it’s not specific to any Driver or ODM…

Asya