Java query - sort, group by with max

Hi :slight_smile:

I am totally new to MongoDB. I need help to improve my Java Query. The query is to get all the child array elements to a document with highest time for a distinct region. Since I use Accumulators.first I need to sort first. My question is: is there a way to skip the sorting part but still get the correct child array for a distinct region with the highest time ?

        Bson match = Aggregates.match(
                Filters.eq("country._id", countryId)
        );

        Bson sort = Aggregates.sort(
                Indexes.descending("time", "region._id")
        );

        Bson group = Aggregates.group(
                "region._id",
                Accumulators.max("time", "$time"),
                Accumulators.first("contentList", "$contentList")
        );

        List<Data> dataList = dataMongoCollection.aggregate(
                Arrays.asList(
                        match,
                        sort,
                        group
                )
        ).into(new ArrayList<>());

Many thanks in advance :slight_smile:

Sample document:

{
   "_id":"ObjectId(""60"")",
   "contentList":[
      {
         "colors":[
            {
               "displayName":"Red",
               "reference":"red_0",
               "value":"red"
            }
         ],
         "country":{
            "_id":"countryId",
            "name":"Sweden"
         },
         "region":{
            "_id":"regionId",
            "name":"Stockholm"
         },
         "score":20002.4,
         "time":NumberLong(16237),
         "weights":[
            {
               "displayName":"Red",
               "reference":"weight_0",
               "value":0.08
            }
         ]
      }
   ],
   "country":{
      "_id":"c_0",
      "name":"Sweden"
   },
   "granularity":"PT15M",
   "region":{
      "_id":"r_0",
      "name":"Stockholm"
   },
   "time":NumberLong(1623751979098)
}

I do not think that the sort is helping you very much and might add processing time if you do have an index that looks like { country._id : 1 , time : -1 , region._id : -1 }. I am not sure if the order of time and region._id in the index is important. Anyhow I would try without the sort first.

Dumping everything into an ArrayList might be slower than using cursor methods as in some circumstances you might be able to start your local processing before all matching documents are done with the pipeline.

Thank you, steevej :slight_smile: Appreciated very much.
Since I use first to extract ContentList I believe that sort is necessary since first just return the first best document not the one with the highest value of time. I have tested only with first without sorting it doesn’t work.

Good point with your comment on dumping part :slight_smile:

Is there a way to extract the contentList for the correct document, i.e. the document with the highest value of time to a distinct region without using first ?

Hi @Yaqin_Chen_Hedin :smiley: !

I think in your case you case use an index on the $group stage because you are in this very particular case documented here in the $group section:

But in this case, the index will only work on the $group stage if you ONLY use $first in your $group stage. As you are sorting in descending order right before anyway, the first value of $time should always be the first & max one at the same time. So replace Accumulators.max by Accumulators.first on the $time and create the index:

{ "country._id": 1, "time": -1, "region._id": -1, "contentList": 1}

I’m not 100% this will work because you are using region._id as your “group by” field.

Can you give it a try and let me know what the explain output of this aggregation tells you?

I did a little test with your single document and this aggregation:

[
  {
    '$match': {
      'country._id': 'c_0'
    }
  }, {
    '$sort': {
      'time': -1, 
      'region._id': -1
    }
  }, {
    '$group': {
      '_id': '$region._id', 
      'time': {
        '$first': '$time'
      }, 
      'contentList': {
        '$first': '$contentList'
      }
    }
  }
]

I get this:

{
	"stages" : [
		{
			"$cursor" : {
				"queryPlanner" : {
					"plannerVersion" : 1,
					"namespace" : "test.coll",
					"indexFilterSet" : false,
					"parsedQuery" : {
						"country._id" : {
							"$eq" : "c_0"
						}
					},
					"queryHash" : "B626CD4F",
					"planCacheKey" : "68E972A3",
					"winningPlan" : {
						"stage" : "PROJECTION_DEFAULT",
						"transformBy" : {
							"contentList" : 1,
							"region._id" : 1,
							"time" : 1,
							"_id" : 0
						},
						"inputStage" : {
							"stage" : "FETCH",
							"inputStage" : {
								"stage" : "IXSCAN",
								"keyPattern" : {
									"country._id" : 1,
									"time" : -1,
									"region._id" : -1,
									"contentList" : 1
								},
								"indexName" : "country._id_1_time_-1_region._id_-1_contentList_1",
								"isMultiKey" : true,
								"multiKeyPaths" : {
									"country._id" : [ ],
									"time" : [ ],
									"region._id" : [ ],
									"contentList" : [
										"contentList"
									]
								},
								"isUnique" : false,
								"isSparse" : false,
								"isPartial" : false,
								"indexVersion" : 2,
								"direction" : "forward",
								"indexBounds" : {
									"country._id" : [
										"[\"c_0\", \"c_0\"]"
									],
									"time" : [
										"[MaxKey, MinKey]"
									],
									"region._id" : [
										"[MaxKey, MinKey]"
									],
									"contentList" : [
										"[MinKey, MaxKey]"
									]
								}
							}
						}
					},
					"rejectedPlans" : [ ]
				}
			}
		},
		{
			"$group" : {
				"_id" : "$region._id",
				"time" : {
					"$first" : "$time"
				},
				"contentList" : {
					"$first" : "$contentList"
				}
			}
		}
	],
	"serverInfo" : {
		"host" : "hafx",
		"port" : 27017,
		"version" : "4.4.6",
		"gitVersion" : "72e66213c2c3eab37d9358d5e78ad7f5c1d0d0d7"
	},
	"ok" : 1,
	"$clusterTime" : {
		"clusterTime" : Timestamp(1623794133, 5),
		"signature" : {
			"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
			"keyId" : NumberLong(0)
		}
	},
	"operationTime" : Timestamp(1623794133, 5)
}

But I’m not sure if the $group is actually covered by the index here… But at least the index is used so it’s a good sign.

Cheers,
Maxime.

Hi @MaBeuLux88 :slight_smile:
Thank you very much for your response ! Highly appreciated :slight_smile:

Your query is very similar to mine. I want to avoid the sort step since the data can be very very big. :slight_smile:

Best regards
yEH