Equivalent of "select distinct on" in mongodb

Hello guys.I am wondering if mongodb have an Equivalent function as select distinct on like sql.
For example if i have this query in Sql:

SELECT DISTINCT ON (department) * FROM employees
ORDER BY department, salary DESC;

How can i do the same query in mongodb?
For mongodb i use python with pymongo.

Hi @harris

There are a couple of ways:

  1. Use collection.distinct(). See the manual for examples, and also the Pymongo manual on how to call this using Pymongo.

  2. Use the aggregation $group stage. This one is slightly more complex but is more flexible. See the manual for examples, and here’s the corresponding Pymongo page for the method. For this method, put the field you want the distinct values of as the _id part of the $group stage.

Some quick examples:

  1. Using distinct(): db.collection.distinct('field_x') will output a list of distinct values within the field field_x.
  2. Using aggregation: db.collection.aggregate([ {$group: {_id: '$field_x'}} ]) will output multiple documents, each with the values of field_x as the _id.

Regarding aggregation, the MongoDB University course M121: The MongoDB Aggregation Framework might be of interest.

Best regards,
Kevin

2 Likes

Hello!Thank you for you help!This is what i have tried:
If the original query in sql is this:

select distinct on (id13) id13, timestamp1
 from oneindextwocolumnsfalse3years 
where timestamp1>='2010-01-01 00:00:00' and timestamp1<='2015-01-01 00:55:00'  
order by id13,timestamp1 desc

This is what i tried

mydb1.mongodbindextimestamp1.aggregate([

    {
        "$match": {
            "timestamp1": {"$gte": datetime.strptime("2010-01-01 00:00:00", "%Y-%m-%d %H:%M:%S"),
                          "$lte" :datetime.strptime("2015-01-01 00:55:00", "%Y-%m-%d %H:%M:%S")}
        }
    },

{
        "$group": {
            "_id":{
                "id_13":"$id13"
        },

      }
},
    {
        "$project": {
            "_id": 0,
            "id13":1,
            "timestamp1":1
        }
    },
    {"$sort": {"id13": 1,"timestamp1":-1}}

])

But it doesn’t seems to work.Do you have any suggestion?

Hi @harris

Do you have an example document? Also, what’s the expected output?

Best regards
Kevin

Yes i have.My documents looks like this:

{
	"_id" : ObjectId("605f104bdc49e72201af5a47"),
	"id1" : 3758,
	"id6" : 2,
	"id7" : -79.09,
	"id8" : 35.97,
	"id9" : 5.5,
	"id10" : 0,
	"id11" : -99999,
	"id12" : 0,
	"id13" : -9999,
	"c14" : "U",
	"id15" : 0,
	"id16" : 99,
	"id17" : 0,
	"id18" : -99,
	"id19" : -9999,
	"id20" : 1197,
	"id21" : 0,
	"id22" : -99,
	"id23" : 0,
	"timestamp1" : ISODate("2010-01-01T01:35:00Z"),
	"timestamp2" : ISODate("2009-12-31T20:35:00Z")
}
{
	"_id" : ObjectId("605f104bdc49e72201af5a48"),
	"id1" : 3758,
	"id6" : 2,
	"id7" : -79.09,
	"id8" : 35.97,
	"id9" : 5.5,
	"id10" : 0,
	"id11" : -99999,
	"id12" : 0,
	"id13" : -9999,
	"c14" : "U",
	"id15" : 0,
	"id16" : 99,
	"id17" : 0,
	"id18" : -99,
	"id19" : -9999,
	"id20" : 1198,
	"id21" : 0,
	"id22" : -99,
	"id23" : 0,
	"timestamp1" : ISODate("2010-01-01T01:40:00Z"),
	"timestamp2" : ISODate("2009-12-31T20:40:00Z")
}

But with my code the output is this:

{}
{}
{}
.
.
.
{}

Hi @harris

I can get the unique values from id13 as per your example, but I’m not sure what you want to do timestamp field there:

[
    {
        "$match": {
            "timestamp1": {
                "$gte": datetime.strptime("2010-01-01 00:00:00", "%Y-%m-%d %H:%M:%S"),
                "$lte" :datetime.strptime("2015-01-01 00:55:00", "%Y-%m-%d %H:%M:%S")}
        }
    },
    {
        "$group": {
            "_id": "$id13"
        },
    },
]

That pipeline outputs:

[{'_id': -9999.0}]

Since aggregation is a pipeline, stages down the pipeline can only access what was made available to them from the previous stages up the pipeline. Thus if you group on _id and don’t specify other fields, the following $project stage won’t have access to the timestamp field anymore.

So from your example, I’m unclear on what the timestamp field should contain. Is it the max timestamp? min timestamp? or should it return an array of timestamps that was matched by the $match stage?

Best regards,
Kevin