Need advice: cannot sort and select the required documents after $group and $push

Hi,

I need to query data from a huge collection (> 2 TB) and export only the data I need as a CSV file to process in Python. While querying the data I need to:

  1. Group and sort the documents.
  2. Select only a certain range and export it as CSV.

Right now I can build a pipeline like this.

db.collection.aggregate(
    [
        { "$limit": 200000 },
        {
            "$group": {
                "_id": {
                    "TID": "$TID",
                    "Opt": "$Opt",
                    "DSN1": "$DSN1",
                    "DSN2": "$DSN2",
                    "Column": "$Column",
                    "Row": "$Row",
                    "CSN": { "$substr": ["$CSN", 2, -6] }
                },
                "details": {
                    "$push": {
                        "A": "$A",
                        "B": "$B",
                        "C": "$C",
                        "D": "$D",
                        "timestamp": "$timestamp"
                    }
                }
            }
        }
    ],
    {
        allowDiskUse: true
    }
);

The output from this pipeline is of the form { "_id": ..., "details": [ the set of "details" for that "_id" ] }, as shown in the attached picture.

From this point I’m not sure how to sort by "details.timestamp" and select only a certain range of "details". Moreover, how can I break the output into one "_id" per "details" element, rather than one "_id" with the whole set of "details" as I currently have? One "_id" to one "details" is more suitable for my processing.

Please kindly advise.

Thanks.

Hi @Preutti_Puawade,

Is it possible to filter using $match and $sort before the $limit and $group? If yes, you should consider doing so, and index those fields in the order of the predicates, going by Equality, Sort and then Range of the queried fields.
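
A minimal sketch of such an index with PyMongo, assuming the $match uses an equality predicate on TID and a range predicate on timestamp, and that timestamp is also the sort key (the connection string and database name are placeholders; adjust the fields and their order to your real predicates):

import the PyMongo client
    from pymongo import MongoClient, ASCENDING

    client = MongoClient("mongodb://localhost:27017")   # hypothetical connection string
    coll = client["mydb"]["collection"]                 # hypothetical database name

    # ESR: equality field(s) first, then the sort field, then range fields.
    # Here TID is assumed to be matched by equality, and timestamp is both
    # the sort key and the range predicate.
    coll.create_index([("TID", ASCENDING), ("timestamp", ASCENDING)])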

To break up the results, I believe you will need an $unwind stage at the end so that each pushed element becomes a single document:

{
$unwind : "$details"
}

After this $unwind you can also sort and filter the unwound documents, but be aware that this will all happen in memory and cannot use indexes. Therefore, to optimise the pipeline, try to filter the documents as much as you can in the first stage of the pipeline.

To export to CSV you can use a scripting language like Python with PyMongo, or use mongoexport to export a view which you define with your pipeline.
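
A minimal sketch of the PyMongo route, assuming a hypothetical connection string, database name and output file, with the pipeline left as a placeholder; the CSV columns are taken from the keys of the first result document:

    import csv
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")   # hypothetical connection string
    coll = client["mydb"]["collection"]                 # hypothetical database name

    pipeline = [
        # ... your $match / $sort / $group / $unwind stages ...
    ]

    with open("export.csv", "w", newline="") as f:
        writer = None
        for doc in coll.aggregate(pipeline, allowDiskUse=True):
            # Nested values (e.g. the grouped "_id" document) are written as their
            # string representation; flatten them first if you need plain columns.
            if writer is None:
                writer = csv.DictWriter(f, fieldnames=list(doc.keys()))
                writer.writeheader()
            writer.writerow(doc)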

Thanks
Pavel

Hi

@Pavel_Duchovny thanks for the advice. As far as I know, I don’t think I can filter using $match and $sort before the $limit and $group. Maybe if I break down my operation you can guide me further.

I group into "_id" by TID, Opt, DSN1, DSN2, Column, Row and CSN to partition the data. Then I push "details" into each group and want to sort the details by "details.timestamp". Then I want to select only the 30th to 180th "details" in each "_id" and unwind them before exporting as CSV.

Currently I have done this by pulling the original documents with no aggregation and performing this action in Python using pandas; the code is just below.

df['RN'] = (df.sort_values(['timestamp'], ascending=[True])
              .groupby(['DSN1', 'DSN2', 'TID', 'Column', 'Row', 'Opt', 'CSN'])
              .cumcount() + 1)

df2 = df.loc[(df['RN'] >= 30) & (df['RN'] <= 180)]

Do you have any suggestion or alternative?

Thanks.

Regards,

Ohm

@Preutti_Puawade,

I think you should be able to sort at the beginning of the pipeline on the timestamp field. If you need an ascending sort, for example, create an index on { timestamp: 1 }.

db.collection.aggregate(
    [
        { "$sort": { "timestamp": 1 } },
        { "$limit": 200000 },
        {
            "$group": {
                "_id": {
                    "TID": "$TID",
                    "Opt": "$Opt",
                    "DSN1": "$DSN1",
                    "DSN2": "$DSN2",
                    "Column": "$Column",
                    "Row": "$Row",
                    "CSN": { "$substr": ["$CSN", 2, -6] }
                },
                "details": {
                    "$push": {
                        "A": "$A",
                        "B": "$B",
                        "C": "$C",
                        "D": "$D",
                        "timestamp": "$timestamp"
                    }
                }
            }
        },
        {
            "$unwind": {
                "path": "$details",
                "includeArrayIndex": "arrIndex"
            }
        },
        { "$match": { "arrIndex": { "$gte": 30, "$lte": 180 } } }
    ],
    {
        allowDiskUse: true
    }
);

Something like that might be ok for you.

Thanks,
Pavel

@Pavel_Duchovny Thank you very much. I spent the whole of yesterday on this and got something quite close to your code; mine puts "$sort" after "$group":

    {
        "$sort": {
            "_id": 1.0,
            "timestamp": 1.0
        }
    }

I’ll try yours in case it gives better performance. Thank you again for helping with this. I appreciate your kind advice.

Thanks & Regards,

Ohm.

Just to record the complete pipeline:

db.collection.aggregate(
    [
        {
            "$match": {
                "$and": 
                    [
					    { "TID": /XP/ },
                        { "timestamp": { "$gte": "2021-05-19 00:00:00" } },
                        { "timestamp": { "$lte": "2021-05-27 23:59:59" } },
                        { "Col": /^\d.*$/  },
                        { "MT0": /^\d.*$/ },
                        { "Row": /^\d.*$/ },
                        { "TRM": /^\d.*$/ },
                        { "WRLFTR_0": /^\d.*$/ },
                        { "WRLFTR_1": /^\d.*$/ }
                    ]
            }
        },
        {
            "$group": {
                "_id": {
                    "TID": "$TID",
                    "Opt": "$Opt",
                    "DSN1": "$DSN1",
                    "DSN2": "$DSN2",
                    "Col": "$Col",
                    "Row": "$Row",
                    "CSN": "$CSN",                                   
                },
                "details": {
                    "$push": {                           // partition over
                        "MT0": "$MT0",
                        "WRLFTR_0": "$WRLFTR_0",
                        "WRLFTR_1": "$WRLFTR_1",
						"TRM": "$TRM",
                        "timestamp": "$timestamp"
                    }
                },
            }
        },
        {
            "$sort": {
                "_id": 1.0,
                "timestamp": 1.0                        // order by timestamp
            }
        },
        {
            "$unwind": {
                "path": "$details",
                "includeArrayIndex": "array_idx"        // row_number
            }
        },
        {
            "$match": {                                // select only order 30 - 180 in array
                "$and": [
                    {
                        "array_idx": {
                            "$gte": 30.0
                        }
                    },
                    {
                        "array_idx": {
                            "$lte": 180.0
                        }
                    }
                ]
            }
        },
        {
            "$sort": {                                // sort TRM for percentile calculation
                "details.TRM": 1.0
            }
        },        
        {
            "$group": {                                // group parameter back to array by partition "_id" 
                "_id": "$_id",
                "timestamp": { "$push": {"$toDate":"$details.timestamp"} },
                "MT0": { "$push": {"$toDouble":"$details.MT0"} },
                "TRM": { "$push": {"$toDouble":"$details.TRM"} },
                "WRLFTR_0": { "$push": {"$toDouble":"$details.WRLFTR_0"} },
                "WRLFTR_1": { "$push": {"$toDouble":"$details.WRLFTR_1"} },
            }
        },			
        {                                               // reporting "_id" and calculating 
            "$project": {
                "_id": 0,
                "TID": "$_id.TID",
                "Col": {
                    "$toDecimal": "$_id.Col"
                },
                "Row": {
                    "$toDecimal": "$_id.Row"
                },
                "CSN": {
                    "$substr": [
                        "$_id.CSN",
                        2.0,
                        -6.0
                    ]
                },
                "Opt": "$_id.Opt",
                "PID": "$_id.PID",                 
                "DSN1": "$_id.DSN1",
                "DSN2": "$_id.DSN2",
                "CBU_ID": "$_id.CBU_ID",                
                "MAX_TRM": {
                    "$max": "$TRM"
                },
                "Q3_TRM": { 
                    "$arrayElemAt": [ "$TRM", {"$floor": { "$multiply": [0.75,{"$size": "$TRM"}] }} ]  // pick up number at 75th percent of all data as Percentile 75
                },
                "AVG_TRM": {
                    "$avg": "$TRM"
                },
                "MAX_WRLFTR_0": {
                    "$max": "$WRLFTR_0"
                },
                "MAX_WRLFTR_1": {
                    "$max": "$WRLFTR_1"
                },
                "MAX_MT0": {
                    "$max": "$MT0"
                },
                "MAX_timestamp": {
                    "$max": "$timestamp"
                },
               
            }
        },
    ],
    {
        "allowDiskUse": true
    }
);
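
For the final CSV export, one option Pavel mentioned is to define a view from this pipeline and export it with mongoexport. A minimal sketch in PyMongo, assuming hypothetical names (mydb, trm_stats_view) and that the stages above are held in a Python list called pipeline:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")   # hypothetical connection string
    db = client["mydb"]                                 # hypothetical database name

    pipeline = [
        # ... the $match / $group / $sort / $unwind / $match / $sort / $group / $project stages above ...
    ]

    # Create a read-only view backed by the pipeline; reading the view runs
    # the whole pipeline on the server.
    db.create_collection("trm_stats_view", viewOn="collection", pipeline=pipeline)

    # The view can then be exported with mongoexport (CSV output requires --fields), e.g.:
    #   mongoexport --db=mydb --collection=trm_stats_view --type=csv \
    #       --fields="TID,Col,Row,CSN,Opt,DSN1,DSN2,MAX_TRM,Q3_TRM,AVG_TRM,MAX_WRLFTR_0,MAX_WRLFTR_1,MAX_MT0,MAX_timestamp" \
    #       --out=export.csv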
