Help writing aggregation query using PyMongo instead of pandas

I have a mongodb data like below:

     code    date      num  price  money
0      2  2015-11-15   10    3.8  -38.0
1      2  2015-11-17  -10    3.7   37.0
2      2  2015-11-20   20    3.5  -70.0
3      2  2016-04-01   10    3.2  -32.0
4      2  2016-04-02  -30    3.6  108.0
5      2  2016-04-03   50    3.4 -170.0
6      2  2016-11-01  -40    3.5  140.0
7      3  2015-02-01   25    7.0 -175.0
8      3  2015-05-01   35    7.5 -262.5
9      3  2016-03-01  -15    8.0  120.0
10     5  2015-11-20   50    5.0 -250.0
11     5  2016-06-01  -50    5.5  275.0
12     6  2015-02-01   35   11.5 -402.5 

I want to get the number of securities held and the funds currently occupied by the securities If I take out the data, I can get the result I want in the following way:

import pandas as pd
import numpy as np

df=pd.DataFrame({'code': [2,2,2,2,2,2,2,3,3,3,5,5,6],
        'date': ['2015-11-15','2015-11-17','2015-11-20','2016-04-01','2016-04-02','2016-04-03','2016-11-01','2015-02-01','2015-05-01','2016-03-01','2015-11-20','2016-06-01','2015-02-01'],
        'num' : [10,-10, 20, 10, -30,50, -40, 25, 35, -15, 50, -50, 35],
        'price': [3.8,3.7,3.5,3.2, 3.6,3.4, 3.5, 7, 7.5,  8, 5,  5.5, 11.5],
        'money': [-38,37,-70,-32, 108,-170, 140,-175,-262.5,120,-250, 275,-402.5]
        })


print(df,"\n------------------------------------------\n")
df['hold'] = df.groupby(['code'])['num'].cumsum()
df['type'] = np.where(df['hold'] > 0, 'B', 'S')
df['total']=df['total1']= df.groupby(['code'])['money'].cumsum()

def abc(dfg):
  if dfg[dfg['hold'] == 0]['hold'].count():
    subT = dfg[dfg['hold'] == 0]['total1'].iloc[-1]
    dfg['total'] = np.where(dfg['hold'] > 0, dfg['total']-subT, dfg['total'])
  return dfg
dfR = df.groupby(['code'], as_index=False)\
    .apply(abc) \
    .drop(['type', 'total1'], axis=1) \
    .reset_index(drop=True)

df1=dfR.groupby(['code']).tail(1)
print(df1,"\n------------------------------------------\n")

out

     code     date    num  price  money  *hold*  *total*
6      2  2016-11-01  -40    3.5  140.0    *10*  *-30.0*
9      3  2016-03-01  -15    8.0  120.0    *45* *-317.5*
11     5  2016-06-01  -50    5.5  275.0     *0*   *25.0*
12     6  2015-02-01   35   11.5 -402.5    *35* *-402.5* 

If use the mongodb method (such as aggregate, or other), how can i directly obtain the same result as above?

@fei_hua Welcome to the MongoDB community.

The aggregation query is the correct approach. An aggregation query has pipeline with stages. Each stage processes the documents which is input to the next stage. The first stage in the pipeline has the collection documents as input. In your aggregation you need to use two stages to get the desired result.

The first stage is $sort. This is to sort the code and date, both in ascending order.

The second stage is the $group stage. This is to group by the the code field and get the output with the accumulations. The grouping key is the code field. Sum the num to get the hold, and sum the money to get the total - see the $sum accumulater operator. The last value of the sorted code+date gives you the num, price, money - see $last aggregation group operator.

The links to aggregation provided above are for mongo shell methods. Here is link to PyMongo documentation.

Note that you can also build an aggregation query in the Compass GUI tool and generate PyMongo code automatically - using the Aggregation Pipeline Builder (after building the pipeline click the Export to Language button).

sort = {'$sort': {
    'code': 1,
    'date': 1,
}}

group = {'$group': {
    '_id': {'code': '$code'},
    "date": { "$last": "$date" }, 
    'hold': {'$sum': '$num'},
    'total': {'$sum': '$money'},
}}

mydoc=mycol.aggregate([sort,group])
data = pd.DataFrame(mydoc)
print(data)

Is that so?
But the result is:

             _id       date  hold  total
0  {'code': '2'} 2016-11-01  10.0  **-25.0**
1  {'code': '6'} 2015-02-01  35.0 -402.5
2  {'code': '5'} 2016-06-01   0.0   25.0
3  {'code': '3'} 2016-03-01  45.0 -317.5

In the $group stage you can use the $last on other fields you want in the output (other than the accumulated values). Now add following $project and $sort stages (after the $group), to get the result:

{ '$project': { 'code': '_id.code', 'date': 1, 'num': 1, 'price': 1, 'money': 1, 'hold': 1, 'total': 1, '_id': 0 } }
{ '$sort': { 'code': 1 } }

I mean, the total of the first line, it is not the desired result

Try this code:

pipeline = [
  { '$sort': { 'code': 1, 'date': 1 } },
  { '$group': { '_id': '$code', 'num': { '$last': '$num' }, 'price': { '$last': '$price' }, 'money': { '$last': '$money' }, 'hold': { '$sum': '$num' }, 'total': { '$sum': '$money' } } },
  { '$project': { 'code': '$_id', 'date': 1, 'num': 1, 'price': 1, 'money': 1, 'hold': 1, 'total': 1, '_id': 0 } },
  { '$sort': { 'code': 1 } }
]

pprint.pprint(list(collection.aggregate(pipeline)))

Yes, I ran it with the code you said, look at the number in the first line as the total
It is -25.0, the number I want is -30.0

    num  price  money  hold  total code
0 -40.0    3.5  140.0  10.0  -25.0    2
1 -15.0    8.0  120.0  45.0 -317.5    3
2 -50.0    5.5  275.0   0.0   25.0    5
3  35.0   11.5 -402.5  35.0 -402.5    6

The number -30.0 - how did you get (calculate) that figure? Please explain.

Use the following method
I just want to see if it is possible to use aggregation query (or other methods of pymongo) to get this result

I am not familiar with the pandas. If you can explain in plain English or pseudo-code, I can try.

     code    date      num  price  money
0      2  2015-11-15   10    3.8  -38.0
1      2  2015-11-17  -10    3.7   37.0
2      2  2015-11-20   20    3.5  -70.0
3      2  2016-04-01   10    3.2  -32.0
4      2  2016-04-02  -30    3.6  108.0
5      2  2016-04-03   50    3.4 -170.0
6      2  2016-11-01  -40    3.5  140.0
7      3  2015-02-01   25    7.0 -175.0
8      3  2015-05-01   35    7.5 -262.5
9      3  2016-03-01  -15    8.0  120.0
10     5  2015-11-20   50    5.0 -250.0
11     5  2016-06-01  -50    5.5  275.0
12     6  2015-02-01   35   11.5 -402.5 

Calculate groupby code from top to bottom
a=sum(num)
b=0
c=sum(money)

Judgment during accumulation
if(a==0){
b+=c;
c=0;
}

Judgment after accumulation
if(a==0){
total=b
}else{
total=c
}
Just like the above, I don’t know if I express it clearly

If calculated in excel
excel formula:

code date num price money hold total(middle)
2 2015-11-15 10 3.8 -38 =C2 =E2
2 2015-11-17 -10 3.7 37 =F2+C3 =G2+E3
2 2015-11-20 20 3.5 -70 =F3+C4 =G3+E4
2 2016-4-1 10 3.2 -32 =F4+C5 =G4+E5
2 2016-4-2 -30 3.6 108 =F5+C6 =G5+E6
2 2016-4-3 50 3.4 -170 =F6+C7 =G6+E7
2 2016-11-1 -40 3.5 140 =F7+C8 =G7+E8
3 2015-2-1 25 7 -175 =C9 =E9
3 2015-5-1 35 7.5 -262.5 =F9+C10 =G9+E10
3 2016-3-1 -15 8 120 =F10+C11 =G10+E11
5 2015-11-20 50 5 -250 =c12 =E12
5 2016-6-1 -50 5.5 275 =F12+C13 =G12+E13
6 2015-2-1 35 11.5 -402.5 =c14 =E14

Result of formula calculation in excel:

code date num price money hold total(middle)
2 2015-11-15 10 3.8 -38 10 -38
2 2015-11-17 -10 3.7 37 0 -1
2 2015-11-20 20 3.5 -70 20 -71
2 2016-4-1 10 3.2 -32 30 -103
2 2016-4-2 -30 3.6 108 0 5
2 2016-4-3 50 3.4 -170 50 -165
2 2016-11-1 -40 3.5 140 10 -25
3 2015-2-1 25 7 -175 25 -175
3 2015-5-1 35 7.5 -262.5 60 -437.5
3 2016-3-1 -15 8 120 45 -317.5
5 2015-11-20 50 5 -250 50 -250
5 2016-6-1 -50 5.5 275 0 25
6 2015-2-1 35 11.5 -402.5 35 -402.5

then
a=The last total (middle) of the code corresponding to the hold equal to 0
If the a of code does not exist
a=0

if the hold of code is not equal to 0
total=total(middle)-a
else
total=total(middle)

then take the last line of data for each code,that is the result i want

map = Code("""function () {
        emit(this.code, {hold:this.num,total:this.money,total1:0});
    }""")

reduce = Code("""function (key, values) {
        var a={hold:0,total:0,total1:0};
        for (var i = 0; i < values.length; i++) {
            a.hold += values[i].hold;
            a.total +=values[i].total;
            a.total1 +=values[i].total;
            if (a.hold==0){
                a.total1=0
            }
        }
        return a;
    }""")

I have used map_reduce to get the figure, but I still want to ask, is it possible to use aggregation query to get such results

You can try this aggregation:

pipeline = [
  {
       '$sort': { 'code': 1, 'date': 1 }
   },
  { 
      '$group': { 
          '_id': '$code', 
          'num': { '$last': '$num' }, 'price': { '$last': '$price' }, 'money': { '$last': '$money' }, 
          'code_data': { '$push': { 'n': "$num", 'm': "$money" } } 
      } 
  },
  { 
      '$addFields': { 
          'result': { 
               '$reduce': { 
                   'input': '$code_data', 
                   'initialValue': { 'hold': 0, 'sum_m': 0, 'total': 0 }, 
                   'in': { 
                       '$let': {
                           'vars': { 
                               'hold_': { '$add': [ '$$this.n', '$$value.hold' ] },  
                               'sum_m_': { '$add': [ '$$this.m',  '$$value.sum_m' ] }
                            },
                            'in': { 
                                '$cond': [ { '$eq': [ '$$hold_', 0 ] },  
                                           { 'hold': '$$hold_', 'sum_m': 0, 'total': '$$sum_m_' },
                                           { 'hold': '$$hold_', 'sum_m': '$$sum_m_', 'total': '$$sum_m_' }
                                         ] 
                            }
                        }
                    }
               } 
           }
      } 
  },
  { 
      '$addFields': { 'code': '$_id',  'hold': '$result.hold', 'total': '$result.total' } 
  },
  { 
      '$project': { 'code_data': 0, 'result': 0, '_id': 0 } 
  },
  { 
      '$sort': { 'code': 1 } 
  }
]

thank you very much

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.