Determining time overlapping between two different kinds of files from the same collection

I have a face recognition collection with two different kinds of files:

1) Broadcasting files:

{
    "_id": {"$oid": "5e870200adbe1d000183fa4d"},
    "data": 
    {
        "begin": "2020-03-30 10:20:29",
        "end": "2020-03-30 10:20:32",
        "file": "salvamento4.mp4",
        "type": "video"
    },
    "idSensor": 3,
    "idDevice": 5
}

2) Reaction (to broadcasting) files:

{
    "_id": {"$oid": "5e86fe50adbe1d0001472c0f"},
    "data": 
    {
        "Trackings":
        [{
            "BeginTime": "2020-03-30T08:23:42.034893+00:00",
            "FaceInfo":
            {
                "Age": 26.34,
                "Emotion": "NEUTRAL",
                "IsDetected": true,
                "MaleProbability": 0.71,
                "gazeTime": 2.37,
                "numGazes": 71
            },
            "ImageSize": 
            {
                "Height": 1080,
                "Width": 1920
            },
            "LookingDuration": 2.37,
            "PersonID": "P-2020-03-30_2749",
            "ReIDInfo": {"NumReIDs": 1},
            "RoiInfo": {"RoiDuration": 0.17},
            "SensorID": 0,
            "TrackingDuration": 2.77,
            "Trajectory": null,
            "direction": null,
            "id": 1,
            "roiName": 0,
            "roiType": 1
        }],
        "timestamp": "2020-03-30T08:23:52.327678"
    },
    "idSensor": 2,
    "idDevice": 5
}

Join field is idDevice: a device broadcasts videos and records reactions at the same time.

I need to cross both kind of files to determine which emissions are watched by which people, in order to estimate with a BI software if some videos have greater impact on audience than others. There are tons of emissions but only a little amount of different videos; different reactions might also come from recurrent customers (that’s why there is a PersonID).

The idea is to check overlapping between broadcasting time (that starts at data.begin and finishes at data.end) and reaction time (that starts at data.Trackings.BeginTime and finishes at data.Trackings.BeginTime + data.Trackings.TrackingDuration) in order to get a table similar to this (just a simple example for one video that provokes three different reactions; ultimate outcome would include also other parameters like Emotion, Age, etc.):

idDevice idBroadcast dtBrBeginTime    dtBrEndTime      idReaction dtReBeginTime    dtReEndTime 
1        1           2020-07-03 10:00 2020-07-03 10:03 1          2020-07-03 09:58 2020-07-03 10:02
1        1           2020-07-03 10:00 2020-07-03 10:03 2          2020-07-03 10:01 2020-07-03 10:07
1        1           2020-07-03 10:00 2020-07-03 10:03 3          2020-07-03 10:01 2020-07-03 10:02

In this simple example, 1 emission has been watched by 3 people (or has triggered 3 different reactions); i.e., 1 broadcasting file is related to 3 reaction files. How do we know this? I think the simplest way (correct me if you think there’s a better solution) is to verify these two conditions:

  • data.Trackings.BeginTime (or dtReBeginTime) not gte data.end (or dtBrEndTime)
  • data.Trackings.BeginTime + data.Trackings.TrackingDuration (or dtReEndTime) not
    lte data.begin (or dtBrBeginTime)

My expertise in MongoDb is limited to making very simple A-F queries ($match, $project, I also have used $unwind for breaking data.Trackings into parts), so I have almost no idea about how to address this issue… maybe with $lookup? I’d appreciate any kind of help.

Thanks a lot in advance.

Hello, @Javier_Blanco! Welcome to the community!

If I understood you correctly, you need to join ‘broadcasting’ collection with ‘reaction’ collection and then get the trackings, that were taken within range between ‘begin’ and ‘end’ documents from ‘broadcasting’ collection, right?

If so, you need to use $lookup + $unwind to join collections, then you need to match the necessary documents with $filter and add them to a document with $addFields. In the end up $group documents, because you used $unwind before.

Let me simplify your dataset to show it on example.
So, assume we have 2 collections ‘test6’ and ‘test7’:

db.test6.insertMany([
  {
    "_id": "D1",
    "data":
      {
        "begin": "2020-03-30 10:15:45",
        "end": "2020-03-30 10:20:45",
      },
    "idDevice": 5
  }
]);

db.test7.insertMany([
  {
    "_id": "B1",
    "data":
      {
        "trackings": [
          {
            "_id": "T1-of-B1",
            "beginTime": "2020-03-30 10:10:45",
            "endTime": "2020-03-30 10:14:45",
          },
          {
            "_id": "T3-of-B1",
            "beginTime": "2020-03-30 10:15:45",
            "endTime": "2020-03-30 10:16:45",
          },
          {
            "_id": "T3-of-B1",
            "beginTime": "2020-03-30 10:14:45",
            "endTime": "2020-03-30 10:25:45",
          },
        ]
      },
    "idDevice": 5
  },
  {
    "_id": "B2",
    "data":
      {
        "trackings": [
          {
            "_id": "T1-of-B2",
            "beginTime": "2020-03-30 10:13:45",
            "endTime": "2020-03-30 10:16:45",
          },
          {
            "_id": "T2-of-B2",
            "beginTime": "2020-03-30 10:17:45",
            "endTime": "2020-03-30 10:17:45",
          },
        ]
      },
    "idDevice": 5
  }
]);

To achieve the goal we could use this aggregation:

db.test6.aggregate([
  {
    $lookup: {
      // join data form 'test7' into 'test6' collection
      from: 'test7',
      localField: 'idDevice',
      foreignField: 'idDevice',
      as: 'reactions',
    }
  },
  {
    $unwind: '$reactions',
  },
  {
    $addFields: {
      matchedTrackings: {
        $filter: {
          input: '$reactions.data.trackings',
          cond: {
            // filter trakings whatever you like here
            $and: [
              { $lte: ['$data.begin', '$$this.beginTime'] },
              { $gte: ['$data.end', '$$this.endTime'] }
            ]
          }
        },
      }
    }
  },
  {
    // group documents back after calculations are made
    $group: {
      _id: '$_id',
      data: {
        $first: '$data',
      },
      idDevice: {
        $first: '$idDevice',
      },
      reactions: {
        $push: '$reactions',
      },
      matchedTrackings: {
        $push: '$matchedTrackings'
      }
    }
  }
]).pretty();

The above aggregation provides the result in the following format:

{
  "_id" : "D1",
  "data" : {
    "begin" : "2020-03-30 10:15:45",
    "end" : "2020-03-30 10:20:45"
  },
  "idDevice" : 5,
  "reactions" : [
    {
      "_id" : "B1",
      "data" : {
        "trackings" : [
          {
            "_id" : "T1-of-B1",
            "beginTime" : "2020-03-30 10:10:45",
            "endTime" : "2020-03-30 10:14:45"
          },
          {
            "_id" : "T3-of-B1",
            "beginTime" : "2020-03-30 10:15:45",
            "endTime" : "2020-03-30 10:16:45"
          },
          {
            "_id" : "T3-of-B1",
            "beginTime" : "2020-03-30 10:14:45",
            "endTime" : "2020-03-30 10:25:45"
          }
        ]
      },
      "idDevice" : 5
    },
    {
      "_id" : "B2",
      "data" : {
        "trackings" : [
          {
            "_id" : "T1-of-B2",
            "beginTime" : "2020-03-30 10:13:45",
            "endTime" : "2020-03-30 10:16:45"
          },
          {
            "_id" : "T2-of-B2",
            "beginTime" : "2020-03-30 10:17:45",
            "endTime" : "2020-03-30 10:17:45"
          }
        ]
      },
      "idDevice" : 5
    }
  ],
  "matchedTrackings" : [
    [
      {
        "_id" : "T3-of-B1",
        "beginTime" : "2020-03-30 10:15:45",
        "endTime" : "2020-03-30 10:16:45"
      }
    ],
    [
      {
        "_id" : "T2-of-B2",
        "beginTime" : "2020-03-30 10:17:45",
        "endTime" : "2020-03-30 10:17:45"
      }
    ]
  ]
}

You can $project out fields, that are not necessary for the output.

1 Like

Hi, @slava, thanks a lot for your answer!

Yes, that’s the idea basically. Maybe I didn’t point this out, but reaction files don’t need to be within range of broadcasting files strictly but just overlapping it; for instance, a reaction that went from 9:57 to 10:02 would match a video that went from 10:01 to 10:05.

On the other hand, both kinds of files are within the same collection in my DB, not sure if that’s a handicap for using $lookup, for instance.

I’m going to try your solution and come back if I need additional guidance.

You may try to adjust ‘$and’ conditions:

$and: [
  // reaction began before broadcasting ended;
  { $lt: ['$data.begin', '$$this.endTime'] },
  // reaction ended just when broadasting ended or later
  { $gte: ['$data.end', '$$this.endTime'] }
]

$lookup will work just fine. If you use the same collection name - it will do a self join.

1 Like

Oh, for some reason, Trackings comes as an array of objects but in every file I have checked it only contains one, at index 0. Anyway, I guess it’s better to suppose it might have more than one element as @slava has done, just in case.

Hi again, @slava,

For what I see in a demo collection with just 8 files, a self $lookup generates 8 files with an 8 objects reactions array each, so in my development collection the outcome might be so huge that the server (I’m using NoSQL Booster as client) is unable to process the request:

Total size of documents in sensorsData matching pipeline’s $lookup stage exceeds 104857600 bytes

Adding allowDiskUse doesn’t make any difference.

Helo, @Javier_Blanco,
Can you provide:

  • a prettified sample document from your collection with only the props, that are involved in aggregation?
  • the actual prettified aggregation that you’re using
  • output of db.collection.stats()

Hi, @slava,

Sorry for the delay, I’ve been busy with other projects.

Not very sure about this issue, the DBA has made some changes and now it seems to work… So for the moment it’s OK. Let’s see in a while…

Regarding the $addFields step of your query, there’s no endTime within data.Trackings, it must be generated as the sum of BeginTime and TrackingDuration; should I $unwind data.Trackings first or is it possible to solve this issue in a simpler way?

Thanks again!

Within $addFields, I have tried to change

$$this.endTime

by

$$this.{"$add": [{"$toDate": "$data.Trackings.BeginTime"}, {"$multiply": ["$data.Trackings.TrackingDuration", 1000]}]}

and it seems to work -or at least I don’t get an error message-.

OK, I have done this finally (it’s a sample DB with just a few files):

var pipeline = 
[
    {
        "$lookup":
        {
            "from": "data",
            "localField": "idDevice",
            "foreignField": "idDevice",
            "as": "reactions"
        }
    },
    {
        "$unwind": "$reactions"
    },
    {
        "$match": {"$and": [{"data.begin": {"$exists": true}}, {"reactions.data.timestamp": {"$exists": true}}]}
    },
    {
        "$project":
        {
            "_id": 0,
            "idDevice": "$idDevice",
            "naBroadcast": "$data.file",
			"naType": "$data.type",
            "dtBroadcastStart": {"$toDate": "$data.begin"},
            "dtBroadcastEnd": {"$toDate": "$data.end"},
            "array": "$reactions.data.Trackings",
        }
    },
    {
		"$unwind": "$array"
	},
	{
        "$project":
        {
            "idDevice": 1,
            "naBroadcast": 1,
			"naType": 1,
            "dtBroadcastStart": 1,
            "dtBroadcastEnd": 1,
            "qtBroadcastDurationS": {"$divide": [{"$subtract": [{"$toDate": "$dtBroadcastEnd"}, {"$toDate": "$dtBroadcastStart"}]}, 1000]},
            "idPerson": "$array.PersonID",
			"dtTrackingStart": {"$toDate": "$array.BeginTime"},
			"dtTrackingEnd": {"$add": [{"$toDate": "$array.BeginTime"}, {"$multiply": ["$array.TrackingDuration", 1000]}]},
			"qtFaceDetected": 
			{
				"$cond": 
				{
					"if": {"$eq": ["$array.FaceInfo.IsDetected", true]}, 
					"then": 1, 
					"else": 0
				}
			},
			"qtMaleProbability": "$array.FaceInfo.MaleProbability",
			"qtAge": "$array.FaceInfo.Age",
			"naEmotion": "$array.FaceInfo.Emotion",
			"qtGaze": "$array.FaceInfo.numGazes",
			"qtGazeDurationS": "$array.FaceInfo.gazeTime",
			"qtFaceDurationS": "$array.LookingDuration",
			"qtTrackingDurationS": "$array.TrackingDuration",
			"qtReId": "$array.ReIDInfo.NumReIDs"
        }
	},
]

db.data.aggregate(pipeline)

Starting with a sample of 3 broadcasting files and 5 reaction files, I get 15 broadcasting-reaction files, all combined chances. The way they look:

Now I need to filter them to get just the few ones that check both conditions:

  • dtTrackingEnd not lte dtBroadcastStart
  • dtTrackingStart not gte dtBroadcastEnd

I have tried one last stage within my pipeline:

{
    "$match":
    {
        "$and": [{"dtTrackingEnd": {"$ne": ["$lte", "$dtBroadcastStart"]}}, {"dtTrackingStart": {"$ne": ["$gte", "$dtBroadcastEnd"]}}]
    }
}

It doesn’t work. I get all 15 files again, while according to my calculations I should get only 6. Not sure if $ne is what I need, but I get an error if I try to use $not. Any hint?

Thanks in advance!

Hello, @Javier_Blanco!

Try to integrate the filter that I have suggested you before into your $match stage:

Hint:

{"$ne": ["$lte", "$dtBroadcastStart"]}

$ne - is an operator.
$lte and dbBroadcastStart are its arguments and since they have -sign in the beginning, MongoDB treats both of them as field names, so $lte will be always ‘undefined’, because you did not declared such prop in your $project stage and $dbBroadcastStart will evaluate to some real data.

$ne will compare its two arguments and return ‘true’ for a document, if the first argument ($lte) is not equal to the second argument ($dtBroadcastStart) and since ‘undefined’ will always be not equal to some real data, that condition will return ‘true’ for all your 15 documents. That’s why all of them pass your last $match stage.

1 Like
{
    "$match":
    {
        "$and": 
        [
            {"$lt": ["$dtBroadcastStart", "$this.dtTrackingEnd"]},
            {"$gte": ["$dtBroadcastEnd", "$this.dtTrackingEnd"]}
        ]
    }
}

Regardless of the content of the query, I get this error message: unknown top level operator: $lt. This has been happening to me constantly while trying to implement that last $match stage…

I think the proper query would be (1):

{
    "$match":
    {
        "$and": 
        [
            {"dtBroadcastStart": ["$lt", "$dtTrackingEnd"]}, 
            {"dtBroadcastEnd": ["$gt", "$dtTrackingStart"]}
        ]
    }
}

But it returns nothing (at least spits back no error message):
image

However, I have uploaded my query without that last $match stage to my BI software and applied a boolean filter to all 15 rows (files turn into rows in the datasets of that app); in fact, two boolean filters:

  • First equivalent to the not one from previous posts:
    cond(and(not(lte([dtTrackingEnd], [dtBroadcastStart])), not(gte([dtTrackingStart], [dtBroadcastEnd]))), 1, 0)

  • Second equivalent to (1):
    cond(and(lt([dtBroadcastStart], [dtTrackingEnd]), gt([dtBroadcastEnd], [dtTrackingStart])), 1, 0)

And both work; both return just the same six rows (or files) I had calculated by myself would be the proper outcome.

So it seems (1) is logically correct, but doesn’t work in Mongo despite there’s no error message… any hint?

It seems Mongo is optimizing the query and displaying the $match stage in 4th place:

image

Try to wrap this:

with $expr operator in your $match stage

1 Like

Great, now it works!

The query looks like this:

[
    {
        "$lookup": /*Une cada documento con los demás -incluyendo consigo mismo-, que agrupa en un array con tantas posiciones como documentos (8, de 0 a 7, en la colección de muestra)*/
        {
            "from": "data",
            "localField": "idDevice",
            "foreignField": "idDevice",
            "as": "array1"
        }
    },
    {
        "$unwind": "$array1" /*Descompone los archivos en función de las posiciones del array, por lo que genera 8 x 8 = 64 archivos*/
    },
    {
        "$match": {"$and": [{"data.inicio": {"$exists": true}}, {"array1.data.timestamp": {"$exists": true}}]} /*Deja pasar sólo aquellos archivos con estructura vídeo-reacción y elimina las demás combinaciones: vídeo-vídeo, reacción-reacción y reacción-vídeo (redundantes); dado que hay 3 vídeos y 5 reacciones, pasan 15 archivos*/
    },
    {
        "$project": /*Selección inicial de parámetros*/
        {
            "_id": 0,
            "idDevice": "$idDevice",
            "naBroadcast": "$data.archivo",
			"naType": "$data.tipo",
            "dtBroadcastStart": {"$toDate": "$data.inicio"},
            "dtBroadcastEnd": {"$toDate": "$data.fin"},
            "array2": "$array1.data.Trackings"
        }
    },
    {
		"$unwind": "$array2" /*Descomposición en función de las posiciones del array; como en los documentos de muestra no existe más que el índice 0, genera de nuevo 15 archivos*/
	},
	{
        "$project": /*Proyección final de parámetros*/
        {
            "idDevice": 1,
            "naBroadcast": 1,
			"naType": 1,
            "dtBroadcastStart": 1,
            "dtBroadcastEnd": 1,
            "qtBroadcastDurationS": {"$divide": [{"$subtract": [{"$toDate": "$dtBroadcastEnd"}, {"$toDate": "$dtBroadcastStart"}]}, 1000]},
            "idPerson": "$array2.PersonID",
			"dtTrackingStart": {"$toDate": "$array2.BeginTime"},
			"dtTrackingEnd": {"$add": [{"$toDate": "$array2.BeginTime"}, {"$multiply": ["$array2.TrackingDuration", 1000]}]},
			"qtFaceDetected": 
			{
				"$cond": 
				{
					"if": {"$eq": ["$array2.FaceInfo.IsDetected", true]}, 
					"then": 1, 
					"else": 0
				}
			},
			"qtMaleProbability": "$array2.FaceInfo.MaleProbability",
			"qtAge": "$array2.FaceInfo.Age",
			"naEmotion": "$array2.FaceInfo.Emotion",
			"qtGaze": "$array2.FaceInfo.numGazes",
			"qtGazeDurationS": "$array2.FaceInfo.gazeTime",
			"qtFaceDurationS": "$array2.LookingDuration",
			"qtTrackingDurationS": "$array2.TrackingDuration",
			"qtReId": "$array2.ReIDInfo.NumReIDs"
        }
	},
    {
        "$match": /*Filtrado de los documentos que cumplen las condiciones de solapamiento, que para la colección de muestra son sólo 6*/
        {
            "$expr":
            {
                "$and": 
                [
                    {"$lt": ["$dtBroadcastStart", "$dtTrackingEnd"]},
                    {"$gt": ["$dtBroadcastEnd", "$dtTrackingStart"]}
                ]
            }
        }
    }
]

I guess it’s not a very efficient one, but as my first approach to $lookup I’m pretty satisfied.

Thanks a lot for your time, @slava!

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.