Using an index to intersect intervals

Purpose.

Select the documents of collection c1 that overlap, on the chrom field and the start-end interval, at least one document in c2 and at least one document in c3.

Collections.

c1:

chrom start end
chr1 1 10
chr1 10 20
chr1 20 30
chr1 30 40
chr1 40 50

(and then a lot of similar documents)

c2:

chrom start end
chr1 5 8
chr1 5 16
chr1 25 32
chr1 50 64
chr1 100 128

(and then a lot of similar documents)

c3:

chrom start end
chr1 7 14
chr1 28 49
chr2 1 7
chr2 14 21
chr2 70 77

(and then a lot of similar documents)

Indexes.

chrom_1_start_1_end_1
chrom_1
start_1
end_1
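For reference, these index names correspond to the following key specifications (1 = ascending); on a live PyMongo connection each could be created with coll.create_index(keys). This is a sketch of the specs only, not the author's creation code:

```python
# Key specifications matching the index names above (1 = ascending).
# On a live PyMongo connection: coll.create_index(keys) for each value.
index_specs = {
    'chrom_1_start_1_end_1': [('chrom', 1), ('start', 1), ('end', 1)],
    'chrom_1': [('chrom', 1)],
    'start_1': [('start', 1)],
    'end_1':   [('end', 1)],
}
```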

Aggregation pipeline.

pipeline = [{'$lookup': {'from': right_coll_name,
                         'let': {'chrom': '$chrom', 'start': '$start', 'end': '$end'},
                         'pipeline': [{'$match': {'$expr': {'$and': [{'$eq': ['$$chrom', '$chrom']},
                                                                     {'$lt': [{'$max': ['$$start', '$start']},
                                                                              {'$min': ['$$end', '$end']}]}]}}}],
                         'as': right_coll_name}} for right_coll_name in right_coll_names]

Technically, this code is fully functional.
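The overlap test the pipeline encodes is max(start1, start2) < min(end1, end2). A minimal pure-Python sketch of the same predicate, checked against sample intervals from the tables above:

```python
def overlaps(a, b):
    """True if intervals a = (chrom, start, end) and b overlap,
    using the same max/min test as the $lookup sub-pipeline."""
    return a[0] == b[0] and max(a[1], b[1]) < min(a[2], b[2])

# Sample documents from c1 and c2 above.
c1_doc = ('chr1', 1, 10)
c2_doc = ('chr1', 5, 8)    # lies inside [1, 10) -> overlap
c2_far = ('chr1', 50, 64)  # disjoint from [1, 10)

print(overlaps(c1_doc, c2_doc))  # True
print(overlaps(c1_doc, c2_far))  # False
```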

Filtering the joined documents.

After the join I successfully filter the merged documents with doc[right_coll_alias] != []. Full Python code is here.
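A sketch of that filtering step; the document shapes and the alias names c2/c3 are assumptions based on the pipeline above:

```python
# Joined documents as $lookup would return them: each right-collection
# alias holds the list of matching documents (empty list if none matched).
joined_docs = [
    {'chrom': 'chr1', 'start': 1, 'end': 10,
     'c2': [{'start': 5, 'end': 8}], 'c3': [{'start': 7, 'end': 14}]},
    {'chrom': 'chr1', 'start': 40, 'end': 50,
     'c2': [], 'c3': [{'start': 28, 'end': 49}]},
]
right_coll_aliases = ['c2', 'c3']

# Keep only documents that matched at least one document
# in every right collection.
filtered = [doc for doc in joined_docs
            if all(doc[alias] != [] for alias in right_coll_aliases)]
print(len(filtered))  # 1
```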

Problem.

Judging by the terrible speed, left_coll_obj.aggregate(pipeline) doesn't use the indexes.

Question.

How do I rework a pipeline to use compound or single indexes?

Getting rid of $min/$max doesn't help: the indexes are still ignored. Here's the updated pipeline:

pipeline = [{'$lookup': {'from': right_coll_name,
                         'let': {'chrom': '$chrom', 'start': '$start', 'end': '$end'},
                         'pipeline': [{'$match': {'$expr': {'$and': [{'$eq': ['$$chrom', '$chrom']},
                                                                     {'$lt': ['$$start', '$end']},
                                                                     {'$lt': ['$start', '$$end']}]}}}],
                         'as': right_coll_name.replace('.', '_')}} for right_coll_name in right_coll_names]
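For well-formed intervals (start < end) the two formulations are logically equivalent: max(s1, s2) < min(e1, e2) holds exactly when s1 < e2 and s2 < e1. A quick brute-force check over a small grid of endpoints:

```python
from itertools import product

# Exhaustively compare the $min/$max predicate with the two-$lt
# predicate over all well-formed intervals on a small grid.
for s1, e1, s2, e2 in product(range(6), repeat=4):
    if not (s1 < e1 and s2 < e2):
        continue  # skip degenerate intervals
    minmax_form = max(s1, s2) < min(e1, e2)
    two_lt_form = s1 < e2 and s2 < e1
    assert minmax_form == two_lt_form
print('predicates agree on all well-formed intervals')
```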

In MongoDB 4.4.1 + PyMongo 3.11.0 the issue persists. I would ask the developers to investigate the problem.

Hi @Platon_workaccount,

First, it would be easier to debug performance if you separate the layers (application/db). If you haven't done so already, I'd suggest running a single aggregation pipeline directly in the mongo shell instead of from your application, to see whether the bottleneck is in the application layer.

Note that you can view detailed information about the execution plan of an aggregation pipeline by using explain(). See also Return Information on Aggregation Pipeline Operation. This method is useful for debugging, as it details the processing (including which index, if any, the aggregation pipeline operation used).
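From PyMongo there is no .explain() helper on aggregate, but the explain command document can be sent directly. A sketch of the command shape; the namespace is taken from the question, and running it needs a live connection (result = db.command(explain_cmd)):

```python
# Explain command for an aggregation; on a live PyMongo connection,
# run it with: result = db.command(explain_cmd)
pipeline = [{'$match': {'chrom': 'chr1'}}]
explain_cmd = {
    'explain': {'aggregate': 'bed_1.bed',
                'pipeline': pipeline,
                'cursor': {}},
    'verbosity': 'queryPlanner',
}
```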

If you execute explain() on your current pipeline, it will most likely show COLLSCAN. This is because you haven't specified any query stage before the $lookup stage; if an entire collection is loaded without any filtering criteria, a collection scan has less overhead than iterating an index.

For example, you could try adding {"$match": {"chrom": "chr1"}} before the $lookup stage; the explain output should then show an IXSCAN.
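Concretely, prepending that $match stage to the earlier Python pipeline might look like this (a sketch using the collection names from the question):

```python
# $match on the indexed chrom field runs first, so the planner can use
# the chrom_1 (or chrom_1_start_1_end_1) index via IXSCAN.
right_coll_names = ['bed_2.bed']  # hypothetical
pipeline = [{'$match': {'chrom': 'chr1'}}]
pipeline += [{'$lookup': {'from': name,
                          'let': {'chrom': '$chrom',
                                  'start': '$start',
                                  'end': '$end'},
                          'pipeline': [{'$match': {'$expr': {'$and': [
                              {'$eq': ['$$chrom', '$chrom']},
                              {'$lt': ['$$start', '$end']},
                              {'$lt': ['$start', '$$end']}]}}}],
                          'as': name.replace('.', '_')}}
             for name in right_coll_names]
```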

Having said all that, depending on your use case, it looks like you're going to be performing multiple expressive $lookup stages. In this case, I would suggest reconsidering your data modelling. For example, you may try to store all the data in one collection instead and use a field to filter, i.e. chrom1, chrom2, etc.
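A sketch of that alternative model (the field name source is an assumption): all intervals live in one collection, and a single indexed field records which original file each came from, so selecting one file's intervals becomes a plain indexed $match:

```python
# One collection holds every interval; 'source' says which BED file it
# came from, so a $match on ('source', 'chrom') can use a compound index.
docs = [
    {'source': 'bed_1', 'chrom': 'chr1', 'start': 1, 'end': 10},
    {'source': 'bed_2', 'chrom': 'chr1', 'start': 5, 'end': 8},
    {'source': 'bed_3', 'chrom': 'chr1', 'start': 7, 'end': 14},
]
# Server-side this would be: {'$match': {'source': 'bed_2', 'chrom': 'chr1'}}
bed_2_docs = [d for d in docs
              if d['source'] == 'bed_2' and d['chrom'] == 'chr1']
print(len(bed_2_docs))  # 1
```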

Regards,
Wan.


Before investigating separate expressions in the shell, I decided to build a performance chart at the application level. The picture shows that the expression combining $eq with the two $lt comparisons is dramatically slower than the other expressions. Could this be a MongoDB bug?

Debug via explain()

Query by intervals:

db.bed_1.bed.explain().aggregate([{'$lookup': {'from': 'bed_2.bed', 'let': {'chrom': '$chrom', 'start': '$start', 'end': '$end'}, 'pipeline': [{'$match': {'$expr': {'$and': [{'$eq': ['$$chrom', '$chrom']}, {'$lt': ['$$start', '$end']}, {'$lt': ['$start', '$$end']}]}}}], 'as': 'bed_2_bed'}}])

{
	"stages" : [
		{
			"$cursor" : {
				"queryPlanner" : {
					"plannerVersion" : 1,
					"namespace" : "int_sub_big_BED.bed_1.bed",
					"indexFilterSet" : false,
					"parsedQuery" : {
						
					},
					"queryHash" : "8B3D4AB8",
					"planCacheKey" : "8B3D4AB8",
					"winningPlan" : {
						"stage" : "COLLSCAN",
						"direction" : "forward"
					},
					"rejectedPlans" : [ ]
				}
			}
		},
		{
			"$lookup" : {
				"from" : "bed_2.bed",
				"as" : "bed_2_bed",
				"let" : {
					"chrom" : "$chrom",
					"start" : "$start",
					"end" : "$end"
				},
				"pipeline" : [
					{
						"$match" : {
							"$expr" : {
								"$and" : [
									{
										"$eq" : [
											"$$chrom",
											"$chrom"
										]
									},
									{
										"$lt" : [
											"$$start",
											"$end"
										]
									},
									{
										"$lt" : [
											"$start",
											"$$end"
										]
									}
								]
							}
						}
					}
				]
			}
		}
	],
	"serverInfo" : {
		"host" : "platon-VivoBook-ASUSLaptop-X712FA-X712FA",
		"port" : 27017,
		"version" : "4.4.2",
		"gitVersion" : "15e73dc5738d2278b688f8929aee605fe4279b0e"
	},
	"ok" : 1
}

One field query:

db.bed_1.bed.explain().aggregate([{'$lookup': {'from': 'bed_2.bed', 'localField': 'name', 'foreignField': 'name', 'as': 'bed_2_bed'}}])

{
	"stages" : [
		{
			"$cursor" : {
				"queryPlanner" : {
					"plannerVersion" : 1,
					"namespace" : "int_sub_big_BED.bed_1.bed",
					"indexFilterSet" : false,
					"parsedQuery" : {
						
					},
					"queryHash" : "8B3D4AB8",
					"planCacheKey" : "8B3D4AB8",
					"winningPlan" : {
						"stage" : "COLLSCAN",
						"direction" : "forward"
					},
					"rejectedPlans" : [ ]
				}
			}
		},
		{
			"$lookup" : {
				"from" : "bed_2.bed",
				"as" : "bed_2_bed",
				"localField" : "name",
				"foreignField" : "name"
			}
		}
	],
	"serverInfo" : {
		"host" : "platon-VivoBook-ASUSLaptop-X712FA-X712FA",
		"port" : 27017,
		"version" : "4.4.2",
		"gitVersion" : "15e73dc5738d2278b688f8929aee605fe4279b0e"
	},
	"ok" : 1
}

As I can see, both explain outputs show COLLSCAN. But Compass shows that the indexes of the bed_2.bed collection are used: for the single-field join on name, Compass reports hundreds of thousands of uses of the name_1 index, while for the interval join there are only a few uses of the chrom_1_start_1_end_1 index.
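Index-usage counters like the ones Compass shows can also be read directly with the $indexStats aggregation stage. A sketch of such a pipeline; it needs a live connection (db['bed_2.bed'].aggregate(index_stats_pipeline)):

```python
# Pipeline returning per-index usage counters for a collection;
# on a live connection: db['bed_2.bed'].aggregate(index_stats_pipeline)
index_stats_pipeline = [
    {'$indexStats': {}},
    {'$project': {'name': 1, 'accesses.ops': 1}},
]
```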

Will the terrible speed of interval queries be considered a MongoDB bug?