Using an index to intersect intervals

Purpose.

Select documents from collection c1, each of which overlaps at least one c2 document and at least one c3 document, matched on the chrom field and overlapping by the start-end intervals.

Collections.

c1:

chrom start end
chr1 1 10
chr1 10 20
chr1 20 30
chr1 30 40
chr1 40 50
(and then a lot of similar documents)

c2:

chrom start end
chr1 5 8
chr1 5 16
chr1 25 32
chr1 50 64
chr1 100 128
(and then a lot of similar documents)

c3:

chrom start end
chr1 7 14
chr1 28 49
chr2 1 7
chr2 14 21
chr2 70 77
(and then a lot of similar documents)
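For reference, two intervals overlap when the later of the two starts precedes the earlier of the two ends. A minimal Python sketch checked against the sample rows above (the helper name is mine):

def overlaps(a_start, a_end, b_start, b_end):
    # Half-open intervals [start, end) overlap iff the later start
    # precedes the earlier end.
    return max(a_start, b_start) < min(a_end, b_end)

assert overlaps(1, 10, 5, 8)        # c1 (1, 10) overlaps c2 (5, 8)
assert not overlaps(1, 10, 25, 32)  # c1 (1, 10) does not overlap c2 (25, 32)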

Indexes.

chrom_1_start_1_end_1
chrom_1
start_1
end_1

Aggregation pipeline.

pipeline = [{'$lookup': {'from': right_coll_name,
                         'let': {'chrom': '$chrom', 'start': '$start', 'end': '$end'},
                         'pipeline': [{'$match': {'$expr': {'$and': [{'$eq': ['$$chrom', '$chrom']},
                                                                     {'$lt': [{'$max': ['$$start', '$start']},
                                                                              {'$min': ['$$end', '$end']}]}]}}}],
                         'as': right_coll_name}} for right_coll_name in right_coll_names]

Technically, this code is fully functional.
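For reference, the pipeline is executed roughly like this (a sketch; the database name is taken from the explain output further below, and the right collection names are assumptions):

from pymongo import MongoClient

client = MongoClient()  # assumed default local connection
db = client['int_sub_big_BED']  # database name as seen in the explain output
left_coll_obj = db['bed_1.bed']
right_coll_names = ['bed_2.bed', 'bed_3.bed']  # hypothetical right-hand collections

# `pipeline` is built as shown above.
for doc in left_coll_obj.aggregate(pipeline):
    pass  # the joined documents are filtered in the next step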

Filtering the joined documents.

Afterwards, I successfully filter the merged documents by doc[right_coll_alias] != []. The full Python code is here.
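In outline, that filtering step looks like this (a sketch; the aliases are the dot-replaced right collection names, matching the as fields):

# Keep only the left documents that overlap at least one document
# in every right collection.
right_coll_aliases = [right_coll_name.replace('.', '_')
                      for right_coll_name in right_coll_names]
for doc in left_coll_obj.aggregate(pipeline):
    if all(doc[right_coll_alias] != [] for right_coll_alias in right_coll_aliases):
        print(doc)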

Problem.

Judging by the terrible speed, left_coll_obj.aggregate(pipeline) does not use the indexes.

Question.

How do I rework the pipeline so that it uses the compound or single-field indexes?

Getting rid of $min/$max does not help either: the indexes are still ignored. The overlap condition rewrites equivalently as $$start < $end AND $start < $$end. Here is the updated pipeline:

pipeline = [{'$lookup': {'from': right_coll_name,
                         'let': {'chrom': '$chrom', 'start': '$start', 'end': '$end'},
                         'pipeline': [{'$match': {'$expr': {'$and': [{'$eq': ['$$chrom', '$chrom']},
                                                                     {'$lt': ['$$start', '$end']},
                                                                     {'$lt': ['$start', '$$end']}]}}}],
                         'as': right_coll_name.replace('.', '_')}} for right_coll_name in right_coll_names]

The issue persists with MongoDB 4.4.1 + PyMongo 3.11.0. I would ask the developers to investigate this problem.

Hi @Platon_workaccount,

First, it would be easier to debug performance if you separate the layers (application/database). If you haven't done so already, I'd suggest running a single aggregation pipeline directly (using the mongo shell) instead of from your application, to see whether the bottleneck is at the application layer.

Note that you can view detailed information regarding an execution plan of an aggregation pipeline by using explain(). See also Return Information on Aggregation Pipeline Operation. This method is useful for debugging, as it details the processing (also shows which index, if any, the aggregation pipeline operation used).

If you execute explain() on your current pipeline, most likely it would show COLLSCAN. This is because you haven’t specified any query stages before the $lookup stage. If an entire collection is being loaded without any filtering criteria, a collection scan would have less overhead than iterating an index.

For example, you could try adding {"$match": {"chrom": "chr1"}} before the $lookup stage; the explain output should then show that it utilises an IXSCAN. A PyMongo sketch follows.
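In PyMongo terms, that could look like this (a sketch; the chrom value is illustrative, and the rest is the interval $lookup from above):

pipeline = ([{'$match': {'chrom': 'chr1'}}]  # filter first so the planner can use an index on chrom
            + [{'$lookup': {'from': right_coll_name,
                            'let': {'chrom': '$chrom', 'start': '$start', 'end': '$end'},
                            'pipeline': [{'$match': {'$expr': {'$and': [{'$eq': ['$$chrom', '$chrom']},
                                                                        {'$lt': ['$$start', '$end']},
                                                                        {'$lt': ['$start', '$$end']}]}}}],
                            'as': right_coll_name.replace('.', '_')}} for right_coll_name in right_coll_names])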

Having said all that, depending on your use case, it looks like you're going to be performing multiple expressive $lookup stages. In this case, I would suggest reconsidering your data modelling. For example, you may try storing all the data in one collection instead and using a field to filter on, i.e. chrom1, chrom2, etc.
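One possible reading of that suggestion, as a sketch (the intervals collection name and the source field are hypothetical):

# Hypothetical single-collection layout: a source field records which
# BED file each interval came from, so one compound index covers all queries.
db['intervals'].insert_many([
    {'source': 'bed_1', 'chrom': 'chr1', 'start': 1, 'end': 10},
    {'source': 'bed_2', 'chrom': 'chr1', 'start': 5, 'end': 8},
    {'source': 'bed_3', 'chrom': 'chr1', 'start': 7, 'end': 14},
])
db['intervals'].create_index([('source', 1), ('chrom', 1), ('start', 1), ('end', 1)])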

Regards,
Wan.


Before investigating the individual expressions in the shell, I decided to benchmark at the application level. The picture shows that the expression combining $eq, $lt and $lt differs dramatically in speed from the other expressions. Could this be a MongoDB bug?
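The timings were collected along these lines (a minimal sketch of the application-level measurement, assuming the pipelines from above):

import time

t0 = time.perf_counter()
list(left_coll_obj.aggregate(pipeline))  # drain the cursor so the full execution is measured
print(f'{time.perf_counter() - t0:.2f} s')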

Debug via explain()

Query by intervals:

db.bed_1.bed.explain().aggregate([{'$lookup': {'from': 'bed_2.bed', 'let': {'chrom': '$chrom', 'start': '$start', 'end': '$end'}, 'pipeline': [{'$match': {'$expr': {'$and': [{'$eq': ['$$chrom', '$chrom']}, {'$lt': ['$$start', '$end']}, {'$lt': ['$start', '$$end']}]}}}], 'as': 'bed_2_bed'}}])

{
	"stages" : [
		{
			"$cursor" : {
				"queryPlanner" : {
					"plannerVersion" : 1,
					"namespace" : "int_sub_big_BED.bed_1.bed",
					"indexFilterSet" : false,
					"parsedQuery" : {
						
					},
					"queryHash" : "8B3D4AB8",
					"planCacheKey" : "8B3D4AB8",
					"winningPlan" : {
						"stage" : "COLLSCAN",
						"direction" : "forward"
					},
					"rejectedPlans" : [ ]
				}
			}
		},
		{
			"$lookup" : {
				"from" : "bed_2.bed",
				"as" : "bed_2_bed",
				"let" : {
					"chrom" : "$chrom",
					"start" : "$start",
					"end" : "$end"
				},
				"pipeline" : [
					{
						"$match" : {
							"$expr" : {
								"$and" : [
									{
										"$eq" : [
											"$$chrom",
											"$chrom"
										]
									},
									{
										"$lt" : [
											"$$start",
											"$end"
										]
									},
									{
										"$lt" : [
											"$start",
											"$$end"
										]
									}
								]
							}
						}
					}
				]
			}
		}
	],
	"serverInfo" : {
		"host" : "platon-VivoBook-ASUSLaptop-X712FA-X712FA",
		"port" : 27017,
		"version" : "4.4.2",
		"gitVersion" : "15e73dc5738d2278b688f8929aee605fe4279b0e"
	},
	"ok" : 1
}

Single-field query:

db.bed_1.bed.explain().aggregate([{'$lookup': {'from': 'bed_2.bed', 'localField': 'name', 'foreignField': 'name', 'as': 'bed_2_bed'}}])

{
	"stages" : [
		{
			"$cursor" : {
				"queryPlanner" : {
					"plannerVersion" : 1,
					"namespace" : "int_sub_big_BED.bed_1.bed",
					"indexFilterSet" : false,
					"parsedQuery" : {
						
					},
					"queryHash" : "8B3D4AB8",
					"planCacheKey" : "8B3D4AB8",
					"winningPlan" : {
						"stage" : "COLLSCAN",
						"direction" : "forward"
					},
					"rejectedPlans" : [ ]
				}
			}
		},
		{
			"$lookup" : {
				"from" : "bed_2.bed",
				"as" : "bed_2_bed",
				"localField" : "name",
				"foreignField" : "name"
			}
		}
	],
	"serverInfo" : {
		"host" : "platon-VivoBook-ASUSLaptop-X712FA-X712FA",
		"port" : 27017,
		"version" : "4.4.2",
		"gitVersion" : "15e73dc5738d2278b688f8929aee605fe4279b0e"
	},
	"ok" : 1
}

As I can see, in both cases COLLSCAN is reported. But Compass shows that the indexes of the bed_2.bed collection are being used. For the join on the single field name, Compass shows hundreds of thousands of uses of the name_1 index; for the intersection by intervals, there are only a few uses of the chrom_1_start_1_end_1 index.

Will the terrible speed of interval queries be considered a MongoDB bug?

Looks like this is fixed in MongoDB 5.0:

Starting in MongoDB 5.0, the $eq, $lt, $lte, $gt, and $gte operators placed in an $expr operator can use indexes to improve performance.
