MongoDB.live, free & fully virtual. June 9th - 10th. Register Now

Using an index to intersect intervals

Purpose.

Select c1 collection documents, each of which overlaps with at least one c2 document and at least one c3 document by chrom field and by start-end intervals.

Collections.

c1:

chrom start end
chr1 1 10
chr1 10 20
chr1 20 30
chr1 30 40
chr1 40 50

(and then a lot of similar documents)

c2:

chrom start end
chr1 5 8
chr1 5 16
chr1 25 32
chr1 50 64
chr1 100 128

(and then a lot of similar documents)

c3:

chrom start end
chr1 7 14
chr1 28 49
chr2 1 7
chr2 14 21
chr2 70 77

(and then a lot of similar documents)

Indexes.

chrom_1_start_1_end_1
chrom_1
start_1
end_1

Aggregation pipeline.

pipeline = [{'$lookup': {'from': right_coll_name,
                         'let': {'chrom': '$chrom', 'start': '$start', 'end': '$end'},
                         'pipeline': [{'$match': {'$expr': {'$and': [{'$eq': ['$$chrom', '$chrom']},
                                                                     {'$lt': [{'$max': ['$$start', '$start']},
                                                                              {'$min': ['$$end', '$end']}]}]}}}],
                         'as': right_coll_name}} for right_coll_name in right_coll_names])

Technically, this code is fully functional.

Joint documents filtration.

Further I successfully filter the merged documents by doc[right_coll_alias] != []. Full Python code is here.

Problem.

Based on the terrible speed, left_coll_obj.aggregate(pipeline) doesn’t use indexes.

Question.

How do I rework a pipeline to use compound or single indexes?

If you get rid of min/max, the indexes will also be ignored. Here’s an updated pipeline:

pipeline = [{'$lookup': {'from': right_coll_name,
                         'let': {'chrom': '$chrom', 'start': '$start', 'end': '$end'},
                         'pipeline': [{'$match': {'$expr': {'$and': [{'$eq': ['$$chrom', '$chrom']},
                                                                     {'$lt': ['$$start', '$end']},
                                                                     {'$lt': ['$start', '$$end']}]}}}],
                         'as': right_coll_name.replace('.', '_')}} for right_coll_name in right_coll_names]