MARIA: Multiple-alignment $r$-index with aggregation
Adrián Goga, Andrej Baláž, Alessia Petescia, Travis Gagie

TL;DR
MARIA is a compact index that efficiently identifies all alignment columns where pattern matches begin, helping to filter redundant genomic matches in large datasets.
Contribution
Introduces MARIA, a simple and space-efficient index that leverages multiple alignments to quickly locate distinct match columns in genomic datasets.
Findings
Collapses the reported matches to the distinct alignment columns where they start
Enables fast retrieval of match positions in large genomic collections
Targets a weakness of existing multi-genome indexes, which can return hundreds of matches that fall in the same few alignment columns
Abstract
There now exist compact indexes that can efficiently list all the occurrences of a pattern in a dataset consisting of thousands of genomes, or even all the occurrences of all the pattern's maximal exact matches (MEMs) with respect to the dataset. Unless we are lucky and the pattern is specific to only a few genomes, however, we could be swamped by hundreds of matches -- or even hundreds per MEM -- only to discover that most or all of the matches are to substrings that occupy the same few columns in a multiple alignment. To address this issue, in this paper we present a simple and compact data index MARIA that stores a multiple alignment such that, given the position of one match of a pattern (or a MEM or other substring of a pattern) and its length, we can quickly list all the distinct columns of the multiple alignment where matches start.
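To make the query in the abstract concrete, the following is a minimal brute-force sketch of its input and output only. It is not the MARIA data structure, which the abstract does not describe, and it makes no attempt to be compact or fast. The function names, the gap character `-`, and the toy alignment are assumptions made for illustration.

```python
def start_columns(alignment, pattern, gap="-"):
    """Return the sorted distinct alignment columns where `pattern` starts.

    `alignment` is a list of equal-length rows; `gap` marks gap cells.
    This is a naive scan of every row, used only to illustrate the query.
    """
    if not pattern:
        raise ValueError("pattern must be non-empty")
    columns = set()
    for row in alignment:
        # col_of[i] is the alignment column of the i-th ungapped character.
        col_of = [c for c, ch in enumerate(row) if ch != gap]
        seq = "".join(row[c] for c in col_of)
        # Find all occurrences, including overlapping ones.
        i = seq.find(pattern)
        while i != -1:
            columns.add(col_of[i])
            i = seq.find(pattern, i + 1)
    return sorted(columns)


def start_columns_from_match(alignment, row_index, offset, length, gap="-"):
    """Same query, phrased as in the abstract: one known match plus its length.

    `offset` is the match position in the ungapped sequence of row `row_index`.
    """
    seq = alignment[row_index].replace(gap, "")
    return start_columns(alignment, seq[offset:offset + length], gap)


# Hypothetical toy alignment: three sequences, nine columns.
toy_alignment = [
    "ACGT-ACGT",
    "ACGTTACGT",
    "AC-T-ACGT",
]

columns = start_columns(toy_alignment, "ACGT")
same_columns = start_columns_from_match(toy_alignment, 0, 4, 4)
```

In this toy alignment the pattern `ACGT` occurs five times across the three rows, but those occurrences start in only two distinct columns, which is the kind of reduction the abstract describes.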
Taxonomy
Topics: Algorithms and Data Compression · Genomics and Phylogenetic Studies · Plant nutrient uptake and metabolism
