MARIA: Multiple-alignment $r$-index with aggregation

Adrián Goga; Andrej Baláž; Alessia Petescia; Travis Gagie

arXiv:2209.09218 · cs.DS · September 20, 2022 · 1 citation

TL;DR

MARIA is a compact index that efficiently identifies all alignment columns where pattern matches begin, helping to filter redundant genomic matches in large datasets.

Contribution

Introduces MARIA, a simple and space-efficient index that leverages multiple alignments to quickly locate distinct match columns in genomic datasets.

Findings

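01 Collapses many matches into the distinct multiple-alignment columns where they start

02 Given the position and length of one match, quickly lists every distinct column where matches start

03 Complements existing compact indexes for collections of thousands of genomes, which can otherwise report hundreds of matches per pattern or per MEM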

Abstract

There now exist compact indexes that can efficiently list all the occurrences of a pattern in a dataset consisting of thousands of genomes, or even all the occurrences of all the pattern's maximal exact matches (MEMs) with respect to the dataset. Unless we are lucky and the pattern is specific to only a few genomes, however, we could be swamped by hundreds of matches (or even hundreds per MEM), only to discover that most or all of the matches are to substrings that occupy the same few columns in a multiple alignment. To address this issue, in this paper we present a simple and compact index, MARIA, that stores a multiple alignment such that, given the position of one match of a pattern (or of a MEM or other substring of a pattern) and its length, we can quickly list all the distinct columns of the multiple alignment where matches start.
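To make the query concrete, here is a minimal brute-force sketch of what "listing the distinct columns where matches start" means. It is not MARIA's data structure or algorithm: it scans every row of a small gapped alignment directly, and it takes the pattern itself rather than the position and length of one match. The function name `distinct_match_columns`, the gap character `-`, and the toy alignment are all assumptions made for illustration.

```python
from typing import List


def distinct_match_columns(alignment: List[str], pattern: str, gap: str = "-") -> List[int]:
    """Return the sorted distinct alignment columns where an occurrence of
    `pattern` starts in any row (brute force, for illustration only)."""
    if not pattern:
        return []
    columns = set()
    for row in alignment:
        # Alignment column of each non-gap character, in sequence order.
        col_of = [c for c, ch in enumerate(row) if ch != gap]
        sequence = "".join(row[c] for c in col_of)
        start = sequence.find(pattern)
        while start != -1:
            # Map the start in the ungapped sequence back to its column.
            columns.add(col_of[start])
            start = sequence.find(pattern, start + 1)
    return sorted(columns)


# Hypothetical toy alignment: three rows, ten columns, '-' marks a gap.
alignment = [
    "ACGT-ACGTA",
    "ACGTTACG-A",
    "AC-TTACGTA",
]
print(distinct_match_columns(alignment, "ACG"))  # [0, 5]
```

In this toy input the pattern occurs five times across the three rows, but those occurrences start in only two columns. That gap between the number of occurrences and the number of distinct columns is what the abstract describes; an index such as MARIA aims to report the columns without enumerating every occurrence.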

Taxonomy

Topics: Algorithms and Data Compression · Genomics and Phylogenetic Studies · Plant nutrient uptake and metabolism