Tailoring r-index for metagenomics
Dustin Cobas, Veli Mäkinen, Massimiliano Rossi

TL;DR
This paper introduces three space-efficient methods leveraging the r-index for metagenomic read assignment, improving speed and memory usage in pseudoalignment tasks involving highly-repetitive genomic data.
Contribution
It proposes novel solutions based on grammar compression and interleaved LCP arrays for document listing in metagenomics, optimized for the r-index structure.
Findings
All methods are fast on highly-repetitive data.
Index size overhead is comparable to the r-index.
Solutions outperform traditional approaches in specific scenarios.
Abstract
A basic problem in metagenomics is to assign a sequenced read to the correct species in the reference collection. In typical applications in genomic epidemiology and viral metagenomics, the reference collection consists of a set of species, with each species represented by its highly similar strains. It has recently been shown that accurate read assignment can be achieved with k-mer hashing-based pseudoalignment: a read is assigned to species A if each of its k-mer hits to the reference collection is located only on strains of A. We study the underlying primitives required in pseudoalignment and related tasks. We propose three space-efficient solutions building upon the document listing with frequencies problem. All the solutions use an r-index (Gagie et al., SODA 2018) as an underlying index structure for the text obtained as concatenation of the set of species, as well as for each…
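The assignment rule stated in the abstract can be illustrated with a minimal sketch. It uses a plain hash table from k-mers to species, not the r-index or the document-listing structures the paper proposes, and the names `build_kmer_index` and `assign_read` are hypothetical. It also assumes that k-mers with no hit in the collection are simply skipped, which the abstract does not specify.

```python
from collections import defaultdict


def build_kmer_index(references, k):
    """Map each k-mer to the set of species whose strains contain it.

    `references` is assumed to be a dict: species name -> list of strain strings.
    """
    index = defaultdict(set)
    for species, strains in references.items():
        for strain in strains:
            for i in range(len(strain) - k + 1):
                index[strain[i:i + k]].add(species)
    return index


def assign_read(read, index, k):
    """Return species A if every k-mer hit of the read lies only on strains of A.

    Returns None when hits touch more than one species or when there are no hits.
    """
    candidate = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if not hits:
            continue  # assumption: k-mers without any hit are ignored
        if len(hits) > 1:
            return None  # this k-mer occurs in several species
        (species,) = hits
        if candidate is None:
            candidate = species
        elif candidate != species:
            return None  # different k-mers point to different species
    return candidate


# Toy collection: two strains of species "A", one strain of species "B".
references = {
    "A": ["ACGTACGTTG", "ACGTACGATG"],
    "B": ["TTTTGGGGCC"],
}
index = build_kmer_index(references, k=4)
```

With this toy collection, `assign_read("ACGTACG", index, 4)` yields `"A"`, while a read whose k-mers hit both species is left unassigned. A hash table of this kind stores every distinct k-mer explicitly; the space-efficient solutions described in the abstract instead build on a compressed index over the concatenated collection.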
Taxonomy
Topics: Algorithms and Data Compression · Gene expression and cancer classification · Genomics and Phylogenetic Studies
