Tailoring r-index for metagenomics

Dustin Cobas; Veli Mäkinen; Massimiliano Rossi

arXiv:2006.05871·cs.DS·June 11, 2020·1 cites

Open Access

TL;DR

This paper introduces three space-efficient methods leveraging the r-index for metagenomic read assignment, improving speed and memory usage in pseudoalignment tasks involving highly-repetitive genomic data.

Contribution

It proposes novel solutions based on grammar compression and interleaved LCP arrays for document listing in metagenomics, optimized for the r-index structure.
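For orientation, document listing asks which documents in a collection contain a given pattern; the frequency variant named in the abstract also reports how often the pattern occurs in each one. The sketch below is a naive linear scan that only pins down the expected output of such a query. It is not the paper's method, which relies on the r-index, grammar compression, and interleaved LCP arrays; the function name and the toy collection are hypothetical.

```python
def list_documents_with_frequencies(documents, pattern):
    """Return {document id: number of occurrences of pattern}.

    Naive baseline by direct scanning (an illustration of the query,
    not a compressed-index solution). Documents that do not contain
    the pattern are omitted from the result.
    """
    result = {}
    for doc_id, text in enumerate(documents):
        count = 0
        start = text.find(pattern)
        while start != -1:
            count += 1
            # Advance by one position so overlapping occurrences are counted.
            start = text.find(pattern, start + 1)
        if count:
            result[doc_id] = count
    return result


# Hypothetical toy collection of three "documents".
docs = ["ACGTACGT", "TTTT", "ACGA"]
print(list_documents_with_frequencies(docs, "ACG"))
```

A scan like this takes time proportional to the total collection size per query, which is the cost that compressed indexes for highly repetitive collections are designed to avoid.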

Findings

1. All methods are fast on highly-repetitive data.

2. Index size overhead is comparable to the r-index.

3. Solutions outperform traditional approaches in specific scenarios.

Abstract

A basic problem in metagenomics is to assign a sequenced read to the correct species in the reference collection. In typical applications in genomic epidemiology and viral metagenomics, the reference collection consists of a set of species, with each species represented by its highly similar strains. It has recently been shown that accurate read assignment can be achieved with $k$-mer hashing-based pseudoalignment: a read is assigned to species A if each of its $k$-mer hits to the reference collection is located only on strains of A. We study the underlying primitives required in pseudoalignment and related tasks. We propose three space-efficient solutions building upon the document listing with frequencies problem. All the solutions use an $r$-index (Gagie et al., SODA 2018) as an underlying index structure for the text obtained as concatenation of the set of species, as well as for each…
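The assignment rule stated in the abstract can be sketched with a plain hash table from $k$-mers to the species whose strains contain them. This is a minimal illustration of the hashing-based rule only, not of the paper's r-index solutions. The function names and the toy collection are hypothetical, and treating $k$-mers with no hit as ignorable is an assumption drawn from the phrase "each of its $k$-mer hits".

```python
def build_kmer_index(collection, k):
    """Map each k-mer to the set of species whose strains contain it.

    collection: dict mapping a species name to a list of strain sequences.
    """
    index = {}
    for species, strains in collection.items():
        for seq in strains:
            for i in range(len(seq) - k + 1):
                index.setdefault(seq[i:i + k], set()).add(species)
    return index


def pseudoalign(read, index, k):
    """Return the species the read is assigned to, or None.

    A read is assigned to a species only if every k-mer of the read that
    hits the collection occurs on strains of that single species.
    Assumption: k-mers with no hit at all are skipped.
    """
    candidate = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if hits is None:
            continue  # no hit in the collection: ignored (assumption)
        if len(hits) != 1:
            return None  # k-mer shared by several species
        (species,) = hits
        if candidate is None:
            candidate = species
        elif candidate != species:
            return None  # hits point to two different species
    return candidate


# Hypothetical toy collection: two species, the first with two similar strains.
collection = {"A": ["ACGTAC", "ACGTTC"], "B": ["GGGCCC"]}
index = build_kmer_index(collection, k=3)
print(pseudoalign("ACGTA", index, k=3))
print(pseudoalign("ACGGGC", index, k=3))
```

The hash table stores every distinct $k$-mer explicitly, so its size grows with the collection even when strains are nearly identical; that overhead is the motivation for index-based alternatives.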

Taxonomy

Topics: Algorithms and Data Compression · Gene expression and cancer classification · Genomics and Phylogenetic Studies