Fast and Small Subsampled R-indexes

Dustin Cobas; Travis Gagie; Gonzalo Navarro

arXiv:2409.14654·cs.DS·September 24, 2024

TL;DR

This paper introduces the $sr$-index, a space-efficient variant of the $r$-index that performs well on repetitive texts, and extends it to the $r$-csa and $sr$-csa, improving pattern matching efficiency and space usage.

Contribution

It proposes the $sr$-index, reducing space compared to the $r$-index, and extends the approach to CSA-based indexes, enhancing performance on repetitive texts.

Findings

01

$sr$-index uses 1.5-4 times less space than $r$-index on real data.

02

$sr$-index retains $r$-index performance while being more space-efficient.

03

$sr$-csa outperforms $sr$-index on texts with larger alphabets and on DNA texts.

Abstract

The $r$-index represented a breakthrough in compressed indexing of repetitive text collections, outperforming its alternatives by orders of magnitude in query time. Its space usage, $O(r)$ where $r$ is the number of runs in the Burrows-Wheeler Transform of the text, is however higher than that of Lempel-Ziv (LZ) and grammar-based indexes, which makes it uninteresting in various real-life scenarios of milder repetitiveness. We introduce the $sr$-index, a variant that limits the space to $O(\min(r, n/s))$ for a text of length $n$ and a given parameter $s$, at the expense of multiplying by $s$ the time per occurrence reported. The $sr$-index is obtained by subsampling the text positions indexed by the $r$-index, while still supporting pattern matching with guaranteed performance. Our experiments show that the theoretical analysis falls short in describing the practical advantages of the…
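To make the quantities in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It computes the Burrows-Wheeler Transform of a short text by naively sorting rotations, counts its runs $r$, and evaluates the term inside the $O(\min(r, n/s))$ bound. The function names `bwt`, `count_runs`, and `sample_budget`, the `$` sentinel, and the rounding of $n/s$ are assumptions made for illustration; the bound itself hides constant factors that this sketch ignores.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT via sorted rotations (illustration only, not for large texts)."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    t = text + sentinel
    # Sort all cyclic rotations and take the last character of each.
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal characters in s (the value r for a BWT)."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


def sample_budget(r: int, n: int, s: int) -> int:
    """Hypothetical helper: the term min(r, n/s) from the space bound, with n/s rounded up."""
    if s <= 0:
        raise ValueError("s must be positive")
    return min(r, -(-n // s))  # -(-n // s) is ceil(n / s) for positive s


if __name__ == "__main__":
    text = "banana"
    transformed = bwt(text)
    r = count_runs(transformed)
    n = len(transformed)  # length including the sentinel
    for s in (1, 2, 4):
        print(f"s={s}: r={r}, n={n}, min(r, n/s)={sample_budget(r, n, s)}")
```

On highly repetitive inputs $r$ is much smaller than $n$, so the $r$ term dominates the minimum; on less repetitive inputs the $n/s$ term caps the space, which is the regime the subsampling parameter $s$ is meant to address.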

Taxonomy

Topics: Bayesian Modeling and Causal Inference · Data Mining Algorithms and Applications · Data Management and Algorithms