Fast and Small Subsampled R-indexes
Dustin Cobas, Travis Gagie, Gonzalo Navarro

TL;DR
This paper introduces the $sr$-index, a space-efficient variant of the $r$-index that performs well on repetitive texts, and extends it to the $r$-csa and $sr$-csa, improving pattern matching efficiency and space usage.
Contribution
It proposes the $sr$-index, reducing space compared to the $r$-index, and extends the approach to CSA-based indexes, enhancing performance on repetitive texts.
Findings
$sr$-index uses 1.5-4 times less space than $r$-index on real data.
$sr$-index retains $r$-index performance while being more space-efficient.
$sr$-csa outperforms $sr$-index on texts with larger alphabets and on DNA texts.
Abstract
The $r$-index represented a breakthrough in compressed indexing of repetitive text collections, outperforming its alternatives by orders of magnitude in query time. Its $O(r)$ space usage, where $r$ is the number of runs in the Burrows--Wheeler Transform of the text, is however higher than that of Lempel--Ziv (LZ) and grammar-based indexes, which makes it uninteresting in various real-life scenarios of milder repetitiveness. We introduce the $sr$-index, a variant that limits the space to $O(\min(r, n/s))$ for a text of length $n$ and a given parameter $s$, at the expense of multiplying by $s$ the time per occurrence reported. The $sr$-index is obtained by subsampling the text positions indexed by the $r$-index, while still supporting pattern matching with guaranteed performance. Our experiments show that the theoretical analysis falls short in describing the practical advantages of the…
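To make the quantity $r$ concrete, the following minimal Python sketch builds the Burrows--Wheeler Transform of a short string with a naive sorted-rotations construction and counts its runs. It is an illustration only, not the construction used by the authors: the function names `bwt`, `count_runs`, and `space_bound` are hypothetical, and `space_bound` merely evaluates the asymptotic expression $\min(r, n/s)$ with constants dropped, not an actual index size.

```python
from itertools import groupby


def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT via sorted rotations (quadratic space; illustration only)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    t = text + sentinel
    # Sort all cyclic rotations and take the last character of each.
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal characters, i.e. r when s is a BWT."""
    return sum(1 for _ in groupby(s))


def space_bound(r: int, n: int, s: int) -> int:
    """Evaluate min(r, n/s), the asymptotic bound with constants omitted."""
    return min(r, n // s)


if __name__ == "__main__":
    b = bwt("banana")
    print(b, count_runs(b))            # annb$aa 5
    rep = bwt("ab" * 50)               # a highly repetitive text
    print(len(rep), count_runs(rep))   # n is large, r stays small
```

On highly repetitive inputs the run count stays far below the text length, which is the regime where an $O(r)$-space index pays off. When $r$ is closer to $n$, the $n/s$ term caps the bound, and the parameter $s$ trades that saving against time per reported occurrence.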
Taxonomy
Topics: Bayesian Modeling and Causal Inference · Data Mining Algorithms and Applications · Data Management and Algorithms
