A Fast and Small Subsampled R-index

Dustin Cobas; Travis Gagie; Gonzalo Navarro

arXiv:2103.15329 · cs.DS · March 30, 2021

TL;DR

The paper introduces the sr-index, a space-efficient variant of the r-index for repetitive texts that maintains fast pattern matching while reducing space usage, outperforming most existing compressed indexes.

Contribution

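It proposes the sr-index, a subsampled variant of the r-index that limits space to $O(\min(r, n/s))$ for a text of length $n$ and a given parameter $s$, at the cost of multiplying the time per reported occurrence by $s$. The authors prove that pattern matching keeps guaranteed performance under this subsampling and validate the trade-off experimentally.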

Findings

01

The sr-index uses 1.5-3.0 times less space than the r-index.

02

It outperforms most compressed indexes in time and space on repetitive texts.

03

Lempel-Ziv indexes achieve better compression but are significantly slower.

Abstract

The $r$-index (Gagie et al., JACM 2020) represented a breakthrough in compressed indexing of repetitive text collections, outperforming its alternatives by orders of magnitude. Its space usage, $O(r)$ where $r$ is the number of runs in the Burrows-Wheeler Transform of the text, is however larger than Lempel-Ziv and grammar-based indexes, and makes it uninteresting in various real-life scenarios of milder repetitiveness. In this paper we introduce the $sr$-index, a variant that limits the space to $O(\min(r, n/s))$ for a text of length $n$ and a given parameter $s$, at the expense of multiplying by $s$ the time per occurrence reported. The $sr$-index is obtained by carefully subsampling the text positions indexed by the $r$-index, in a way that we prove is still able to support pattern matching with guaranteed performance. Our experiments demonstrate that the…
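
To make the quantities in the abstract concrete, the following minimal sketch computes $r$ (the number of runs in the Burrows-Wheeler Transform) for a small text and evaluates the term $\min(r, n/s)$ that appears inside the space bound. It is not the authors' implementation: the function names `bwt`, `count_runs`, and `sr_bound_term` are hypothetical, the BWT is built naively from sorted rotations, the NUL sentinel is an assumption of this sketch, and constants hidden by the $O(\cdot)$ notation are ignored.

```python
def bwt(text: str, sentinel: str = "\0") -> str:
    """Naive Burrows-Wheeler Transform via sorted rotations.

    For illustration only: quadratic space, unsuitable for real inputs.
    The sentinel (an assumption of this sketch) must not occur in the
    text and must sort before every other character.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    t = text + sentinel
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal characters in s (the value r)."""
    if not s:
        return 0
    return 1 + sum(1 for a, b in zip(s, s[1:]) if a != b)


def sr_bound_term(r: int, n: int, s: int) -> float:
    """The term min(r, n/s) inside the O(.) space bound from the abstract."""
    if s <= 0:
        raise ValueError("s must be positive")
    return min(r, n / s)


if __name__ == "__main__":
    text = "banana"
    transformed = bwt(text)
    n = len(transformed)          # text length including the sentinel
    r = count_runs(transformed)
    for s in (1, 2, 4):
        print(f"s={s}: n={n}, r={r}, min(r, n/s)={sr_bound_term(r, n, s)}")
```

The sketch only illustrates how the parameter $s$ caps the bound: once $n/s$ drops below $r$, the subsampling term takes over, while the abstract states that the time per reported occurrence grows by a factor of $s$.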
