# Optimal-Time Text Indexing in BWT-runs Bounded Space

**Authors:** Travis Gagie, Gonzalo Navarro, Nicola Prezza

arXiv: 1705.10382 · 2017-07-13

## TL;DR

This paper presents a text index for highly repetitive texts that counts and locates pattern occurrences in space bounded by r, the number of runs in the text's Burrows-Wheeler Transform (BWT), reaching optimal O(m + occ) query time within O(r log(n/r)) space.

## Contribution

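It extends the Run-Length FM-index, which could count occurrences in O(r) space but could not locate them efficiently within space bounded in terms of r. The extended index locates each of the occ occurrences in loglogarithmic time within O(r) space, and reaches optimal O(m + occ) locating time within O(r log(n/r)) space.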

## Key findings

- Supports locating pattern occurrences in O(r) space and loglogarithmic time per occurrence.
- Achieves optimal pattern counting in O(m) time within O(r log(n/r)) space.
- Provides a structure of O(r log(n/r)) space that replaces the text and extracts any substring of length ℓ in almost-optimal O(log(n/r) + ℓ log(σ)/w) time.
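
Counting in FM-index-style structures works by backward search over the BWT: the pattern is processed right to left while a range of suffix-array rows is narrowed. The sketch below illustrates only that idea. It assumes a plain, uncompressed BWT built by naive suffix sorting and a rank operation done by scanning, so it matches neither the space nor the time bounds listed above. The function names and the `$` terminator are assumptions made for illustration, not taken from the paper.

```python
from collections import Counter


def build_bwt(text):
    """Return the BWT of text + '$' via naive suffix sorting (illustration only).

    Assumes '$' does not occur in text and sorts before every text symbol.
    """
    s = text + "$"
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT symbol of a suffix is the symbol preceding it (cyclically).
    return "".join(s[i - 1] for i in suffix_array)


def count_occurrences(bwt, pattern):
    """Count occurrences of pattern using backward search on a plain BWT."""
    # first_row[c] = number of symbols in the text strictly smaller than c,
    # i.e. the first suffix-array row whose suffix starts with c.
    freq = Counter(bwt)
    first_row, total = {}, 0
    for c in sorted(freq):
        first_row[c] = total
        total += freq[c]

    def rank(c, i):
        # Occurrences of c in bwt[0:i]; a real index answers this without scanning.
        return bwt[:i].count(c)

    lo, hi = 0, len(bwt)  # half-open range of rows matching the processed suffix
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        lo = first_row[c] + rank(c, lo)
        hi = first_row[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    bwt = build_bwt("banana")
    print(bwt, count_occurrences(bwt, "ana"))
```

Each pattern symbol costs two rank queries. A run-length index answers those queries over the r runs of the BWT rather than over all n symbols, which is what ties the space to r.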

## Abstract

Indexing highly repetitive texts --- such as genomic databases, software repositories and versioned text collections --- has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is $r$, the number of runs in their Burrows-Wheeler Transform (BWT). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used $O(r)$ space and was able to efficiently count the number of occurrences of a pattern of length $m$ in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of $r$. Since then, a number of other indexes with space bounded by other measures of repetitiveness --- the number of phrases in the Lempel-Ziv parse, the size of the smallest grammar generating the text, the size of the smallest automaton recognizing the text factors --- have been proposed for efficiently locating, but not directly counting, the occurrences of a pattern. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the $occ$ occurrences efficiently within $O(r)$ space (in loglogarithmic time each), and reaching optimal time $O(m+occ)$ within $O(r\log(n/r))$ space, on a RAM machine of $w=\Omega(\log n)$ bits. Within $O(r\log (n/r))$ space, our index can also count in optimal time $O(m)$. Raising the space to $O(r w\log_\sigma(n/r))$, we support count and locate in $O(m\log(\sigma)/w)$ and $O(m\log(\sigma)/w+occ)$ time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using $O(r\log(n/r))$ space that replaces the text and extracts any text substring of length $\ell$ in almost-optimal time $O(\log(n/r)+\ell\log(\sigma)/w)$. (...continues...)
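
To make the measure $r$ concrete, the following minimal sketch computes the BWT of a short string and counts its runs of equal symbols. It assumes a naive suffix sort and a `$` terminator that sorts before all text symbols. The helper names are hypothetical, and the construction is far too slow for real collections.

```python
from itertools import groupby


def bwt(text):
    """BWT of text + '$' by sorting all suffixes (naive, for illustration)."""
    s = text + "$"  # assumes '$' is absent from text and smaller than its symbols
    order = sorted(range(len(s)), key=lambda i: s[i:])
    return "".join(s[i - 1] for i in order)


def count_runs(transformed):
    """Number of maximal runs of equal symbols, i.e. r for a BWT string."""
    return sum(1 for _ in groupby(transformed))


if __name__ == "__main__":
    for text in ("banana", "abc" * 50):
        transformed = bwt(text)
        print(len(transformed), count_runs(transformed))
```

On a highly repetitive input such as `"abc" * 50`, the run count stays very small while the length grows, which is the gap that space bounds stated in terms of $r$ exploit.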

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1705.10382/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/1705.10382/full.md

## References

98 references — full list in the complete paper: https://tomesphere.com/paper/1705.10382/full.md

---
Source: https://tomesphere.com/paper/1705.10382