Optimal Construction of Compressed Indexes for Highly Repetitive Texts

Dominik Kempa

arXiv:1712.04886·cs.DS·December 9, 2020

TL;DR

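This paper gives algorithms that construct three key components of compressed text indexes (the BWT, the PLCP array, and the LZ77 parsing) in $O(n / \log_{\sigma} n + r\,\mathrm{polylog}\,n)$ time and working space, where $r$ is the number of runs in the BWT. For highly repetitive texts, where $r$ is much smaller than $n$, this matches the time needed just to read the input.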

Contribution

It presents new construction algorithms that are both time and space optimal for highly repetitive texts. Their cost is expressed in terms of the repetitiveness measure $r$ (the number of BWT runs), which lets them outperform general-purpose algorithms on such inputs.

Findings

01

Algorithms run in $O(n / \log_{\sigma} n + r\,\mathrm{polylog}\,n)$ time and working space

02

Improves on previous $O(n)$-time algorithms for highly repetitive inputs, where $r$ is much smaller than $n$

03

Applicable to various string processing problems like Lyndon factorization and run-length compressed suffix arrays

Abstract

We propose algorithms that, given the input string of length $n$ over integer alphabet of size $\sigma$, construct the Burrows-Wheeler transform (BWT), the permuted longest-common-prefix (PLCP) array, and the LZ77 parsing in $O(n / \log_{\sigma} n + r\,\mathrm{polylog}\,n)$ time and working space, where $r$ is the number of runs in the BWT of the input. These are the essential components of many compressed indexes such as compressed suffix tree, FM-index, and grammar and LZ77-based indexes, but also find numerous applications in sequence analysis and data compression. The value of $r$ is a common measure of repetitiveness that is significantly smaller than $n$ if the string is highly repetitive. Since just accessing every symbol of the string requires $\Omega(n / \log_{\sigma} n)$ time, the presented algorithms are time and space optimal for inputs satisfying the assumption $n/r\in\Omega({\rm…
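
To make the measure $r$ concrete, the following is a minimal sketch that builds the BWT by sorting all rotations and then counts its runs. It is a naive textbook construction for illustration only, not the algorithm proposed in the paper; the function names `bwt` and `count_runs` and the `$` sentinel are assumptions introduced here.

```python
from itertools import groupby


def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: sort all rotations of text + sentinel, keep the last column.

    Illustrative only; this takes far more time and space than the
    construction described in the abstract.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s: str) -> int:
    """Return r: the number of maximal blocks of equal consecutive symbols."""
    return sum(1 for _ in groupby(s))


if __name__ == "__main__":
    for text in ("banana", "ab" * 8):
        transformed = bwt(text)
        print(text, transformed, count_runs(transformed))
    # banana annb$aa 5
    # abababababababab bbbbbbbb$aaaaaaaa 3
```

The repetitive input `"ab" * 8` has length 16 but its BWT has only 3 runs, which is the sense in which $r$ can be much smaller than $n$.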
