Optimal Construction of Compressed Indexes for Highly Repetitive Texts
Dominik Kempa

TL;DR
This paper introduces algorithms for constructing the key components of compressed text indexes (the BWT, the PLCP array, and the LZ77 parsing) that are time and space optimal for highly repetitive texts, i.e., texts whose BWT has few runs relative to their length.
Contribution
It presents new construction algorithms that are both time and space optimal for highly repetitive texts, using the number $r$ of runs in the BWT as the measure of repetitiveness to outperform existing general-purpose algorithms on such inputs.
Findings
Algorithms run in $O(n/\log_{\sigma} n + r\,{\rm polylog}\,n)$ time
Improves on previous $O(n)$-time algorithms when the input is sufficiently repetitive (few BWT runs)
Applicable to various string processing problems like Lyndon factorization and run-length compressed suffix arrays
Abstract
We propose algorithms that, given the input string of length $n$ over an integer alphabet of size $\sigma$, construct the Burrows-Wheeler transform (BWT), the permuted longest-common-prefix (PLCP) array, and the LZ77 parsing in $O(n/\log_{\sigma} n + r\,{\rm polylog}\,n)$ time and working space, where $r$ is the number of runs in the BWT of the input. These are the essential components of many compressed indexes such as the compressed suffix tree, the FM-index, and grammar and LZ77-based indexes, but also find numerous applications in sequence analysis and data compression. The value of $r$ is a common measure of repetitiveness that is significantly smaller than $n$ if the string is highly repetitive. Since just accessing every symbol of the string requires $\Omega(n/\log_{\sigma} n)$ time, the presented algorithms are time and space optimal for inputs satisfying the assumption $n/r\in\Omega({\rm…
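To make the repetitiveness measure concrete, the following minimal sketch computes the BWT by sorting all rotations and then counts its runs. This is a naive textbook construction for illustration only, not the algorithm proposed in the paper, and it is far slower than the bounds discussed above. The function names `naive_bwt` and `count_runs` and the `$` sentinel convention are assumptions made for this example.

```python
from itertools import groupby


def naive_bwt(text: str, sentinel: str = "$") -> str:
    """Return the BWT of text + sentinel by sorting all rotations.

    Illustrative only: sorting rotations takes far more time and space
    than the construction algorithms the paper describes.
    Assumes the sentinel compares smaller than every character of text.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rot[-1] for rot in rotations)


def count_runs(s: str) -> int:
    """Count maximal blocks of equal consecutive characters."""
    return sum(1 for _ in groupby(s))


if __name__ == "__main__":
    for text in ("banana", "ab" * 50):
        bwt = naive_bwt(text)
        print(f"n = {len(bwt)}, r = {count_runs(bwt)}")
```

On the periodic input `"ab" * 50`, the transformed string has length 101 but only 3 runs, whereas the short, less regular input `"banana"` yields 5 runs over 7 symbols. This gap between $r$ and $n$ is what the running-time bound exploits.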
