Decomposing Words for Enhanced Compression: Exploring the Number of Runs in the Extended Burrows-Wheeler Transform
Florian Ingels, Anaïs Denis, Bastien Cazaux

TL;DR
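This paper investigates how different decompositions of a word affect the number of runs in the extended Burrows-Wheeler Transform (eBWT). It shows that the number of decompositions is exponential and that the ratio between the run counts of the worst and best decompositions is unbounded, so the choice of decomposition matters for compression.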
Contribution
The paper shows that a word admits exponentially many decompositions, and that the ratio between the number of eBWT runs produced by the worst and the best decomposition is unbounded.
Findings
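The number of decompositions of a word is exponential, even under minimal constraints on the size of the subsets in the decomposition.
The ratio between the number of runs of the worst and best decompositions is unbounded.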
Decomposition choices significantly affect compression performance.
Abstract
The Burrows-Wheeler Transform (BWT) is a fundamental component in many data structures for text indexing and compression, widely used in areas such as bioinformatics and information retrieval. The extended BWT (eBWT) generalizes the classical BWT to multisets of strings, providing a flexible framework that captures many BWT-like constructions. Several known variants of the BWT can be viewed as instances of the eBWT applied to specific decompositions of a word. A central property of the BWT, essential for its compressibility, is the number of maximal ranges of equal letters, named runs. In this article, we explore how different decompositions of a word impact the number of runs in the resulting eBWT. First, we show that the number of decompositions of a word is exponential, even under minimal constraints on the size of the subsets in the decomposition. Second, we present an infinite…
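To make the notions of eBWT and runs concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names `ebwt` and `count_runs` are hypothetical, a decomposition is assumed here to mean cutting the word into contiguous factors (the paper's definition may be more general), and the quadratic sort-all-rotations approach is chosen for readability rather than efficiency. Rotations are compared by their infinite periodic repetition, truncated to twice the longest word length, which is enough to separate any two distinct infinite repetitions.

```python
from itertools import groupby


def ebwt(words):
    """Naive eBWT of a multiset of non-empty strings.

    Every rotation of every word is sorted by its infinite periodic
    repetition; the output collects the last letter of each rotation.
    """
    max_len = max(len(w) for w in words)
    # Two distinct infinite repetitions u^w and v^w differ within their
    # first |u| + |v| letters, so 2 * max_len letters suffice as a key.
    horizon = 2 * max_len
    rotations = []
    for w in words:
        for i in range(len(w)):
            rot = w[i:] + w[:i]
            reps = horizon // len(rot) + 1
            key = (rot * reps)[:horizon]
            rotations.append((key, rot[-1]))
    rotations.sort()
    return "".join(last for _, last in rotations)


def count_runs(s):
    """Number of maximal blocks of equal consecutive letters in s."""
    return sum(1 for _ in groupby(s))


if __name__ == "__main__":
    # The whole word as a single block (classical BWT, no end marker).
    whole = ebwt(["banana"])
    print(whole, count_runs(whole))   # nnbaaa 3

    # The same word cut into two contiguous factors.
    split = ebwt(["ban", "ana"])
    print(split, count_runs(split))   # nabnaa 5
```

On this toy input, keeping "banana" whole gives 3 runs while cutting it into "ban" and "ana" gives 5, a small instance of how the choice of decomposition changes the number of runs.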
Taxonomy
Topics: Algorithms and Data Compression · Genome Rearrangement Algorithms · Semigroups and Automata Theory
