Computing the optimal BWT of very large string collections
Davide Cenzato, Veronica Guerrini, Zsuzsanna Lipták, and Giovanna Rosone

TL;DR
This paper introduces optBWT, a tool that computes the BWT of large string collections with the minimum number of runs, significantly improving compression potential while maintaining efficiency.
Contribution
It presents the first practical tool that combines existing algorithms and data structures to guarantee a BWT with the minimum number of runs for large string collections.
Findings
Up to 31 times fewer runs on real data
Negligible overhead compared to existing methods
Significant reduction in run count improves compression
Abstract
It is known that the exact form of the Burrows-Wheeler-Transform (BWT) of a string collection depends, in most implementations, on the input order of the strings in the collection. Reordering strings of an input collection affects the number of equal-letter runs r, arguably the most important parameter of BWT-based data structures, such as the FM-index or the r-index. Bentley, Gibney, and Thankachan [ESA 2020] introduced a linear-time algorithm for computing the permutation of the input collection which yields the minimum number of runs of the resulting BWT. In this paper, we present the first tool that guarantees a Burrows-Wheeler-Transform with minimum number of runs (optBWT), by combining i) an algorithm that builds the BWT from a string collection (either SAIS-based [Cenzato et al., SPIRE 2021] or BCR [Bauer et al., CPM 2011]); ii) the SAP array data structure introduced in…
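The dependence on input order described in the abstract can be illustrated with a small sketch. This is a naive, brute-force illustration only: it is not the linear-time algorithm of Bentley, Gibney, and Thankachan, and it is not how optBWT works. The function names (`collection_bwt`, `count_runs`, `min_runs_brute_force`) are hypothetical, and the sketch assumes that each string gets its own end marker, ordered by input position, with all end markers written as the single symbol `$` in the output.

```python
from itertools import permutations


def collection_bwt(strings):
    """Naive BWT of a string collection (illustrative, quadratic or worse).

    Assumption: string i is terminated by its own end marker $_i, with
    $_0 < $_1 < ... < every ordinary character, so ties between equal
    suffixes are broken by input order. All end markers print as "$".
    """
    rows = []
    for i, s in enumerate(strings):
        for j in range(len(s) + 1):
            # Suffix s[j:] followed by the end marker of string i.
            key = [(1, ord(c)) for c in s[j:]] + [(0, i)]
            # Character preceding the suffix; the whole string wraps to "$".
            prev = s[j - 1] if j > 0 else "$"
            rows.append((key, prev))
    rows.sort(key=lambda row: row[0])
    return "".join(prev for _, prev in rows)


def count_runs(text):
    """Number of maximal equal-letter runs in text."""
    return sum(1 for k, c in enumerate(text) if k == 0 or c != text[k - 1])


def min_runs_brute_force(strings):
    """Try every input order; feasible only for a handful of strings."""
    return min(count_runs(collection_bwt(p)) for p in permutations(strings))


if __name__ == "__main__":
    for order in (["AC", "BC", "AC"], ["AC", "AC", "BC"]):
        bwt = collection_bwt(order)
        print(order, bwt, count_runs(bwt))
    # ['AC', 'BC', 'AC'] CCC$$$ABA 5
    # ['AC', 'AC', 'BC'] CCC$$$AAB 4
```

In this toy collection the three suffixes starting with `C` are tied, so their preceding characters appear in input order: `ABA` (two extra run boundaries) versus `AAB` (one). Choosing the order of such tied suffixes well is what reduces the number of runs; the brute-force search over all permutations is only there to confirm the minimum on tiny inputs.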
Taxonomy
Topics: Algorithms and Data Compression · Genomics and Phylogenetic Studies · Data Mining Algorithms and Applications
