TL;DR
This paper introduces a fast, simple, and memory-efficient linear-time algorithm for constructing the original eBWT of string collections, improving scalability and performance over previous methods, and includes a novel approach for single-string BWT computation without end markers.
Contribution
It presents a simple linear-time algorithm for the original eBWT that does not require the preprocessing of Bannai et al. [CPM 2021]. As a byproduct, it gives the first linear-time algorithm for the BWT of a single string that uses neither an end-of-string symbol nor Lyndon rotations.
Findings
The algorithm (pfpebwt) is the fastest method on the genome collections evaluated.
Its peak memory usage is at most twice that of the second-best method.
It achieves a 57.1x reduction in peak memory compared to similar methods.
Abstract
Mantaci et al. [TCS 2007] defined the eBWT to extend the definition of the BWT to a collection of strings. However, since its introduction, the term has been used more generally to describe any BWT of a collection of strings, and the fundamental property of the original definition (i.e., the independence from the input order) is frequently disregarded. In this paper, we propose a simple linear-time algorithm for the construction of the original eBWT, which does not require the preprocessing of Bannai et al. [CPM 2021]. As a byproduct, we obtain the first linear-time algorithm for computing the BWT of a single string that uses neither an end-of-string symbol nor Lyndon rotations. We combine our new eBWT construction with a variation of prefix-free parsing to allow for scalable construction of the eBWT. We evaluate our algorithm (pfpebwt) on sets of human chromosomes 19, Salmonella, and…
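To make the order-independence property concrete, the following is a minimal Python sketch that computes an eBWT directly from its definition. It is a naive illustration, not the paper's linear-time algorithm and not pfpebwt. It assumes the usual formulation in which all cyclic rotations of all input strings are sorted by omega-order (comparing the infinite repetitions of two rotations), and the eBWT is the sequence of last characters in that order. The function names omega_cmp and naive_ebwt are hypothetical, chosen only for this example.

```python
from functools import cmp_to_key


def omega_cmp(u, v):
    """Compare the infinite repetitions u^omega and v^omega.

    By the Fine and Wilf periodicity lemma, if the two infinite strings
    agree on their first len(u) + len(v) characters, they are equal,
    so inspecting that many characters is enough.
    """
    for i in range(len(u) + len(v)):
        a, b = u[i % len(u)], v[i % len(v)]
        if a != b:
            return -1 if a < b else 1
    return 0


def naive_ebwt(strings):
    """Naive eBWT from the definition (assumed formulation, far from linear time)."""
    # Collect every cyclic rotation of every non-empty input string.
    rotations = [s[i:] + s[:i] for s in strings for i in range(len(s))]
    # Sort rotations by omega-order rather than plain lexicographic order.
    rotations.sort(key=cmp_to_key(omega_cmp))
    # The transform is the last character of each rotation, in sorted order.
    return "".join(r[-1] for r in rotations)


print(naive_ebwt(["ab", "aba"]))   # babaa
print(naive_ebwt(["aba", "ab"]))   # babaa (same result for a different input order)
print(naive_ebwt(["banana"]))      # nnbaaa (single string, no end marker)
```

With a single input string, all rotations have the same length, so omega-order coincides with lexicographic order and the sketch reduces to the rotation-based BWT without an end-of-string symbol. Rotations that tie under omega-order share the same last character, so ties do not affect the output.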
