Computing the original eBWT faster, simpler, and with less memory

Christina Boucher; Davide Cenzato; Zsuzsanna Lipták; Massimiliano Rossi; Marinella Sciortino

arXiv:2106.11191·cs.DS·June 22, 2021

TL;DR

This paper introduces a fast, simple, and memory-efficient linear-time algorithm for constructing the original eBWT of string collections, improving scalability and performance over previous methods, and includes a novel approach for single-string BWT computation without end markers.

Contribution

The paper presents a linear-time algorithm for the original eBWT that avoids the preprocessing of Bannai et al. [CPM 2021] and, as a byproduct, the first linear-time algorithm for computing the BWT of a single string that uses neither an end-of-string symbol nor Lyndon rotations.

Findings

01. The algorithm (pfpebwt) is the fastest method on the evaluated genome collections.

02. Its peak memory usage is at most twice that of the second-best method.

03. It achieves a 57.1x reduction in peak memory compared to similar methods.

Abstract

Mantaci et al. [TCS 2007] defined the eBWT to extend the definition of the BWT to a collection of strings. Since its introduction, however, the term has been used more generally to describe any BWT of a collection of strings, and the fundamental property of the original definition (i.e., the independence from the input order) is frequently disregarded. In this paper, we propose a simple linear-time algorithm for the construction of the original eBWT, which does not require the preprocessing of Bannai et al. [CPM 2021]. As a byproduct, we obtain the first linear-time algorithm for computing the BWT of a single string that uses neither an end-of-string symbol nor Lyndon rotations. We combine our new eBWT construction with a variation of prefix-free parsing to allow for scalable construction of the eBWT. We evaluate our algorithm (pfpebwt) on sets of human chromosomes 19, Salmonella, and…
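
To make the order-independence property concrete, the following is a minimal sketch of the original eBWT definition, not the paper's linear-time algorithm and not the pfpebwt implementation. It sorts all rotations (conjugates) of all input strings by omega-order, that is, by comparing their infinite repetitions, and concatenates the last characters. The function names `omega_cmp` and `naive_ebwt` are hypothetical, and the sketch assumes non-empty, primitive input strings. It takes superlinear time and is meant only for small examples.

```python
from functools import cmp_to_key


def omega_cmp(u, v):
    """Compare u and v by omega-order: the order of u^ω versus v^ω.

    By the Fine and Wilf theorem, comparing the first |u| + |v|
    characters of the two infinite repetitions is enough to decide.
    Assumes u and v are non-empty.
    """
    n = len(u) + len(v)
    pu = (u * (n // len(u) + 1))[:n]
    pv = (v * (n // len(v) + 1))[:n]
    return (pu > pv) - (pu < pv)


def naive_ebwt(strings):
    """Naive eBWT of a collection (illustrative, superlinear time).

    Assumes every string is non-empty and primitive.
    """
    # All conjugates (rotations) of every string in the collection.
    conjugates = [s[i:] + s[:i] for s in strings for i in range(len(s))]
    # Sort by omega-order rather than plain lexicographic order.
    conjugates.sort(key=cmp_to_key(omega_cmp))
    # The transform is the sequence of last characters.
    return "".join(c[-1] for c in conjugates)


if __name__ == "__main__":
    print(naive_ebwt(["banana"]))   # single string, no end marker
    print(naive_ebwt(["b", "ab"]))
    print(naive_ebwt(["ab", "b"]))  # same collection, different input order
```

Because the rotations are pooled before sorting, permuting the input strings cannot change the result. Omega-order also differs from lexicographic order: for the rotations "b" and "ba", lexicographic order places "b" first, whereas omega-order places "ba" first, since "baba…" precedes "bbbb…".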
