Efficient construction of the extended BWT from grammar-compressed DNA sequencing reads
Diego Diaz-Dominguez and Gonzalo Navarro

TL;DR
This paper introduces an algorithm that builds the extended BWT (eBWT) directly from grammar-compressed DNA sequencing reads. It reuses the string repetitions captured by the grammar, so the resources spent per input symbol fall as the collection becomes more repetitive, which makes larger genomic datasets tractable.
Contribution
The paper presents a novel method for building the extended BWT directly from grammar-compressed data, improving efficiency for highly repetitive genomic sequences.
Findings
Resource usage per input symbol decreases as the collection becomes more repetitive.
Enables construction of self-indexes for massive genomic datasets.
Facilitates accurate bioinformatic analyses on larger sequencing read datasets.
Abstract
We present an algorithm for building the extended BWT (eBWT) of a string collection from its grammar-compressed representation. Our technique exploits the string repetitions captured by the grammar to boost the computation of the eBWT. Thus, the more repetitive the collection is, the lower are the resources we use per input symbol. We rely on a new grammar recently proposed at DCC'21 whose nonterminals serve as building blocks for inducing the eBWT. A relevant application for this idea is the construction of self-indexes for analyzing sequencing reads -- massive and repetitive string collections of raw genomic data. Self-indexes have become increasingly popular in Bioinformatics as they can encode more information in less space. Our efficient eBWT construction opens the door to perform accurate bioinformatic analyses on more massive sequence datasets, which are not tractable with…
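To make the object being built concrete, here is a minimal sketch of the eBWT computed straight from its usual definition: take every rotation of every string, sort the rotations by comparing their infinite repetitions (the ω-order), and output the last symbol of each. This is not the paper's grammar-based algorithm. It is a naive baseline that needs quadratic space and ignores repetitions entirely, and it assumes the terminator-free definition, which may differ from the variant used in the paper. The names `omega_cmp` and `naive_ebwt` are hypothetical and chosen for illustration.

```python
from functools import cmp_to_key


def omega_cmp(u: str, v: str) -> int:
    """Compare u and v by their infinite repetitions u^ω and v^ω."""
    # Checking |u| + |v| symbols is enough: if the two infinite words agree
    # on that many positions, they are equal (Fine and Wilf theorem).
    for i in range(len(u) + len(v)):
        a, b = u[i % len(u)], v[i % len(v)]
        if a != b:
            return -1 if a < b else 1
    return 0


def naive_ebwt(strings: list[str]) -> str:
    """Naive eBWT: sort all rotations in ω-order, keep each last symbol."""
    # Materialize every rotation of every string (quadratic space; sketch only).
    rotations = [s[i:] + s[:i] for s in strings for i in range(len(s))]
    rotations.sort(key=cmp_to_key(omega_cmp))
    return "".join(r[-1] for r in rotations)


if __name__ == "__main__":
    # Two toy "reads"; the output is a permutation of their symbols.
    print(naive_ebwt(["GATTACA", "TACAGA"]))
```

For the toy collection `["abac", "bca"]`, the sorted rotations are `abac, abc, acab, baca, bca, caba, cab`, so the sketch returns `ccbaaab`. A construction that works on the compressed grammar avoids materializing and sorting the rotations explicitly, which is where the savings on repetitive collections come from.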
Taxonomy
Topics: Genomics and Phylogenetic Studies · Natural Language Processing Techniques · Algorithms and Data Compression
