Prefix-free parsing for merging big BWTs

Diego Diaz-Dominguez; Travis Gagie; Veronica Guerrini; Ben Langmead; Zsuzsanna Liptak; Giovanni Manzini; Francesco Masillo; Vikram Shivakumar

arXiv:2506.03294·cs.DS·June 9, 2025

TL;DR

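This paper shows how to reduce the memory footprint of prefix-free parsing (PFP) when building the BWT of a very large dataset: split the dataset into smaller datasets that are not very similar to each other, build the BWT of each one, and then merge those BWTs into the BWT of the whole dataset.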

Contribution

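It shows that when a large dataset can be broken into small, mutually dissimilar datasets, building the BWT of each small dataset and merging the results drastically reduces PFP's memory footprint compared with running PFP on the whole dataset.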

Findings

01

PFP's memory footprint is drastically reduced by building the BWTs of the small datasets separately and then merging them into the BWT of the whole dataset.

02

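The method applies when the dataset can be broken into small datasets that are not very similar to each other, such as collections of many copies of the genomes of each of several species, or collections of many copies of each of the human chromosomes.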

03

The approach targets truly huge datasets, for which running PFP on the whole input can use an unreasonable amount of memory.

Abstract

When building Burrows-Wheeler Transforms (BWTs) of truly huge datasets, prefix-free parsing (PFP) can use an unreasonable amount of memory. In this paper we show how if a dataset can be broken down into small datasets that are not very similar to each other -- such as collections of many copies of genomes of each of several species, or collections of many copies of each of the human chromosomes -- then we can drastically reduce PFP's memory footprint by building the BWTs of the small datasets and then merging them into the BWT of the whole dataset.
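The build-then-merge idea in the abstract can be illustrated with a toy sketch. The abstract does not say which merging algorithm the paper uses, so the code below substitutes a simple interleave-refinement merge in the style of Holt and McMillan, and it builds the small BWTs by naive suffix sorting rather than by PFP. The function names `naive_bwt` and `merge_bwts` are hypothetical, and the sketch assumes every string is non-empty, ends with an implicit `$` terminator, and uses only characters that sort after `$`.

```python
from collections import Counter


def naive_bwt(strings):
    """BWT of a string collection by explicit suffix sorting (illustration only).

    Each string gets a '$' terminator; identical suffixes from different
    strings are ordered by string index.
    """
    rows = []
    for i, s in enumerate(strings):
        t = s + "$"
        for j in range(len(t)):
            # t[j - 1] is the character preceding the suffix (cyclically, so
            # the suffix that is the whole string is preceded by '$').
            rows.append((t[j:], i, t[j - 1]))
    rows.sort(key=lambda r: (r[0], r[1]))
    return "".join(r[2] for r in rows)


def merge_bwts(bwt0, bwt1):
    """Merge two collection BWTs into the BWT of collection 0 followed by 1.

    Assumption: this is an interleave-refinement merge in the style of
    Holt and McMillan, not necessarily the method used in the paper.
    """
    n = len(bwt0) + len(bwt1)
    counts = Counter(bwt0) + Counter(bwt1)
    # start[c] = first row of the merged BWT whose suffix begins with c.
    start, total = {}, 0
    for c in sorted(counts):
        start[c] = total
        total += counts[c]
    m0, m1 = bwt0.count("$"), bwt1.count("$")
    src = (bwt0, bwt1)

    # z[k] says which input BWT supplies row k of the merged BWT.
    z = [0] * len(bwt0) + [1] * len(bwt1)
    while True:
        nxt = dict(start)
        new_z = [0] * n
        # Rows of the bare terminators: collection 0 first, then collection 1.
        new_z[m0:m0 + m1] = [1] * m1
        pos = [0, 0]
        for b in z:
            c = src[b][pos[b]]
            pos[b] += 1
            if c != "$":
                # Stable counting sort by character refines the order by one
                # more symbol of context per round.
                new_z[nxt[c]] = b
                nxt[c] += 1
        if new_z == z:
            break
        z = new_z

    # Read the merged BWT off the final interleave.
    pos = [0, 0]
    out = []
    for b in z:
        out.append(src[b][pos[b]])
        pos[b] += 1
    return "".join(out)


a, b = ["AC"], ["AB"]
merged = merge_bwts(naive_bwt(a), naive_bwt(b))
assert merged == naive_bwt(a + b)
```

The merge only reads the two input BWTs and an interleave vector, never the original texts. That is what makes a build-small-then-merge pipeline plausible for inputs too large to parse in one pass, although a practical implementation would need far more compact data structures than these Python lists.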

Taxonomy

Topics: Algorithms and Data Compression · Genome Rearrangement Algorithms · Natural Language Processing Techniques