Prefix-Free Parsing for Building Big BWTs

Christina Boucher; Travis Gagie; Alan Kuhnle; Ben Langmead; Giovanni Manzini; Taher Mun

arXiv:1803.11245 · cs.DS · November 19, 2018

TL;DR

This paper introduces prefix-free parsing, a preprocessing method for efficiently constructing large, compressed BWT-based indexes of massive, highly repetitive genomic databases, significantly reducing memory and time requirements.

Contribution

The paper presents a novel prefix-free parsing algorithm that enables scalable, memory-efficient construction of BWT-based indexes for extremely large genomic datasets.

Findings

1. The dictionary D and parse P are much smaller than the text T in practice
2. Constructed a 131 MB run-length compressed FM-index for 1000 human chromosomes in 2 hours
3. Estimated 102 hours to build a 6.73 GB index for 1000 human genomes using 1 TB of memory

Abstract

High-throughput sequencing technologies have led to explosive growth of genomic databases, one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases, but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly repetitive, a characteristic that can be exploited to ease the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text $T$ as input and in one pass generates a dictionary $D$ and a parse $P$ of $T$ with the property that the BWT of $T$ can be constructed from $D$ and $P$ using workspace proportional to their total size and $O(|T|)$ time. Our experiments show that $D$ and $P$ are significantly smaller than $T$ in practice,…
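To make the roles of $D$ and $P$ concrete, the one-pass parsing step might be sketched as follows. The abstract does not spell out the mechanics, so everything specific here is an assumption made for illustration: the window length `w`, the modulus `p`, the two sentinel characters, the simple polynomial hash used in place of a true rolling hash, and the function names `prefix_free_parse` and `reconstruct`. The sketch only builds the dictionary and parse; it does not compute the BWT.

```python
def prefix_free_parse(text, w=4, p=11):
    """Split `text` into overlapping phrases; return (dictionary, parse).

    Assumptions (not from the abstract): a phrase ends wherever a
    length-w window hashes to 0 modulo p, and consecutive phrases
    overlap by exactly w characters.
    """
    # Hypothetical sentinels, assumed absent from the input text.
    START, END = "\x01", "\x02"
    t = START + text + END * w

    def is_trigger(window):
        # Stand-in for a Karp-Rabin rolling hash, recomputed per window
        # for clarity rather than speed.
        h = 0
        for ch in window:
            h = (h * 256 + ord(ch)) % 1_000_000_007
        return h % p == 0

    phrases = []
    start = 0
    for i in range(1, len(t) - w + 1):
        at_end = (i + w == len(t))            # final window of END sentinels
        inside_text = (i + w <= len(text) + 1)  # window holds no sentinel
        if at_end or (inside_text and is_trigger(t[i:i + w])):
            phrases.append(t[start:i + w])    # phrase includes the trigger window
            start = i                         # next phrase begins at the same window

    dictionary = sorted(set(phrases))         # D: distinct phrases in sorted order
    rank = {ph: r for r, ph in enumerate(dictionary)}
    parse = [rank[ph] for ph in phrases]      # P: text as a sequence of phrase ranks
    return dictionary, parse


def reconstruct(dictionary, parse, w=4):
    """Invert the parse: drop each later phrase's w-character overlap."""
    out = dictionary[parse[0]]
    for r in parse[1:]:
        out += dictionary[r][w:]
    return out[1:-w]                          # strip START and the w END sentinels
```

On a repetitive input, the same phrases recur, so the dictionary stays small while the parse records only their ranks; calling `reconstruct(*prefix_free_parse(text))` with matching `w` should return the original text.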

Code & Models

Repositories

https://gitlab.com/manzai/Big-BWT (official)