TL;DR
This paper introduces prefix-free parsing, a one-pass preprocessing method that makes it practical to build large, compressed BWT-based indexes for massive, highly repetitive genomic databases using far less memory and time.
Contribution
The paper presents a novel prefix-free parsing algorithm that enables scalable, memory-efficient construction of BWT-based indexes for extremely large genomic datasets.
Findings
The dictionary D and the parse P are much smaller than the input text T in practice
Built a 131 MB run-length compressed FM-index for 1000 copies of human chromosome 19 in 2 hours
Extrapolated that a 6.73 GB index for 1000 human genomes would take about 102 hours to build using about 1 TB of memory
Abstract
High-throughput sequencing technologies have led to explosive growth of genomic databases, one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases, but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly repetitive, a characteristic that can be exploited to ease the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input and in one pass generates a dictionary D and a parse P of T, with the property that the BWT of T can be constructed from D and P using workspace proportional to their total size and O(|T|) time. Our experiments show that D and P are significantly smaller than T in practice,…
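The one-pass parsing step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes the commonly described formulation in which a window of length w slides over the text and a phrase ends wherever the window's hash is 0 modulo p. The function names, the sentinel characters, the default values of w and p, and the non-rolling hash are all assumptions made for this example.

```python
def prefix_free_parse(text, w=4, p=11):
    """Split text into overlapping phrases; return (dictionary, parse).

    Hypothetical sketch: w (window length) and p (modulus) are assumed
    parameters, and the sentinels below are chosen only for illustration.
    """
    # Assumed sentinels: '\x02' marks the start, w copies of '\x01' the end.
    t = "\x02" + text + "\x01" * w

    def is_trigger(window):
        # Hash the window as a base-256 number modulo p. A practical version
        # would update this in O(1) per step with a rolling hash instead.
        h = 0
        for ch in window:
            h = (h * 256 + ord(ch)) % p
        return h == 0

    phrases = []
    start = 0
    for i in range(1, len(t) - w + 1):
        at_end = (i + w == len(t))  # the final window always closes a phrase
        if at_end or is_trigger(t[i:i + w]):
            # The phrase ends with the trigger window; the next phrase starts
            # with it, so consecutive phrases overlap by exactly w characters.
            phrases.append(t[start:i + w])
            start = i

    dictionary = sorted(set(phrases))            # D: distinct phrases, sorted
    rank = {ph: r for r, ph in enumerate(dictionary)}
    parse = [rank[ph] for ph in phrases]         # P: phrase ranks in text order
    return dictionary, parse


def rebuild(dictionary, parse, w=4):
    """Invert the parse by dropping each later phrase's w-character overlap."""
    first = dictionary[parse[0]]
    return first + "".join(dictionary[r][w:] for r in parse[1:])
```

On a repetitive input, the same phrases tend to recur, so the dictionary stays small while the parse is a short list of integers; `rebuild(D, P, w)` returns the sentinel-padded text, which shows that D and P together lose no information. Constructing the BWT from D and P is a separate step that this sketch does not cover.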
