Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform
Anthony J. Cox, Markus J. Bauer, Tobias Jakobi, Giovanna Rosone

TL;DR
This paper presents a novel, efficient method for large-scale genomic data compression using the Burrows-Wheeler transform, enabling significant reduction in storage space and facilitating indexing of massive DNA datasets.
Contribution
The authors introduce an implicit sorting strategy that captures the compression benefit of reordering the sequences in a collection without the overhead of explicitly sorting the reads, producing output more than 4 times smaller than a standard BWT-based compressor (bzip2).
Findings
Compressed real human genome sequence data losslessly to under 0.5 bits per base
Enabled building of compressed full-text indexes on large DNA collections
Demonstrated the effectiveness of sequence reordering and trimming for compression
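Why reordering the sequences helps can be illustrated with a toy example. The sketch below builds the BWT of a small read collection by naively sorting all suffixes, then counts the runs of identical symbols, since fewer runs generally mean better second-stage compression. It is a minimal illustration under stated assumptions, not the authors' algorithm: the function names `bwt_collection` and `count_runs` are hypothetical, a plain lexicographic sort of the reads stands in for the paper's ordering, and the implicit sorting strategy itself is not reproduced.

```python
def bwt_collection(reads):
    """Naive BWT of a read collection (illustrative only, quadratic memory).

    Each read is treated as if terminated by its own end marker, with the
    markers ordered by the read's position in the collection. All markers
    are printed as '$' in the output.
    """
    entries = []
    for i, read in enumerate(reads):
        for j in range(len(read) + 1):
            # Symbol preceding the suffix; the whole read is preceded by its marker.
            prev = read[j - 1] if j > 0 else "$"
            entries.append((read[j:], i, prev))
    # Identical suffixes are ordered by read index, so the order of the
    # collection decides how their preceding symbols are arranged.
    entries.sort(key=lambda e: (e[0], e[1]))
    return "".join(prev for _, _, prev in entries)


def count_runs(text):
    """Number of maximal runs of identical symbols in text."""
    return sum(1 for k, ch in enumerate(text) if k == 0 or ch != text[k - 1])


reads = ["AGT", "CGT", "AGT", "CGT"]  # toy data, not from the paper
original = bwt_collection(reads)
reordered = bwt_collection(sorted(reads))

print(original, count_runs(original))    # TTTT$$$$ACACGGGG 7
print(reordered, count_runs(reordered))  # TTTT$$$$AACCGGGG 5
```

All four toy reads share the suffix `GT`, so the symbols preceding it appear in collection order. Grouping identical reads together turns `ACAC` into `AACC`, which removes two runs without changing the content of the collection.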
Abstract
Motivation: The Burrows-Wheeler transform (BWT) is the foundation of many algorithms for compression and indexing of text data, but the cost of computing the BWT of very large string collections has prevented these techniques from being widely applied to the large sets of sequences often encountered as the outcome of DNA sequencing experiments. In previous work, we presented a novel algorithm that allows the BWT of human genome scale data to be computed on very moderate hardware, thus enabling us to investigate the BWT as a tool for the compression of such datasets.
Results: We first used simulated reads to explore the relationship between the level of compression and the error rate, the length of the reads and the level of sampling of the underlying genome and compare choices of second-stage compression algorithm. We demonstrate that compression may be greatly improved by a…
