Disk-based genome sequencing data compression

Szymon Grabowski; Sebastian Deorowicz; Łukasz Roguski

arXiv:1405.6874 · cs.DS · September 19, 2014

TL;DR

This paper introduces ORCOM, a disk-based compressor for the DNA stream of genome sequencing reads. It uses minimizers to reach 0.317 bits per base on a 134.0 Gb human dataset, compared with 0.518 bits per base for the best previous disk-based method, which makes large sequencing datasets cheaper to store.

Contribution

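ORCOM (Overlapping Reads COmpression with Minimizers) is a disk-based compression algorithm for sequencing reads (DNA only). It relies on minimizers, a conceptually simple and easily parallelizable idea, to capture the redundancy between overlapping reads.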
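As a rough illustration of the minimizer idea, the sketch below assigns each read to a bucket keyed by its lexicographically smallest k-mer, so that overlapping reads tend to land in the same bucket and each bucket can be processed independently. This is a minimal sketch of the general technique, not ORCOM's actual procedure: the function names, the value of `k`, and the toy reads are assumptions made for illustration, and details such as reverse-complement handling are left out.

```python
from collections import defaultdict


def minimizer(read: str, k: int) -> str:
    """Return the lexicographically smallest k-mer of a read."""
    if len(read) < k:
        raise ValueError("read is shorter than k")
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def bucket_reads(reads, k):
    """Group reads by minimizer (hypothetical helper, not ORCOM's API)."""
    buckets = defaultdict(list)
    for read in reads:
        buckets[minimizer(read, k)].append(read)
    return dict(buckets)


# Toy reads: the first three overlap the same region, the fourth does not.
reads = ["TTACGTAGG", "GACGTAGGC", "CCTTACGTA", "GGGTTTGGG"]
buckets = bucket_reads(reads, k=4)

for key in sorted(buckets):
    print(key, buckets[key])
```

Here the three overlapping reads share the minimizer `ACGT` and end up together, while the unrelated read falls into its own bucket. Since buckets do not depend on each other, they could be written to disk and compressed in parallel.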

Findings

01

Achieves a compression ratio of 0.317 bits per base.

02

Compresses a 134.0 Gb (gigabase) human genome sequencing dataset into 5.31 GB.

03

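Outperforms the best previous disk-based method, the BWT-based approach of Cox et al. (2012), which achieves 0.518 bits per base on the same dataset.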
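The figures above are consistent with each other, which a short calculation can confirm. The sketch assumes that "Gb" means gigabases (10^9 bases) and "GB" means 10^9 bytes. The size computed for the BWT-based method is derived here from its 0.518 bits per base and is not a number reported on this page.

```python
BASES = 134.0e9              # dataset size in bases (assumes Gb = gigabases)
BITS_PER_BASE_ORCOM = 0.317  # ratio reported for ORCOM
BITS_PER_BASE_BWT = 0.518    # ratio reported for Cox et al. (2012)


def compressed_gigabytes(bases: float, bits_per_base: float) -> float:
    """Convert a bits-per-base ratio into a compressed size in GB (10^9 bytes)."""
    return bases * bits_per_base / 8 / 1e9


orcom_gb = compressed_gigabytes(BASES, BITS_PER_BASE_ORCOM)
bwt_gb = compressed_gigabytes(BASES, BITS_PER_BASE_BWT)

print(f"ORCOM: {orcom_gb:.2f} GB")  # ORCOM: 5.31 GB
print(f"BWT-based (derived): {bwt_gb:.2f} GB")
print(f"Ratio of sizes: {bwt_gb / orcom_gb:.2f}x")
```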

Abstract

Motivation: High-coverage sequencing data have significant, yet hard to exploit, redundancy. Most FASTQ compressors cannot efficiently compress the DNA stream of large datasets, since the redundancy between overlapping reads cannot be easily captured in the (relatively small) main memory. More interesting solutions for this problem are disk-based (Yanovsky, 2011; Cox et al., 2012), where the better of these two, from Cox et al. (2012), is based on the Burrows-Wheeler transform (BWT) and achieves 0.518 bits per base for a 134.0 Gb human genome sequencing collection with almost 45-fold coverage. Results: We propose ORCOM (Overlapping Reads COmpression with Minimizers), a compression algorithm dedicated to sequencing reads (DNA only). Our method makes use of a conceptually simple and easily parallelizable idea of minimizers, to obtain 0.317 bits per base as the compression ratio,…

Taxonomy

Topics: Algorithms and Data Compression · Advanced Data Storage Technologies · Plant nutrient uptake and metabolism