GDC 2: Compression of large collections of genomes
Sebastian Deorowicz, Agnieszka Danek, Marcin Niemiec

TL;DR
This paper introduces a fast compression algorithm for large collections of complete genomes. It shrinks a collection of 1092 human genomes about 9,500-fold while processing data at 200 MB/s, which makes storing extensive human genome collections cost-effective.
Contribution
The paper presents a compression algorithm for large genomic collections whose compression ratio is about 4 times better than that of other existing compressors, while processing data at 200 MB/s on a modern workstation.
Findings
Achieves about 9,500-fold compression of a collection of 1092 human diploid genomes.
Compresses about 6.7 TB of data into about 700 MB.
Processes data at 200 MB/s on a modern workstation.
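The three figures above can be cross-checked with simple arithmetic. The sketch below is only a back-of-the-envelope check, not part of the paper. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and, for the timing estimate, that the 200 MB/s figure applies to the uncompressed input in a single pass. Both are assumptions, and the variable names are illustrative.

```python
# Back-of-the-envelope check of the reported figures.
# Assumption: decimal units (1 TB = 10**12 bytes, 1 MB = 10**6 bytes).
TB = 10**12
MB = 10**6

uncompressed_bytes = 6.7 * TB  # reported size of the collection ("6.7 TB")
compressed_bytes = 700 * MB    # reported compressed size ("about 700 MB")
genomes = 1092                 # number of genomes in the collection

# Assumption: the reported 200 MB/s refers to uncompressed input throughput.
throughput_bytes_per_s = 200 * MB

ratio = uncompressed_bytes / compressed_bytes
per_genome_mb = compressed_bytes / genomes / MB
hours = uncompressed_bytes / throughput_bytes_per_s / 3600

print(f"compression ratio: about {ratio:,.0f}x")
print(f"compressed size per genome: about {per_genome_mb:.2f} MB")
print(f"single pass over the input at 200 MB/s: about {hours:.1f} h")
```

The ratio comes out near 9,570, consistent with the "about 9,500 times" claim, and the compressed collection averages well under 1 MB per genome.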
Abstract
The falling price of high-throughput genome sequencing is changing the landscape of modern genomics. A number of large-scale projects aimed at sequencing many human genomes are in progress. Genome sequencing is also becoming an important aid in personalized medicine. One significant side effect of this change is the need to store and transfer huge amounts of genomic data. In this paper we deal with the problem of compressing large collections of complete genomic sequences. We propose an algorithm that is able to compress a collection of 1092 human diploid genomes about 9,500 times. This result is about 4 times better than what is offered by other existing compressors. Moreover, our algorithm is very fast, processing the data at 200 MB/s on a modern workstation. As a consequence, the proposed algorithm allows storing the complete genomic collections at…
