Genome Compression Against a Reference

Anirduddha Laud; Gaurav Menghani; Madhava Keralapura

arXiv:2010.02286 · q-bio.GN · October 7, 2020

TL;DR

This paper builds on DNAZip, a tool that compresses a human genome by storing only its differences from a reference genome, and achieves roughly 11% additional compression on top of DNAZip's ~4 MB output, reducing the cost of storing and transmitting human genome data.

Contribution

The work introduces improvements to the DNAZip library that yield about 11% additional compression for genome sequences stored as differences against a reference genome.

Findings

01 Achieved ~11% additional compression over DNAZip.

02 Reduced the compressed genome size below DNAZip's ~4 MB baseline.

03 Enables further savings when storing and transmitting genomic data.

Abstract

Being able to store and transmit human genome sequences is an important part of genomic research and industrial applications. The complete human genome has 3.1 billion base pairs (haploid), and storing the entire genome naively takes about 3 GB, which is infeasible for large-scale use. However, human genomes are highly redundant: any given individual's genome differs from another individual's genome by less than 1%. Tools such as DNAZip express a given genome sequence by recording only the differences between that sequence and a reference genome sequence. This allows the given genome to be compressed losslessly to ~4 MB. In this work, we demonstrate additional improvements on top of the DNAZip library, showing ~11% additional compression over DNAZip's already impressive results. This would allow further savings in disk space and network…
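The core idea in the abstract, storing only the differences from a reference, can be illustrated with a minimal sketch. This is not DNAZip's actual format or the paper's method: the function names `diff_against_reference` and `reconstruct` are hypothetical, the toy sequences are made up, and the sketch assumes substitutions only (no insertions or deletions), so both sequences must have the same length.

```python
from typing import List, Tuple

# (0-based position, base found in the individual's genome) -- illustrative type
Variant = Tuple[int, str]


def diff_against_reference(reference: str, genome: str) -> List[Variant]:
    """Record only the positions where `genome` differs from `reference`.

    Assumption: substitutions only, so the two sequences have equal length.
    """
    if len(reference) != len(genome):
        raise ValueError("this sketch handles equal-length sequences only")
    return [(i, g) for i, (r, g) in enumerate(zip(reference, genome)) if r != g]


def reconstruct(reference: str, variants: List[Variant]) -> str:
    """Losslessly rebuild the genome from the reference plus the variant list."""
    bases = list(reference)
    for pos, base in variants:
        bases[pos] = base
    return "".join(bases)


# Toy data, invented for illustration only.
reference = "ACGTACGTACGTACGT"
genome = "ACGTACCTACGTACGA"

variants = diff_against_reference(reference, genome)
restored = reconstruct(reference, variants)
print(variants)            # the only data that needs to be stored
print(restored == genome)  # the round trip is lossless
```

Because two human genomes differ in under 1% of positions, the variant list is far smaller than the full sequence. A real tool would also have to handle insertions and deletions and would encode positions and bases compactly, which this sketch does not attempt.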

Taxonomy

Topics: CRISPR and Genetic Engineering · Race, Genetics, and Society · Genetic Mapping and Diversity in Plants and Animals