Reference Based Genome Compression
Bobbie Chern, Idoia Ochoa, Alexandros Manolakos, Albert No, Kartik Venkat, Tsachy Weissman

TL;DR
This paper introduces a reference-based genome compression algorithm that significantly reduces genome storage size by leveraging known reference genomes, outperforming generic compression tools.
Contribution
The paper presents a novel algorithm that compresses target genomes using reference genomes, achieving higher compression ratios than standard methods.
Findings
Reduces Watson's genome from 2991 MB to 6.99 MB
Gzip compresses Watson's genome to 834.8 MB
On this genome, the compressed output is roughly 119 times smaller than Gzip's (6.99 MB versus 834.8 MB)
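The reported sizes can be turned into compression ratios with a few lines of arithmetic. The snippet below is only a convenience calculation on the figures quoted above; the function and variable names are illustrative and do not come from the paper.

```python
# Sizes in megabytes, as reported for James Watson's genome.
original_mb = 2991.0
proposed_mb = 6.99   # reference-based algorithm, hg18 as reference
gzip_mb = 834.8      # Gzip


def compression_ratio(original: float, compressed: float) -> float:
    """Return how many times smaller the compressed file is."""
    return original / compressed


ratio_proposed = compression_ratio(original_mb, proposed_mb)
ratio_gzip = compression_ratio(original_mb, gzip_mb)
# How much smaller the proposed output is than Gzip's output.
advantage_over_gzip = gzip_mb / proposed_mb

print(f"proposed: {ratio_proposed:.1f}x")
print(f"gzip:     {ratio_gzip:.2f}x")
print(f"proposed output vs gzip output: {advantage_over_gzip:.1f}x smaller")
```

This gives a ratio of roughly 428 for the reference-based algorithm and roughly 3.6 for Gzip.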
Abstract
DNA sequencing technology has advanced to a point where storage is becoming the central bottleneck in the acquisition and mining of more data. Large amounts of data are vital for genomics research, and generic compression tools, while viable, cannot offer the same savings as approaches tuned to inherent biological properties. We propose an algorithm to compress a target genome given a known reference genome. The proposed algorithm first generates a mapping from the reference to the target genome, and then compresses this mapping with an entropy coder. As an illustration of the performance: applying our algorithm to James Watson's genome with hg18 as a reference, we are able to reduce the 2991 megabyte (MB) genome down to 6.99 MB, while Gzip compresses it to 834.8 MB.
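The two-stage idea in the abstract (first build a mapping from the reference to the target, then compress that mapping) can be illustrated with a toy sketch. This is not the authors' algorithm: the greedy k-mer matching, the `M`/`L` instruction format, the text serialization, and the use of `zlib` (DEFLATE, which includes Huffman coding) as a stand-in for the entropy coder are all assumptions made for illustration, and the sequences are synthetic.

```python
import zlib


def build_index(reference: str, k: int) -> dict:
    """Map every k-mer in the reference to the positions where it occurs."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index


def encode(reference: str, target: str, k: int = 4) -> list:
    """Greedily describe the target as matches against the reference.

    Emits ("M", ref_pos, length) for a copied run and ("L", base) for a
    literal base that could not be matched.
    """
    index = build_index(reference, k)
    ops = []
    t = 0
    while t < len(target):
        best_pos, best_len = -1, 0
        # Try every reference position that shares the next k-mer.
        for pos in index.get(target[t:t + k], []):
            length = 0
            while (pos + length < len(reference)
                   and t + length < len(target)
                   and reference[pos + length] == target[t + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = pos, length
        if best_len >= k:
            ops.append(("M", best_pos, best_len))
            t += best_len
        else:
            ops.append(("L", target[t]))
            t += 1
    return ops


def decode(reference: str, ops: list) -> str:
    """Rebuild the target from the reference and the mapping."""
    parts = []
    for op in ops:
        if op[0] == "M":
            _, pos, length = op
            parts.append(reference[pos:pos + length])
        else:
            parts.append(op[1])
    return "".join(parts)


def serialize(ops: list) -> bytes:
    """Flatten the mapping to bytes (a simple, hypothetical text format)."""
    return ";".join(",".join(str(f) for f in op) for op in ops).encode("ascii")


# Synthetic data: the target is the reference with one substitution
# and one five-base deletion.
reference = "ACGTACGTTAGCCGATAGGCTTAACGGATCCA" * 8
target = reference[:100] + "T" + reference[101:180] + reference[185:]

ops = encode(reference, target)
payload = zlib.compress(serialize(ops), 9)  # stand-in for the entropy coder

print(len(target), "bases ->", len(ops), "mapping ops ->", len(payload), "bytes")
```

Because the target differs from the reference in only a few places, the mapping is far shorter than the target itself, which is what makes the second compression stage effective. A decoder needs only the reference and the compressed mapping to recover the target exactly.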
