Lossy Compression of Quality Values via Rate Distortion Theory
Himanshu Asnani, Dinesh Bharadia, Mainak Chowdhury, Idoia Ochoa, Itai Sharon, Tsachy Weissman

TL;DR
This paper introduces a theoretical framework for lossy compression of genomic quality values, achieving significant data reduction with minimal impact on downstream analysis, including regimes below one bit per quality value.
Contribution
It presents a novel rate distortion theory-based approach for lossy compression of quality values in genomic data, enabling ultra-low bit rate compression regimes not previously possible.
Findings
Significant compression with minimal impact on downstream tasks
Theoretical analysis supports practical effectiveness
Achieves compression below one bit per quality value
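As a rough illustration of why rates below one bit per value are attainable in a rate distortion framework, consider a textbook case. The source model and distortion measure here are assumptions made for illustration only; this summary does not state which model the paper uses. For a memoryless Gaussian source with variance $\sigma^2$ under mean squared error distortion $D$, the rate distortion function is:

```latex
R(D) =
\begin{cases}
  \dfrac{1}{2}\log_2\dfrac{\sigma^2}{D}, & 0 < D \le \sigma^2, \\[6pt]
  0, & D > \sigma^2,
\end{cases}
\qquad\text{so}\qquad
R(D) < 1 \text{ bit} \iff D > \frac{\sigma^2}{4}.
```

In this idealized setting, tolerating a distortion larger than a quarter of the source variance already brings the required rate under one bit per symbol.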
Abstract
Motivation: Next Generation Sequencing technologies have revolutionized many fields in biology by enabling the fast and cheap sequencing of large amounts of genomic data. The ever-increasing sequencing capacities of current sequencing machines hold a lot of promise for future applications of these technologies, but also create increasing computational challenges related to the analysis and storage of these data. A typical sequencing data file may occupy tens or even hundreds of gigabytes of disk space, prohibitively large for many users. Raw sequencing data consists of both the DNA sequences (reads) and per-base quality values that indicate the level of confidence in the readout of these sequences. Quality values account for about half of the required disk space in the commonly used FASTQ format and therefore their compression can significantly reduce storage requirements and…
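To make the notion of per-base quality values concrete, here is a minimal Python sketch that decodes a FASTQ quality string into Phred scores and the corresponding error probabilities. It assumes the common Phred+33 ASCII encoding, which the abstract does not specify. The function name `phred_to_error_prob` and the sample string are hypothetical, chosen for illustration; this is not the paper's method.

```python
def phred_to_error_prob(quality_string, offset=33):
    """Decode a FASTQ quality string into (Q, P_error) pairs.

    Assumes Phred+offset ASCII encoding (offset=33 is an assumption here).
    A Phred score Q corresponds to an error probability P = 10 ** (-Q / 10).
    """
    decoded = []
    for ch in quality_string:
        q = ord(ch) - offset
        if q < 0:
            raise ValueError(f"character {ch!r} is below the encoding offset")
        decoded.append((q, 10 ** (-q / 10)))
    return decoded


if __name__ == "__main__":
    # Hypothetical quality string for a five-base read.
    for q, p in phred_to_error_prob("II5+!"):
        print(f"Q={q:2d}  P(error)={p:.4g}")
```

Each base carries one such value, which is why the quality line of a FASTQ record is as long as the read itself and contributes a comparable share of the file size.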
Taxonomy
Topics: Algorithms and Data Compression · Genomics and Phylogenetic Studies · Advanced Data Storage Technologies
