Adaptive reference-free compression of sequence quality scores
Lilian Janin, Giovanna Rosone, Anthony J. Cox

TL;DR
This paper introduces a reference-free method for compressing DNA sequencing quality scores by predicting and smoothing redundant scores, achieving significant compression with minimal impact on variant calling accuracy.
Contribution
It presents a novel, reference-free approach that leverages read redundancy to efficiently compress quality scores in genomic data.
Findings
Compresses quality scores to 1 bit per score with negligible impact on variant calling.
Aggressively smooths the quality scores of bases that can be predicted from adjacent bases, leaving a relatively small number at full resolution.
Needs no reference genome, so it applies to metagenomics and de novo data as well as resequencing.
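To make the "bits per quality score" figure concrete, the sketch below computes the zeroth-order empirical entropy of a quality string before and after smoothing. This is only a rough proxy for what a real compressor achieves and is not how the paper measures its results; the function name and the example strings are hypothetical.

```python
import math
from collections import Counter


def entropy_bits_per_symbol(quals: str) -> float:
    """Zeroth-order empirical entropy of a quality string, in bits per score."""
    if not quals:
        return 0.0
    counts = Counter(quals)
    n = len(quals)
    # Sum of p * log2(1/p) over the distinct quality characters.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


# Hypothetical quality strings: the smoothed one replaces most scores
# with a single value, so it has fewer distinct symbols.
raw = "IIHGIIFHIIGH"
smoothed = "IIIIIIFIIIII"

print(f"raw:      {entropy_bits_per_symbol(raw):.3f} bits/score")
print(f"smoothed: {entropy_bits_per_symbol(smoothed):.3f} bits/score")
```

The more scores are collapsed to one value, the closer the entropy gets to zero, which is why aggressive smoothing of uninformative positions pays off in compressed size.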
Abstract
Motivation: Rapid technological progress in DNA sequencing has stimulated interest in compressing the vast datasets that are now routinely produced. Relatively little attention has been paid to compressing the quality scores that are assigned to each sequence, even though these scores may be harder to compress than the sequences themselves. By aggregating a set of reads into a compressed index, we find that the majority of bases can be predicted from the sequence of bases that are adjacent to them and hence are likely to be less informative for variant calling or other applications. The quality scores for such bases are aggressively compressed, leaving a relatively small number at full resolution. Since our approach relies directly on redundancy present in the reads, it does not need a reference sequence and is therefore applicable to data from metagenomics and de novo experiments as…
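The idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the paper aggregates reads into a compressed index, whereas this example uses a plain dictionary of k-mer contexts as a stand-in. The function names and the parameters `k`, `min_support`, and `smooth_char` are assumptions made for illustration. A base counts as predictable here when the `k` bases before it are always followed by the same base across the read set.

```python
from collections import defaultdict


def build_context_counts(reads, k):
    """Count which base follows each length-k context across all reads."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in reads:
        for i in range(k, len(seq)):
            counts[seq[i - k:i]][seq[i]] += 1
    return counts


def smooth_qualities(reads, quals, k=3, min_support=2, smooth_char="I"):
    """Replace the quality of predictable bases with one fixed value.

    Assumption: a base is predictable if its preceding k-mer has exactly
    one observed successor, seen at least min_support times. The first k
    bases of each read have no full context and are left unchanged.
    """
    counts = build_context_counts(reads, k)
    smoothed = []
    for seq, q in zip(reads, quals):
        new_q = list(q)
        for i in range(k, len(seq)):
            successors = counts[seq[i - k:i]]
            if len(successors) == 1 and successors[seq[i]] >= min_support:
                new_q[i] = smooth_char
        smoothed.append("".join(new_q))
    return smoothed


# Hypothetical reads: the third differs from the others at position 4,
# so the context "CGT" has two successors and stays at full resolution.
reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
quals = ["ABCDEF", "FEDCBA", "ABCDEF"]
result = smooth_qualities(reads, quals)
print(result)
```

Positions where the reads disagree are exactly those most likely to matter for variant calling, so their scores are kept, while scores at unanimous positions are flattened to a single symbol that compresses well.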
Taxonomy
Topics: Genomics and Phylogenetic Studies · Algorithms and Data Compression · Gene expression and cancer classification
