Adaptive reference-free compression of sequence quality scores

Lilian Janin; Giovanna Rosone; Anthony J. Cox

arXiv:1305.0159 · q-bio.GN · May 2, 2013 · 1 citation

Open Access

TL;DR

This paper introduces a reference-free method for compressing DNA sequencing quality scores: bases that can be predicted from their neighbouring sequence have their quality scores aggressively smoothed, which yields substantial compression with minimal impact on variant calling accuracy.

Contribution

It presents a novel, reference-free approach that leverages read redundancy to efficiently compress quality scores in genomic data.

Findings

01 Compresses quality scores to 1 bit per score with negligible impact on variant calling.

02 Aggressively smooths the quality scores of bases that are predictable from their neighbours, leaving a relatively small number at full resolution.

03 Needs no reference genome, so it also applies to data from metagenomics and de novo experiments.

Abstract

Motivation: Rapid technological progress in DNA sequencing has stimulated interest in compressing the vast datasets that are now routinely produced. Relatively little attention has been paid to compressing the quality scores that are assigned to each sequence, even though these scores may be harder to compress than the sequences themselves. By aggregating a set of reads into a compressed index, we find that the majority of bases can be predicted from the sequence of bases that are adjacent to them and hence are likely to be less informative for variant calling or other applications. The quality scores for such bases are aggressively compressed, leaving a relatively small number at full resolution. Since our approach relies directly on redundancy present in the reads, it does not need a reference sequence and is therefore applicable to data from metagenomics and de novo experiments as…
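The core idea in the abstract (smooth the quality scores of bases that are predictable from adjacent sequence) can be illustrated with a toy sketch. This is not the authors' implementation: a plain k-mer count table stands in for the compressed index, and the function name `smooth_qualities`, the parameters `k` and `min_count`, the use of only the preceding k bases as context, and the placeholder score `"I"` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def smooth_qualities(reads, quals, k=3, min_count=2, smoothed_char="I"):
    """Toy sketch: replace the quality score of every base that is
    predictable from the k bases preceding it (hypothetical parameters)."""
    # Pass 1: for each k-mer context, count which bases follow it
    # anywhere in the read set. This table is a stand-in for the
    # compressed index mentioned in the abstract.
    followers = defaultdict(Counter)
    for read in reads:
        for i in range(k, len(read)):
            followers[read[i - k:i]][read[i]] += 1

    # Pass 2: a base counts as "predictable" when its context was
    # always followed by the same base, at least min_count times.
    smoothed = []
    for read, qual in zip(reads, quals):
        out = list(qual)
        for i in range(k, len(read)):
            counts = followers[read[i - k:i]]
            if len(counts) == 1 and counts[read[i]] >= min_count:
                out[i] = smoothed_char  # discard the original score
        smoothed.append("".join(out))
    return smoothed
```

Under these assumptions, scores at positions where the reads disagree (candidate variants or errors) keep their full resolution, while the remaining scores collapse to a single repeated symbol that a general-purpose compressor can encode cheaply.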

Taxonomy

Topics: Genomics and Phylogenetic Studies · Algorithms and Data Compression · Gene expression and cancer classification