CARGO: Effective format-free compressed storage of genomic information
Łukasz Roguski, Paolo Ribeca

TL;DR
CARGO is a flexible framework that automatically generates efficient compression tools for large genomic datasets, matching and sometimes outperforming specialized format-tailored compressors while scaling to multi-TB collections in a variety of formats.
Contribution
We introduce CARGO, a high-level framework that simplifies creating optimized, format-free compression tools for diverse genomic data collections.
Findings
CARGO matches or outperforms specialized compressors.
It scales effectively to multi-terabyte datasets.
It requires only a few lines of code to adapt to formats such as FASTQ and SAM.
Abstract
The recent super-exponential growth in the amount of sequencing data generated worldwide has brought techniques for compressed storage into focus. Most available solutions, however, are strictly tied to specific bioinformatics formats, sometimes inheriting suboptimal design choices from them; this hinders flexible and effective data sharing. Here we present CARGO (Compressed ARchiving for GenOmics), a high-level framework to automatically generate software systems optimized for the compressed storage of arbitrary types of large genomic data collections. Straightforward applications of our approach to FASTQ and SAM archives require a few lines of code, produce solutions that match and sometimes outperform specialized format-tailored compressors, and scale well to multi-TB datasets.
Taxonomy
Topics: Genomics and Phylogenetic Studies · Algorithms and Data Compression · Gene expression and cancer classification
