GeneZip: A software package for storage-efficient processing of genotype data
Cameron Palmer, Itsik Pe'er

TL;DR
GeneZip is a C/C++ library that losslessly compresses probabilistic genotype data as they are loaded into memory, keeping constant-time access to any SNP and reducing storage by more than 9-fold on average in test data for genome-wide association studies with imputed datasets.
Contribution
The paper introduces a dynamic compression method, based on a customization of the DEFLATE (gzip) algorithm, that compresses probabilistic genotype data as they are loaded into memory while maintaining constant-time access to any SNP.
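One way to picture compression that still allows constant-time access is to compress each SNP as its own block and keep the blocks in an indexed container. The following is a minimal sketch under that assumption only: it uses the stock DEFLATE implementation in Python's standard `zlib` module rather than GeneZip's customized algorithm, and the class and method names are hypothetical, not GeneZip's API.

```python
import struct
import zlib


class CompressedGenotypeStore:
    """Hypothetical illustration: one DEFLATE block per SNP, indexed by position."""

    def __init__(self):
        self._blocks = []  # compressed bytes for each SNP, in load order

    def add_snp(self, probabilities):
        """Compress one SNP's per-sample doubles as it is loaded; return its index."""
        raw = struct.pack(f"<{len(probabilities)}d", *probabilities)
        self._blocks.append(zlib.compress(raw, 6))
        return len(self._blocks) - 1

    def get_snp(self, index):
        """Decompress only the requested SNP; the lookup itself is a list index."""
        raw = zlib.decompress(self._blocks[index])
        return list(struct.unpack(f"<{len(raw) // 8}d", raw))

    def compressed_bytes(self):
        """Total size of all compressed blocks, in bytes."""
        return sum(len(block) for block in self._blocks)


store = CompressedGenotypeStore()
# Synthetic, highly repetitive values chosen only for illustration.
snp_a = [0.0, 1.0, 2.0, 0.0] * 250
snp_b = [0.0, 0.0, 1.0, 0.25] * 250
ia = store.add_snp(snp_a)
ib = store.add_snp(snp_b)

# Round trip is lossless: the doubles come back bit-for-bit.
assert store.get_snp(ib) == snp_b
```

Because each block is decompressed independently, the cost of reading one SNP depends on the number of samples, not on how many SNPs are stored. The ratio achieved on this synthetic input says nothing about the ratios reported for real imputed data.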
Findings
Achieves an average compression ratio of more than 9-fold on test data
Keeps genotype data lossless, with constant-time access to any SNP
Supports integration with existing statistical methods
Abstract
Genome-wide association studies directly assay 10^6 single nucleotide polymorphisms (SNPs) across a study cohort. Probabilistic estimation of additional sites by genotype imputation can increase this set of variants by 10- to 40-fold. Even with modest sample sizes (10^3-10^4), the resulting imputed datasets, containing 10^10-10^11 double-precision values, are incompatible with simultaneous lossless storage in RAM using standard methods. Existing solutions for this problem require compromises in either genotype accuracy or complexity of permissible statistical methods. Here, we present a C/C++ library that dynamically compresses probabilistic genotype data as they are loaded into memory. This method uses a customization of the DEFLATE (gzip) algorithm, and maintains constant-time access to any SNP. Average compression ratios of more than 9-fold are observed in test data.
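The memory figures implied by the abstract can be checked with a short calculation. This sketch assumes 8 bytes per double-precision value and, as a simplification, applies the reported 9-fold ratio uniformly to the whole dataset; the function name is illustrative.

```python
BYTES_PER_DOUBLE = 8  # IEEE 754 double precision


def uncompressed_gb(n_values):
    """Size in gigabytes (10^9 bytes) of n_values doubles held uncompressed."""
    return n_values * BYTES_PER_DOUBLE / 1e9


low = uncompressed_gb(10**10)
high = uncompressed_gb(10**11)

print(f"{low:.0f} GB to {high:.0f} GB uncompressed")
# prints "80 GB to 800 GB uncompressed"
print(f"{low / 9:.1f} GB to {high / 9:.1f} GB at 9-fold compression")
# prints "8.9 GB to 88.9 GB at 9-fold compression"
```

At 80 to 800 GB uncompressed, the dataset exceeds the RAM of most single machines, while a 9-fold reduction brings the lower end of the range within reach of a workstation.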
Taxonomy
Topics: Genetic Associations and Epidemiology · Gene expression and cancer classification · Genetic Mapping and Diversity in Plants and Animals
