TL;DR
This paper presents a fast, memory-efficient on-line sampling method for large genomic data sets, making it possible to test analyses of DNA variants, such as estimating linkage disequilibrium (LD) decay, on desktop or laptop computers.
Contribution
It introduces a new implementation of an on-line sampling method originally developed for tape drives 30 years ago, applied here to genotype files, together with open-source tools for sampling loci and estimating LD decay.
Findings
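The sampling method performs well on both solid-state and spinning hard drives compared to a simple sampling scheme.
It enables quick estimation of genome-wide patterns of LD decay with distance.
Open-source software samples loci from several variant format files, and a separate program performs LD decay estimates.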
Abstract
Explosive growth in the amount of genomic data is matched by increasing power of consumer-grade computers. Even applications that require powerful servers can be quickly tested on desktop or laptop machines if we can generate representative samples from large data sets. I describe a fast and memory-efficient implementation of an on-line sampling method developed for tape drives 30 years ago. Focusing on genotype files, I test the performance of this technique on modern solid-state and spinning hard drives, and show that it performs well compared to a simple sampling scheme. I illustrate its utility by developing a method to quickly estimate genome-wide patterns of linkage disequilibrium (LD) decay with distance. I provide open-source software that samples loci from several variant format files, a separate program that performs LD decay estimates, and a C++ library that lets developers…
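The abstract does not name the sampling algorithm, so the following is only a minimal sketch of the general idea of on-line (one-pass, in-order) sampling, using plain selection sampling (Knuth's Algorithm S) as a stand-in. It is not the author's implementation. The function name `sample_ordered_indices` and its parameters are hypothetical. The sketch picks `k` locus indices out of `total` in increasing order, so a genotype file could be read front to back without holding it in memory.

```python
import random


def sample_ordered_indices(total, k, rng=None):
    """Pick k distinct indices from range(total), returned in increasing order.

    Selection sampling (Knuth's Algorithm S), used here as an illustrative
    stand-in for an on-line sampling scheme. Each record is visited once and
    kept with probability (still needed) / (records left to visit).
    """
    if not 0 <= k <= total:
        raise ValueError("need 0 <= k <= total")
    if rng is None:
        rng = random.Random()

    chosen = []
    for index in range(total):
        still_needed = k - len(chosen)
        if still_needed == 0:
            break  # sample is complete; no need to visit remaining records
        records_left = total - index
        # Keep this record with probability still_needed / records_left.
        # When the two are equal, every remaining record is kept.
        if rng.random() * records_left < still_needed:
            chosen.append(index)
    return chosen


if __name__ == "__main__":
    # Hypothetical example: choose 10 of 1,000,000 loci.
    print(sample_ordered_indices(1_000_000, 10, random.Random(42)))
```

This version tests every record, so its running time grows with the file size rather than the sample size. Methods designed for sequential media typically draw the number of records to skip instead, which is the kind of saving that matters when each skipped record is a seek rather than a read.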
