Second-generation PLINK: rising to the challenge of larger and richer datasets
Christopher C. Chang, Carson C. Chow, Laurent C.A.M. Tellier, Shashaank Vattikuti, Shaun M. Purcell, James J. Lee

TL;DR
The paper introduces PLINK 1.9, a significantly faster and more scalable version of the GWAS toolset, capable of handling larger, richer datasets with new data formats and algorithmic improvements.
Contribution
Development of PLINK 1.9 with extensive algorithmic enhancements and a new data format, enabling faster analysis of large, complex genetic datasets.
Findings
Most operations run 1-4 orders of magnitude faster than in PLINK 1
Handles datasets too large to fit in RAM
Extended data format represents probabilistic calls, phase information, and multiallelic variants
Abstract
PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for even faster and more scalable implementations of key functions. In addition, GWAS and population-genetic data now frequently contain probabilistic calls, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format. To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(sqrt(n))-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by…
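As a rough illustration of the bit-level parallelism mentioned in the abstract, the sketch below packs 32 genotypes into one 64-bit word and counts alternate alleles with two masked population counts instead of a per-genotype loop. It is not PLINK's code: the 2-bit encoding (each field stores the allele count 0, 1, or 2, with no missing-data code) and the names `pack_genotypes`, `popcount`, and `alt_allele_count` are assumptions made for this example, and PLINK's actual on-disk layout differs.

```python
# Hypothetical 2-bit encoding, chosen for illustration only (not PLINK's
# .bed layout): each genotype is stored as its alternate-allele count,
# 0b00, 0b01, or 0b10, and 32 genotypes share one 64-bit word.
GENOTYPES_PER_WORD = 32
LOW_BITS = 0x5555555555555555  # selects the low bit of every 2-bit field


def pack_genotypes(genotypes):
    """Pack up to 32 allele counts (0, 1, or 2) into one 64-bit integer."""
    if len(genotypes) > GENOTYPES_PER_WORD:
        raise ValueError("at most 32 genotypes fit in one 64-bit word")
    word = 0
    for i, g in enumerate(genotypes):
        if g not in (0, 1, 2):
            raise ValueError("genotype must be 0, 1, or 2")
        word |= g << (2 * i)
    return word


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def alt_allele_count(word):
    """Sum the allele counts of all 32 packed genotypes at once.

    The low bit of each field contributes 1 and the high bit contributes 2,
    so two masked popcounts replace 32 separate field extractions.
    """
    low = word & LOW_BITS
    high = (word >> 1) & LOW_BITS
    return popcount(low) + 2 * popcount(high)


genotypes = [0, 1, 2, 2, 1, 0, 0, 2]
word = pack_genotypes(genotypes)
total = alt_allele_count(word)  # equals sum(genotypes), which is 8
```

In a compiled language the same idea maps onto hardware popcount instructions over machine words, which is where most of the practical speedup would come from; the Python version only shows the logic.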
