Algorithms for Large-scale Whole Genome Association Analysis
Elmar Peise (1), Diego Fabregat (1), Yurii Aulchenko (2), Paolo Bientinesi (1) ((1) AICES, RWTH Aachen, (2) Institute of Cytology and Genetics, Novosibirsk)

TL;DR
This paper introduces algorithms for large-scale genome-wide association studies that stream genotype data from and to secondary storage and keep the covariance matrix in the combined main memory of a distributed-memory system.
Contribution
It combines double-buffered streaming of genotype data with a covariance matrix distributed across the nodes of a distributed-memory system, so that datasets exceeding the main memory of a single computer can be processed.
Findings
Enables analysis of datasets with millions of polymorphisms
Maintains high performance with distributed memory and streaming
Supports genome-wide association studies on large populations
Abstract
In order to associate complex traits with genetic polymorphisms, genome-wide association studies process huge datasets involving tens of thousands of individuals genotyped for millions of polymorphisms. When handling these datasets, which exceed the main memory of contemporary computers, one faces two distinct challenges: 1) Millions of polymorphisms come at the cost of hundreds of Gigabytes of genotype data, which can only be kept in secondary storage; 2) the relatedness of the test population is represented by a covariance matrix, which, for large populations, can only fit in the combined main memory of a distributed architecture. In this paper, we present solutions for both challenges: The genotype data is streamed from and to secondary storage using a double buffering technique, while the covariance matrix is kept across the main memory of a distributed memory system. We show that…
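The double buffering mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `stream_blocks`, the `block_size` parameter, and the `process` callback are hypothetical, and the sketch covers only the read side (the abstract also describes streaming results back to storage). The idea shown is that one buffer is being filled from storage while the other is being processed, so I/O and computation overlap.

```python
import io
from concurrent.futures import ThreadPoolExecutor


def stream_blocks(f, block_size, process):
    """Process a binary file block by block, prefetching the next block.

    Hypothetical helper for illustration only: `f` is any binary file-like
    object, `block_size` is the number of bytes per block, and `process`
    is a callback applied to each block.
    """
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start loading the first buffer in the background.
        pending = pool.submit(f.read, block_size)
        while True:
            current = pending.result()  # wait until the buffer is loaded
            if not current:
                break  # end of file
            # Start loading the second buffer before computing on the first,
            # so the read overlaps with the computation below.
            pending = pool.submit(f.read, block_size)
            results.append(process(current))
    return results


# Usage: ten bytes (values 0..9) in blocks of four, summing each block.
block_sums = stream_blocks(io.BytesIO(bytes(range(10))), 4, sum)
print(block_sums)  # [6, 22, 17]
```

Only two blocks are resident at any time, regardless of file size, which is what allows data larger than main memory to be handled; in practice the block size would be chosen so that reading a block takes no longer than processing one.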
