Needles in the Haystack: Identifying Individuals Present in Pooled Genomic Data
Rosemary Braun, William Rowe, Carl Schaefer, Jinghui Zhang, and Kenneth Buetow

TL;DR
This paper critically evaluates a genetic distance metric used to identify individuals in pooled genomic data, revealing its limitations in specificity and exploring potential improvements and applications.
Contribution
The study provides a comprehensive analysis of the assumptions, limitations, and potential uses of a novel genetic distance metric for individual identification.
Findings
Low specificity in identifying individuals in samples
Misclassifications caused by assumption violations
Potential for future research in ancestry and disease prediction
Abstract
Recent publications have described and applied a novel metric that quantifies the genetic distance of an individual with respect to two population samples, and have suggested that the metric makes it possible to infer the presence of an individual of known genotype in a sample for which only the marginal allele frequencies are known. However, the assumptions, limitations, and utility of this metric remained incompletely characterized. Here we present an exploration of the strengths and limitations of that method. In addition to analytical investigations of the underlying assumptions, we use both real and simulated genotypes to test empirically the method's accuracy. The results reveal that, when used as a means by which to identify individuals as members of a population sample, the specificity is low in several circumstances. We find that the misclassifications stem from violations of…
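The abstract does not spell out the metric, so the following is only a minimal sketch of the idea under stated assumptions. It assumes the per-SNP distance takes the commonly cited form D_j = |y_j - ref_j| - |y_j - pool_j|, where y_j is the individual's allele fraction and ref_j and pool_j are the marginal allele frequencies of a reference sample and the pooled sample, and that the per-SNP distances are summarized by a one-sample t-like statistic. The function names, sample sizes, and simulation setup are hypothetical and are not taken from the paper. The simulation draws the pool, the reference, and the outsider from the same allele frequencies, which is the idealized case; it does not reproduce the assumption violations the paper investigates.

```python
import math
import random
import statistics


def distance_statistic(y, pool_freq, ref_freq):
    """t-like statistic over per-SNP distances (assumed form, see lead-in).

    D_j = |y_j - ref_j| - |y_j - pool_j|; positive values mean the
    individual sits closer to the pool than to the reference.
    """
    d = [abs(yj - r) - abs(yj - p) for yj, p, r in zip(y, pool_freq, ref_freq)]
    mean_d = statistics.fmean(d)
    sd_d = statistics.stdev(d)
    return mean_d / (sd_d / math.sqrt(len(d)))


def simulate(n_snps=10000, n_pool=50, n_ref=50, seed=0):
    """Compare a true pool member with an outsider (hypothetical setup)."""
    rng = random.Random(seed)
    # Hypothetical population allele frequencies, one per SNP.
    freqs = [rng.uniform(0.1, 0.9) for _ in range(n_snps)]

    def draw_person():
        # Genotype encoded as an allele fraction: 0, 0.5, or 1.
        return [(int(rng.random() < f) + int(rng.random() < f)) / 2 for f in freqs]

    pool = [draw_person() for _ in range(n_pool)]
    ref = [draw_person() for _ in range(n_ref)]
    outsider = draw_person()

    # Only the marginal allele frequencies of each sample are used below.
    pool_freq = [sum(col) / n_pool for col in zip(*pool)]
    ref_freq = [sum(col) / n_ref for col in zip(*ref)]

    t_member = distance_statistic(pool[0], pool_freq, ref_freq)
    t_outsider = distance_statistic(outsider, pool_freq, ref_freq)
    return t_member, t_outsider


if __name__ == "__main__":
    t_member, t_outsider = simulate()
    print(f"member: {t_member:.2f}  outsider: {t_outsider:.2f}")
```

In this idealized setting the pool member's statistic should be clearly positive while the outsider's should be near zero. The paper's concern is what happens when that setting does not hold, for example when the individual, the pool, and the reference are not drawn from the same underlying population.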
