Integrating sequencing datasets to form highly confident SNP and indel genotype calls for a whole human genome
Justin M. Zook, Brad Chapman, Jason Wang, David Mittelman, Oliver Hofmann, Winston Hide, Marc Salit

TL;DR
This paper integrates multiple sequencing datasets to produce highly confident SNP, indel, and homozygous reference genotype calls for a whole human genome (NA12878), intended as a benchmark for clinical and research applications.
Contribution
It introduces methods for integrating and arbitrating between 14 datasets from 5 sequencing technologies, 7 mappers, and 3 variant callers, minimizing bias towards any single method, to generate a highly confident set of human genome genotypes for benchmarking purposes.
Findings
Generated a highly confident genotype set for NA12878
Identified regions where no confident genotype call could be made and classified them by reason for uncertainty
Provided publicly available benchmark data for the community
Abstract
Clinical adoption of human genome sequencing requires methods with known accuracy of genotype calls at millions or billions of positions across a genome. Previous work showing discordance amongst sequencing methods and algorithms has made clear the need for a highly accurate set of genotypes across a whole genome that could be used as a benchmark. We present methods to make highly confident SNP, indel, and homozygous reference genotype calls for NA12878, the pilot genome for the Genome in a Bottle Consortium. We minimize bias towards any method by integrating and arbitrating between 14 datasets from 5 sequencing technologies, 7 mappers, and 3 variant callers. Regions for which no confident genotype call could be made are identified as uncertain, and classified into different reasons for uncertainty. Our highly confident genotype calls are publicly available on the Genome Comparison and…
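To make the idea of arbitrating between datasets concrete, here is a minimal Python sketch of per-site consensus calling. It is not the paper's method: the abstract does not describe the arbitration rules, so the simple agreement-fraction vote, the function name `arbitrate`, the thresholds `min_fraction` and `min_datasets`, and the dataset names are all assumptions chosen for illustration.

```python
from collections import Counter


def arbitrate(calls, min_fraction=0.75, min_datasets=3):
    """Return (genotype, status) for one site given per-dataset calls.

    `calls` maps a dataset name to a genotype string such as "0/1", or to
    None when that dataset has no usable call at the site. The thresholds
    are hypothetical, not values taken from the paper.
    """
    usable = [gt for gt in calls.values() if gt is not None]
    if len(usable) < min_datasets:
        return None, "uncertain: too few datasets"

    # Pick the most common genotype and check how strongly it is supported.
    genotype, count = Counter(usable).most_common(1)[0]
    if count / len(usable) < min_fraction:
        return None, "uncertain: datasets disagree"
    return genotype, "confident"


# Hypothetical calls at a single site from four datasets.
site_calls = {
    "dataset_a": "0/1",
    "dataset_b": "0/1",
    "dataset_c": "0/1",
    "dataset_d": "1/1",
}
print(arbitrate(site_calls))
```

Sites that fail either check are returned with a reason string, which loosely mirrors the abstract's point that uncertain regions are classified by reason for uncertainty rather than silently dropped.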
Taxonomy
Topics: Genomics and Rare Diseases · Genetic Associations and Epidemiology · Biomedical Text Mining and Ontologies
