Decoding a Million Genomes: Unveiling the Protein-coding Landscape and Its Implications for Precision Medicine
Jinwei Zhang

TL;DR
This paper summarizes a large-scale study of protein-coding genetic variation in over 900,000 individuals and discusses its impact on precision medicine.
Contribution
The study provides new insights into rare genetic variants and their roles in gene splicing and disease, advancing precision medicine.
Findings
Rare biallelic variants were identified, shedding light on gene function and disease mechanisms.
Loss-of-function intolerant genes were highlighted, offering clues about essential biological processes.
The findings suggest future research on non-coding DNA and regulatory RNAs in large populations.
Abstract
The study by Sun et al., which sequenced exomes from 983,578 individuals, provides a comprehensive resource on protein-coding genetic variation. This commentary examines the key findings, including rare biallelic variants and loss-of-function intolerant genes, and emphasizes their implications for gene splicing, human knockouts, and disease-associated genes. We also discuss how these insights advance precision medicine and suggest future research directions, particularly the study of non-coding DNA and regulatory RNAs at population scale.
Figure 1
1. INTRODUCTION
The human genome is a reservoir of genetic information shaped by millions of years of evolution. High-throughput sequencing technologies have revolutionized the ability to catalog and interpret genetic variation, particularly in the context of health and disease [1]. The study by Sun et al. represents a significant leap forward, analyzing exomes from 983,578 individuals to uncover a detailed landscape of predicted loss-of-function (pLOF) variants [2]. By identifying rare variants across diverse populations, this dataset establishes a crucial foundation for future research and clinical applications in precision medicine.
2. MAIN FINDINGS
Sun et al. used exome sequencing to analyze genetic data from a large and diverse cohort, 23% of whom were of non-European ancestry (Fig. 1A). The dataset, accessible through the RGC Million Exome Browser (https://rgc-research.regeneron.com/me/home), supports variant interpretation and allele frequency analysis (Fig. 1B). Key findings include the identification of over 10.4 million missense variants and 1.1 million pLOF variants (Fig. 1C-D) [2], revealing a wealth of previously unexplored genetic variation. Rare biallelic pLOF variants were detected in 4,848 genes, 1,751 of which were newly reported. Additionally, 3,988 genes were found to be highly intolerant to loss-of-function mutations, challenging prior assumptions about gene tolerance. Regions of missense depletion, identified in 1,482 genes (Fig. 1E), highlight areas where even single amino acid changes are highly deleterious. The study also sheds light on 11,773 cryptic splice sites previously categorized as variants of unknown significance in the ClinVar database (Fig. 1F), emphasizing their potential to disrupt gene splicing. These findings underscore the role of mutational constraint in shaping the human genetic landscape and demonstrate how such data can guide precision medicine [1, 3].
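To make the idea of loss-of-function intolerance concrete, the sketch below computes an observed/expected (o/e) ratio of pLOF variants per gene, the general style of constraint measure popularized by gnomAD [3]. It is an illustration only and is not necessarily the metric used by Sun et al.; the gene names, counts, and the 0.35 cutoff are hypothetical values chosen for the example.

```python
def oe_ratio(observed: int, expected: float) -> float:
    """Observed/expected ratio of pLOF variants for one gene.

    A ratio well below 1 means far fewer pLOF variants were seen than a
    neutral mutation model predicts, which suggests the gene is intolerant
    to loss of function.
    """
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return observed / expected


# Hypothetical genes: (observed pLOF count, expected pLOF count).
genes = {
    "GENE_A": (2, 40.0),   # strongly depleted of pLOF variants
    "GENE_B": (35, 38.0),  # close to neutral expectation
}

# Illustrative cutoff only; real analyses use confidence bounds, not a raw ratio.
CUTOFF = 0.35

ratios = {name: oe_ratio(obs, exp) for name, (obs, exp) in genes.items()}
intolerant = [name for name, ratio in ratios.items() if ratio < CUTOFF]

for name, ratio in ratios.items():
    print(f"{name}: o/e = {ratio:.2f}")
print("flagged as intolerant:", intolerant)
```

In practice the expected count comes from a calibrated mutation model and the decision is based on an upper confidence bound rather than the point estimate, so larger cohorts tighten the bound and allow more genes to be classified.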
3. PROTEIN-CODING VARIATION AND PRECISION MEDICINE
3.1. Implications for Precision Medicine
The findings from Sun et al.’s study offer valuable insights for advancing precision medicine. The dataset not only identifies clinically actionable genetic variants in approximately 3% of individuals but also highlights the significant potential for improving diagnostic and therapeutic strategies. By uncovering the prevalence and impact of these variants, the study provides a robust framework for integrating genetic data into clinical practice.
3.2. Systems-level Understanding of Disease
This research enhances the systems-level understanding of human disease by exploring the functional impact of genetic variation. The analysis of 4,848 genes with rare biallelic pLOF variants provides natural models for studying gene function and its implications in health and disease. Insights into gene splicing mechanisms are strengthened by the identification of cryptic splice sites, enhancing the precision of genetic testing and variant interpretation. Furthermore, the discovery of loss-of-function intolerant genes offers a foundation for prioritizing potential therapeutic targets, enabling the development of interventions that minimize off-target effects.
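The reason a cohort of this size is needed to find human knockouts can be sketched with a simple Hardy-Weinberg calculation. If the combined frequency of pLOF alleles in a gene is q, then under random mating roughly q squared of individuals carry two pLOF alleles. The function name and the value of q below are assumptions for illustration, and the model ignores consanguinity and population structure, both of which raise the real number of biallelic carriers.

```python
def expected_biallelic(n_individuals: int, q: float) -> float:
    """Expected number of individuals carrying two pLOF alleles in a gene.

    Assumes Hardy-Weinberg proportions: the fraction of biallelic
    (homozygous or compound heterozygous) individuals is about q**2,
    where q is the combined pLOF allele frequency in the gene.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be a frequency between 0 and 1")
    return n_individuals * q ** 2


# Hypothetical combined pLOF allele frequency for one gene.
q = 0.001

# Cohort sizes quoted in this commentary.
cohorts = {"ExAC": 60_706, "gnomAD": 141_456, "Sun et al.": 983_578}

for name, n in cohorts.items():
    print(f"{name}: expected biallelic carriers = {expected_biallelic(n, q):.2f}")
```

Because the expectation scales with q squared, a gene whose pLOF alleles are ten times rarer needs a cohort one hundred times larger to yield the same number of expected knockouts.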
3.3. Expanding to Non-coding Regions
Beyond protein-coding regions, future efforts should focus on integrating non-coding DNA and regulatory RNAs into this dataset. Such research could uncover additional layers of genetic regulation and provide critical insights into the etiology of diseases influenced by non-coding variants. Combining this exome-focused dataset with transcriptomics and epigenomics can reveal the interplay between coding and regulatory elements, advancing our understanding of complex disorders.
3.4. Applications in Clinical Practice
The dataset has broad applications in clinical practice, from enabling more accurate genetic counseling by providing detailed allele frequency data across diverse populations to identifying new drug targets through the overrepresentation of pLOF variants in metabolic pathways. Improved annotations of cryptic splice variants and rare pLOFs further enhance the precision of genetic diagnostics, particularly for conditions that were previously unexplained.
4. COMPARISONS WITH RELEVANT STUDIES
Sun et al.’s study builds upon previous work, such as the ExAC and gnomAD projects, which provided foundational knowledge on genetic variation in smaller cohorts [3, 4]. Compared to the ExAC study that analyzed 60,706 individuals and the gnomAD project with 141,456 participants, the present study’s scope is significantly larger, encompassing nearly a million individuals. This expansion allows for more accurate estimates of allele frequencies, particularly for rare variants, and offers greater statistical power to detect significant associations. The research also aligns with findings from the UK Biobank and TOPMed programs [5-7], which emphasized the importance of diverse genetic datasets. The inclusion of non-European ancestries in Sun et al.’s study is a crucial advancement, providing insights into genetic variation across different populations and enhancing the relevance of findings to global health.
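The gain in power for rare variants can be illustrated with a short calculation. Assuming each of the 2N chromosomes in a cohort of N unrelated diploid individuals carries a given allele independently with frequency f, the probability of observing the allele at least once is 1 - (1 - f)^(2N). The allele frequency used below is a hypothetical example, and the independence assumption ignores relatedness and ancestry-specific frequencies.

```python
import math


def prob_observed(n_individuals: int, allele_freq: float) -> float:
    """Probability that an allele appears at least once in a cohort.

    Models the 2N sampled chromosomes as independent draws, so the result
    is 1 - (1 - f)**(2N). log1p and expm1 keep the arithmetic accurate
    when f is very small.
    """
    if not 0.0 <= allele_freq < 1.0:
        raise ValueError("allele_freq must be in [0, 1)")
    n_chromosomes = 2 * n_individuals
    return -math.expm1(n_chromosomes * math.log1p(-allele_freq))


# Hypothetical allele present on one in a million chromosomes.
f = 1e-6

# Cohort sizes quoted in this commentary.
cohorts = {"ExAC": 60_706, "gnomAD": 141_456, "Sun et al.": 983_578}

for name, n in cohorts.items():
    print(f"{name}: P(observed at least once) = {prob_observed(n, f):.3f}")
```

Under these assumptions the chance of seeing such an allele at all rises from roughly 11% at the ExAC sample size to roughly 86% at the sample size of Sun et al., which is why allele frequency estimates for very rare variants improve so markedly.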
5. FUTURE PERSPECTIVES
The dataset generated by Sun et al. provides an unparalleled foundation for advancing precision medicine and understanding genetic variation. Integrating this resource with multi-omics data, including transcriptomics, proteomics, and metabolomics, should be a top priority for future studies, so that the biological processes connecting genetic variants to health and illness can be identified (Fig. 1G). Technological innovations in sequencing and bioinformatics are essential for improving variant detection, and tools such as CRISPR are essential for functional validation. Detailed analyses of underrepresented populations will address disparities and reveal population-specific variants. Longitudinal studies are crucial for assessing the clinical relevance of variants over time, while effective clinical translation requires harmonized data workflows and robust computational models. Additionally, exploring non-coding DNA and regulatory RNAs at population scale will provide new insights into gene regulation and disease processes. Expanding international collaborations and engaging the broader scientific community will accelerate the realization of precision medicine and improve global health equity.
CONCLUSION
Sun et al.’s work marks a pivotal advancement in human genomics, providing an extensive catalog of protein-coding variation that will inform future research and improve the clinical management of genetic diseases. This resource underscores the importance of continued efforts to bridge the gap between research and clinical practice, ensuring that precision medicine delivers equitable and tangible benefits across all populations.
AUTHORS’ CONTRIBUTIONS
JZ wrote the initial version of the manuscript and revised and edited it. The author read and approved the final manuscript.
REFERENCES
- 1. Benton M.L. Abraham A. La Bella A.L. Abbot P. Rokas A. Capra J.A. The influence of evolutionary history on human health and disease. Nat. Rev. Genet. 2021;22(5):269-283. doi:10.1038/s41576-020-00305-9. PMID: 33408383. PMCID: PMC7787134.
- 2. Sun K.Y. Bai X. Chen S. Bao S. Zhang C. Kapoor M. Backman J. Joseph T. Maxwell E. Mitra G. Gorovits A. Mansfield A. Boutkov B. Gokhale S. Habegger L. Marcketta A. Locke A.E. Ganel L. Hawes A. Kessler M.D. Sharma D. Staples J. Bovijn J. Gelfman S. Gioia D.A. Rajagopal V.M. Lopez A. Varela J.R. Díaz A.J. Berumen J. Conyer T.R. Morales K.P. Torres J. Emberson J. Collins R. Abecasis G. Coppola G. Deubler A. Economides A. Ferrando A. Lotta L.A. Shuldiner A. Siminovitch K. Beechert C. Brian E.D. Cremona L.M. Du H. Forsythe C. Gu Z. Guevara K. Lattari
- 3. Karczewski K.J. Francioli L.C. Tiao G. Cummings B.B. Alföldi J. Wang Q. Collins R.L. Laricchia K.M. Ganna A. Birnbaum D.P. Gauthier L.D. Brand H. Solomonson M. Watts N.A. Rhodes D. Berk S.M. England E.M. Seaby E.G. Kosmicki J.A. Walters R.K. Tashman K. Farjoun Y. Banks E. Poterba T. Wang A. Seed C. Whiffin N. Chong J.X. Samocha K.E. Hoffman P.E. Zappala Z. O’Donnell-Luria A.H. Minikel E.V. Weisburd B. Lek M. Ware J.S. Vittal C. Armean I.M. Bergelson L. Cibulskis K. Connolly K.M. Covarrubias M. Donnelly S. Ferriera S. Gabriel S. Gentry J. Gu
- 4. Lek M. Karczewski K.J. Minikel E.V. Samocha K.E. Banks E. Fennell T. O’Donnell-Luria A.H. Ware J.S. Hill A.J. Cummings B.B. Tukiainen T. Birnbaum D.P. Kosmicki J.A. Duncan L.E. Estrada K. Zhao F. Zou J. Hoffman P.E. Berghout J. Cooper D.N. Deflaux N. De Pristo M. Do R. Flannick J. Fromer M. Gauthier L. Goldstein J. Gupta N. Howrigan D. Kiezun A. Kurki M.I. Moonshine A.L. Natarajan P. Orozco L. Peloso G.M. Poplin R. Rivas M.A. Rubio R.V. Rose S.A. Ruderfer D.M. Shakir K. Stenson P.D. Stevens C. Thomas B.P. Tiao G. Luna T.M.T. Weisburd B. Won H.
- 5. Backman J.D. Li A.H. Marcketta A. Sun D. Mbatchou J. Kessler M.D. Benner C. Liu D. Locke A.E. Balasubramanian S. Yadav A. Banerjee N. Gillies C.E. Damask A. Liu S. Bai X. Hawes A. Maxwell E. Gurski L. Watanabe K. Kosmicki J.A. Rajagopal V. Mighty J. Jones M. Mitnaul L. Stahl E. Coppola G. Jorgenson E. Habegger L. Salerno W.J. Shuldiner A.R. Lotta L.A. Overton J.D. Cantor M.N. Reid J.G. Yancopoulos G. Kang H.M. Marchini J. Baras A. Abecasis G.R. Ferreira M.A.R. Exome sequencing and analysis of 454,787 UK Biobank participants. Nature 20215
- 6. Taliun D. Harris D.N. Kessler M.D. Carlson J. Szpiech Z.A. Torres R. Taliun S.A.G. Corvelo A. Gogarten S.M. Kang H.M. Pitsillides A.N. Le Faive J. Lee S. Tian X. Browning B.L. Das S. Emde A.K. Clarke W.E. Loesch D.P. Shetty A.C. Blackwell T.W. Smith A.V. Wong Q. Liu X. Conomos M.P. Bobo D.M. Aguet F. Albert C. Alonso A. Ardlie K.G. Arking D.E. Aslibekyan S. Auer P.L. Barnard J. Barr R.G. Barwick L. Becker L.C. Beer R.L. Benjamin E.J. Bielak L.F. Blangero J. Boehnke M. Bowden D.W. Brody J.A. Burchard E.G. Cade B.E. Casella J.F. Chalazan B. Chas
- 7. Hout V.C.V. Tachmazidou I. Backman J.D. Hoffman J.D. Liu D. Pandey A.K. Jauregui G.C. Khalid S. Ye B. Banerjee N. Li A.H. O’Dushlaine C. Marcketta A. Staples J. Schurmann C. Hawes A. Maxwell E. Barnard L. Lopez A. Penn J. Habegger L. Blumenfeld A.L. Bai X. O’Keeffe S. Yadav A. Praveen K. Jones M. Salerno W.J. Chung W.K. Surakka I. Willer C.J. Hveem K. Leader J.B. Carey D.J. Ledbetter D.H. Cardon L. Yancopoulos G.D. Economides A. Coppola G. Shuldiner A.R. Balasubramanian S. Cantor M. Nelson M.R. Whittaker J. Reid J.G. Marchini J. Overton J.D.
