Expanding the Genetic Landscape of Craniofacial Anomalies Through Transcriptome-Wide Association Studies
Elly Brokamp, Alexandra Scalici, Tyne Miller-Fleming, David Wu, Wendy K. Chung, Monica H. Wojcik, Nancy J. Cox, Megan M. Shuey

TL;DR
This study expands the genetic understanding of craniofacial anomalies by identifying new genes linked to these conditions using transcriptome-wide association studies.
Contribution
The study introduces a novel approach to uncover new genes associated with craniofacial anomalies using transcriptome-wide association methods.
Findings
12 known craniofacial anomaly genes were associated with these anomalies in BioVU, and 18 in eMERGE.
53 new genes not previously linked to craniofacial anomalies were identified.
Many known craniofacial genes were more strongly associated with other congenital anomalies like heart and nervous system defects.
Abstract
Craniofacial anomalies are common congenital anomalies that significantly contribute to infant mortality and life-long health problems. Studies of craniofacial anomalies have identified several genetic causes, but focus on rare, Mendelian presentations. Despite this, current diagnostic genetic testing only identifies a causal genomic variant in ~ 25% of affected individuals. This low diagnostic yield for Mendelian conditions may relate to oligogenic and polygenic risks for craniofacial anomalies. In this study we sought to use large electronic health record systems including many patients with craniofacial anomalies to determine whether we could identify patterns of genetic associations with craniofacial anomalies and known associated genes. We performed transcriptome-wide association studies that evaluated the association between genetically predicted gene expression and craniofacial…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5- —Vanderbilt University Medical Center Synthetic Derivative
- —CTSA grant from National Center for Advancing Translational Sciences/National Institutes of Health
- —NHGRI (Group Health Cooperative/University of Washington)
- —NHGRI (Bringham and Women’s Hospital)
- —NHGRI (Vanderbilt University Medical Center)
- —NHGRI (Cincinnati Children’s Hospital Medical Center)
- —NHGRI (Mayo Clinic)
- —NHGRI (Geisinger Clinic)
- —NHGRI (Columbia University Health Sciences)
- —NHGRI (Children’s Hospital of Philadelphia)
- —NHGRI (Northwestern University)
- —NHGRI (Vanderbilt University Medical Center serving as the Coordinating Center)
- —NHGRI (Partners Healthcare/Broad Institute)
- —NHGRI (Baylor College of Medicine)
- —NHGRI (Meharry Medical College)
- —NIHNCATS
- —CTTSA TL1
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsCraniofacial Disorders and Treatments · Cleft Lip and Palate Research · Folate and B Vitamins Research
Background
Craniofacial anomalies (CFAs) are a common group of congenital anomalies (CAs) caused by the abnormal development of skull and/or facial bones. The most common CFAs, cleft lip with or without cleft palate (CL/P) and cleft palate alone, make up one third of all CAs in the United States. They occur in ~ 16 in 10,00 births within the United States and are a major contributor to infant mortality.^1^ Likewise, craniosynostosis, another common CFA, is estimated to affect 1 in 2100–2500 births and can result in abnormal brain growth causing neurological dysfunction.^2^ The frequency, contribution to infant mortality, and life-long associated health problems associated with CFAs makes understanding of genetic underpinnings of the risk and presentations of CFAs essential.
Despite a great deal of research into the genetic etiology of CFAs, clinical genetic testing for CFAs has a relatively low diagnostic yield. There is little understanding of what drives the variable expression and incomplete penetrance of CFA syndromes. Combined diagnostic approaches of karyotype, chromosomal microarray, and exome sequencing can detect the genetic cause of 22.5% of individuals with orofacial clefts.^3,4^ Similarly, comprehensive clinical diagnostic genetic testing of individuals with craniosynostosis can detect a genetic cause in about a quarter (25%) of affected individuals.^5^ These results suggest that, at best, for every four patients with a CFA who access comprehensive genetic testing, one will receive a diagnostic results for the genetic cause of their condition. Even when a diagnostic test identifies a causal variant in a Mendelian gene, there is often incomplete penetrance and/or variable expressivity, suggesting the presence of complex genomic and environmental interactions that impact phenotypic manifestations. As sequencing technologies have advanced and sample sizes have increased, several studies have examined how polygenic variation impacts the penetrance of rare variants associated with CFAs such as cleft lip and palate.^6,7^
To better understand how polygenicity impacts CFAs, large sample sizes are essential. Resources such as Vanderbilt University Medical Center’s BioVU and the Electronic Medical Records and Genomics (eMERGE) network cohort, which link genotype data to electronic health record (EHR) data, could contribute to these efforts.^8,9^ We therefore undertook the first use of these resources to evaluate the genetically predicted changes in gene expression of CFA-associated Mendelian genes as well as the genetically predicted gene expression of individuals with CFAs.
Methods
Identification of Individuals with Craniofacial and Other Congenital Anomalies
We used the phecodeX version of phecodes, systematic groupings of International Classifiers of Disease (ICD) billing codes, which are part of the CM (congenital malformation) chapter of phecodes to identify individuals with CAs.^10^ The Tongue, Mouth, and Pharynx parent code (CM_754) and the Skull, Face, and Jaw parent code (CM_755) within the CM chapter were used to identify individuals with CFAs. Figure 1A demonstrates how ICD9 and ICD10 billing codes are collapsed into one phecode. All available billing codes from an individuals’ entire medical history are used in the construction of phecodes. We mapped the CM phecodes from the ICD code data using the PheWAS R package.^11^
Identification in Vanderbilt University Medical Center
Vanderbilt University Medical Center’s EHR-system provides a unique resource with a wealth of deidentified health information in the synthetic derivative (SD). The SD contains ~ 3.5 million individuals, with ~ 300k individuals having genetic information in BioVU, VUMC’s deidentified EHR-linked biobank. 6,131 Individuals with CAs and CFAs were identified from Vanderbilt University Medical Center (VUMC)’s de-identified EHR-linked DNA biobank, BioVU.^8,12^ This work is deemed non-human subjects work by the VUMC IRB and received all necessary approvals. We restricted our cases and controls used in analyses to a medical home population (n = 1,275,576), a high medical use population defined as individuals with an ICD-9 or 10 billing code collected from at least three unique visit dates over three or more years. This medical home definition ensures a study population with substantial phenotypic information within their EHR. All genetic data for included participants in BioVU were genotyped using the Illumina MEGAEX Array and the included population was restricted to participants of European genetic ancestry based on clustering in principal component analysis (PCA) using genetic data from the 1000 Genomes Project as reference populations, as previously described.^13^ This restriction was done to minimize confounding results and maximize sample size. Future work to include individuals of non-European genetic ancestry is ongoing.
Identification in eMERGE
We identified individuals with CFAs from the Electronic Medical Records and Genomics (eMERGE) Network, which combines DNA biorepositories with EHR data.^9,14^ The eMERGE GWAS cohort includes 64,536 individuals from five institutions (VUMC, Columbia University Irving Medical Center, Northwestern Medical Center, Mass General Brigham, and Cincinnati Children’s Hospital Medical Center (CCHMC)). Analyses were restricted to participants of European genetic ancestry from CCHMC in the eMERGE GWAS cohort. Because 75% of CFA cases in eMERGE were from CCHMC we restricted our secondary analysis to this population. ICD billing codes for each individual are provided as part of available data to eMERGE consortium researchers. We mapped the CM phecodes from the ICD code data using the PheWAS R package as previously described.^11^
Identifying Genes with Known CFA Associations
We curated a list of genes associated with CFAs, using previously published genome-wide association study (GWAS) results and the Concert Genetics’ registry of clinical genetic tests.^15,16^ This testing registry aggregates all clinically available genetic panels, lists all the genes included in each panel, and allows for comparison of included genes between different companies’ panels. These panels reflect a largely comprehensive list of genes with a known monogenic CFA association. A certified genetic counselor searched Concert Genetics’ Test Registry using the search term “craniofacial” and compiled a list of all the unique genes that are offered on the resulting Craniofacial Panel Tests. These panels included both syndromic and non-syndromic CFA genes. Supplemental Table 1 lists curated known CFA-associated genes.
Defining CFA cases and controls
To identify individuals in BioVU and eMERGE that have a CFA, we used phecodes from the “Tongue Mouth and Pharynx” and the “Skull Face and Jaw” phecodeX Chaps. 1^0^ We identified individuals that had at least two instances of a CFA phecode of the same parent code and/or family head code Supplemental Table 2. Figure 1B demonstrates the breakdown of the parent code, family head code, and specific code portions of a phecode. We defined controls as individuals who did not have a single congenital anomaly phecode in their record.
Defining CA cases and controls
To define whether individuals in BioVU and eMERGE had CAs we used phecodes from the congenital malformations phecode X Chap. 1^0^ For our cases, we identified individuals with at least two instances of a specific CA phecode (Fig. 1B, Supplemental Table 3). Controls were defined as individuals with no CA phecodes in their record.
Transcriptome Wide Association Study of Individuals with CFAs
Using machine learning models such as PrediXcan, UTMOST, and Joint Tissue Imputation (JTI), we calculated genetically predicted gene expression (GPGE) using GTEx (version 8) as a reference population in both BioVU and eMERGE.^17,18,19,20^ Using the single best performing model out of these three models for each gene tissue pair (r^2^ > 0.01), we conducted a transcriptome-wide association study (TWAS). We tested the association of CFA status with GPGE in a logistic regression model adjusting for age, sex, number of visits, and the first ten principal components of ancestry.
Association of Craniofacial Anomaly Genes with Other Congenital Anomalies
To determine how variation in GPGE of known CFA genes may increase risk for other CAs, we conducted gene-based phenotype-wide association studies (PheWAS) of the known CFA genes.^21,22,23^ Of the 391 known CFA genes, we had quality prediction for 341 (r^2^ > 0.1). We tested whether a diagnosis of a CA is associated with GPGE of a known CFA gene adjusting for age, sex, the first ten principal components of ancestry, and number of visits in a logistic regression mode.
Results
Identification of individuals with craniofacial and other congenital anomalies within the EHR
Within the entire VUMC medical home population, including individuals without genotype data, there are 19,509 individuals with a CFA. About a quarter of these individuals, 4,051 (21.1%) have a second CA in another organ system (Fig. 1C). At VUMC, there are 694 individuals of European ancestry with a CFA who have available genotype information (248 individuals with a “Tongue, Mouth, and Pharynx” phecode CA and 446 individuals with a “Skull, Face, and Jaw” phecode CA). From the eMERGE GWAS cohort individuals of European ancestry at CCHMC, there are 384 individuals with a CFA (113 individuals with a Tongue, Mouth, and Pharynx phecode CA and 322 individuals with a Skull, Face, and Jaw phecode CA) (Table 1).
TWAS of individuals with CFAs identifies genes not previously associated with CFAs
Our TWAS of CFAs in both BioVU and eMERGE did not identify any statistically significant associations that passed the highly conservative Bonferroni multiple testing correction (0.05 divided by the number of genes with GPGE per tissue). When using a less stringent p-value threshold, we identified 1,261 genes in BioVU and 1,260 gene in eMERGE that were significantly associated (p < 0.05). We compared these genes in both study populations to our curated list of known CFA genes and found that the majority of our curated CFA genes (93.6%) demonstrate no level of significant association, even at a permissive p < 0.05 level, with CFAs in either BioVU and eMERGE (Fig. 2A). In total, fewer than 1% of significant genes in either BioVU or eMERGE were part of the curated previously known CFA-associated gene list. This included 11 significant genes from BioVU (0.90%) and 14 (1%) from eMERGE (Table 2). There was no overlap in significant genes known to association with a CFA in either BioVU or eMERGE.
Additionally, we compared the 1,261 and 1,260 significant associations (p < 0.05) in each study population and found 53 significant gene associations that were shared between the BioVU and eMERGE TWAS results (Fig. 2B). These 53 genes were not part of the curated gene list and were not identified previously as associated with CFAs or craniofacial structure. By using a more stringent p-value threshold (p < 0.001), we identified 231 genes in BioVU and 257 genes in eMERGE associated with CFAs (Table 2). From this more stringent cutoff we identified two genes, VAV1 (*164875) (BioVU p = 0.009; eMERGE p = 0.006) and CYP3A7 (*605340) (BioVU p = 0.009; eMERGE p = 0.009) that are shared between both cohorts (Fig. 2B). Neither of these genes were associated with CFAs or any human disease but have been implicated in development.^24,25^
CA-wide association study illustrates that known CFA genes are associated with a broad range of CAs across multiple organ systems
Because CFAs often co-occur with other CAs, we sought to evaluate whether the curated CFA associated gene list demonstrated more significant associations with other CAs. To do this we tested for the association between the GPGE of 341 CFA-associated genes from the curated list and any CA phecode in both BioVU and eMERGE study populations. Assuming common variation in these genes identified from clinical testing panels for CFAs were specific to craniofacial development, we would expect to find an enrichment of craniofacial phenotypes in the skull face and jaw as well as the mouth, tongue and pharynx phecodeX chapters. While we identified a few significant associations with these phecodes at the least stringent significance threshold (p < 0.05) (Fig. 3A and B), the most significant associations for any of the CFA-associated genes were identified in other organ system CAs. This analysis illustrates that a broad range of phenotypes spanning multiple organ systems are associated with common variation in CFA-associated genes (Fig. 4). Among our gene-based results, we found that the GPGE of GLI2 (*165230) is associated with 18 different CA phecodes in BioVU (p < 0.05). These include significant associations with CA phecodes from four organ systems, 12 heart, 1 eye, 1 musculoskeletal, and 1 respiratory, as well as 2 situs inversus CA phecodes (Fig. 5, Table 3).
Discussion
Because most previous genetic studies of CFAs focus on rare variant/monogenic causes of disease, this work investigated the GPGE of individuals with CFAs and how the GPGE of known CFA genes relate to CAs. We perform a TWAS for individuals with a CFA and a CA-wide PheWAS for the GPGE of genes with a known CFA association. Overall, the results of these analyses suggest that in addition to rare variants, polygenic variation impacting gene expression may contribute to many CAs and may play a role in the penetrance and expressivity of CA syndromes.
As sample sizes for genetic studies have increased, so has our understanding of the complexity of the genetic architecture driving phenotypes such as CFAs through our expanded understanding of how common variation contributes to the diversity of facial morphology and shape.^15,26,27^ Studies of CFAs and polygenic architecture have identified that there are shared genetic features between cleft lip and palate and the size of facial features as well as common variation that affects the penetrance of known cleft clip and palate variants in PDGFRA (*173490).^6,7^ Additionally, researchers have proposed that Mendelian CFAs are extreme phenotypes on a continuum of phenotypic variation in facial morphology and that integrating common variation into the study of these phenotypes is essential to understanding their genetic drivers.^27,28^ All of these studies leverage GWAS and polygenic scores to examine common variations contributions to CFAs. However, using an approach that utilizes GPGE allows us to capture the predicted effects of common variant gene expression in a TWAS to identify genes whose altered GPGE is associated with CFA status. By conducting a gene-based analysis of CFAs, we can obtain more biologically interpretable results for our CFA associations.
Out of the 391 genes known to cause CFAs in a monogenic fashion, over 90% are not significantly associated with CFAs in a transcriptome-wide fashion (p < 0.01), highlighting the gap in knowledge when studying CFAs that present in a Mendelian or syndromic fashion. The two genes whose GPGE showed a significant association (p < 0.01) with CFAs in both cohorts, VAV1 (*164875) and CYP3A7 (*605340), have not yet been associated with any human disease, but do have known roles in fetal development. VAV1 (*164875) is a proto-oncogene that is involved with hematopoiesis and T and B cell signaling. The well-established relationship between disrupted oncoprotein signaling, cancer, and congenital anomalies suggests that differential expression of the oncogene VAV1 (*164875) could be driving the development of some CFAs.^29, 30,24^ Additional research on the differential expression of proto-oncogenes in individuals with congenital anomalies could give additional insight into why those with congenital anomalies are more likely to develop cancer. CYP3A7 (*605340) encodes a cytochrome P450 “super family” enzyme, CYP3A7 (*605340), which metabolizes a diverse array of endogenous and exogenous substances, including prescription medications that many pregnant individuals need to maintain their own health such as carbamazepine, diltiazem, caffeine, and nifedipine. CYP3A7 (*605340) is primarily expressed in fetal liver tissue, being detected as early as 50 days of gestation and decreasing in expression until 24 months of age postnatally.^31,32^ Additionally, CYP3A7 (*605340) plays a key role in the production of a critical pregnancy hormone, estriol, which has been shown to be an important epigenetic modifier in mice fetuses. Variable expression of CYP3A7 (*605340) could have dramatic effects on fetal development, and further research can assess the complex interactions of environmental risks and genetic predispositions to CFAs and other CAs.
The observations that only 10% of the previously identified CFA associated genes had a significant GPGE association with CFAs and the strong overlap (21.1%) of individuals with CFAs having other CAs drove us to analyze what CAs were significantly associated with the GPGE of the curated CFA genes. These known CFA genes show more significant GPGE associations with other organ system CAs than the CFAs themselves, suggesting that there are shared genetic and environmental susceptibilities across CAs. The many genetic syndromes that contain multiple CAs support the idea of shared risk factors for many types of CAs. Taken together these results suggest that polygenic variation in CFA-associated genes may relate to developmental changes more broadly and are not necessarily restricted to craniofacial development.
Many genetic syndromes that cause multiple types of CAs demonstrate variable expression and incomplete penetrance, yet the factors causing variable expression/incomplete penetrance is not well understood. For example, ~ 65% of individuals with 22q11.2 deletion syndrome have a congenital heart defect (CHD) and ~ 67% have a palate abnormality.^33,34^ As another example, we noted that the predicted expression of GLI2’s (*165230) is significantly (p < 0.05) associated with many congenital heart defects (CHDs), such as congenital pulmonary valve stenosis (p = 0.0002), congenital malformations of heart valves (p = 0.0002), congenital insufficiency of the aortic valve (p = 0.0005), and ten others (Table 3). GLI2 (*165230) is classified as a known CFA gene due to its association with two congenital malformation syndromes, Culler-Jones syndrome and Holoprosencephaly 9.^35^ Both syndromes can present with several congenital anomalies, such as cleft lip/palate, microcephaly, polydactyly, but to date CHDs are not associated with either syndrome.^36,37^ Yet GLI2 (*165230) does have a well-established role in cardiomyogenesis and there is a group of individuals with CHDs that have GLI2 (*165230) missense variants shown to dysregulate sonic hedgehog signaling, which is crucial for fetal development.^,38,39,40^ The expression of GLI2 (*165230) in the developing heart suggests CHDs could possibly be a phenotypic expansion for the two GLI2-related congenital anomaly syndromes and suggests examining its role in cardiac development. Differences in gene expression could be one factor causing variable expressivity and reduced penetrance that is characteristic of many congenital anomaly syndromes.
One of the main limitations of studying CFAs at biobank scale is that they have a relatively low prevalence and are caused by large-effect rare variants. Both issues affect our statistical power to detect phenotype associations. One way that our analysis attempted to address this limitation is by conducting gene-based analyses and defining our CFA phenotype across multiple phecodes that describe different types of CFAs. Despite using a less stringent p-value threshold, conducting gene-based analyses such as TWAS provides a more interpretable biological unit than single variant analyses. While the analyses in this paper are underpowered, they still provide biologically and clinically meaningful results.^22,23^
Throughout this study we leveraged a set of known CFA genes that were compiled from clinical diagnostic testing for CFAs as well as a GWAS of face shape that was curated and reviewed by a certified genetic counselor. While we tried to make this gene set as comprehensive as possible, we are limited by which genes are currently have a well-established associated with CFAs. The main goal of the work in this study is to try to better understand the genetic drivers of CFAs.
Overall, our results support that both rare and common genetic variants in CFA Mendelian genes may contribute to a variety of CAs and highlights the complexities of the CA phenotypes, suggesting there are shared underlying genetic and environmental risk factors. Further research of CAs through GPGE could help better explain variable presentations and penetrance of CA syndromes.
Conclusions
TWAS of individuals with CFAs in the BioVU and eMERGE cohorts identified relatively few previously identified CFA-associated genes. The two genes whose GPGE had the strongest association in both cohorts have potential roles in the complex genetic drive of CAs. The predicted expression of genes that have a known Mendelian-association with CFAs are more often significantly associated with other types of CAs in the BioVU and eMERGE cohorts. For example, the predicted expression of GLI2, which is associated with a syndrome that can include CFAs, is significantly associated with several CHDs. The results of both analyses suggest that there are overlapping polygenic causes of many types of CAs and that with further research may help explain the variability in how CA syndromes can present.
Supplementary Material
This is a list of supplementary files associated with this preprint. Click to download.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Mai C. T. National population-based estimates for major birth defects, 2010–2014. Birth Defects Res. 111 (18), 1420–1435. 10.1002/bdr 2.1589 (2019).31580536 PMC 7203968 · doi ↗ · pubmed ↗
- 2Boulet S. L., Rasmussen S. A. & Honein M. A. A population-based study of craniosynostosis in metropolitan Atlanta, 1989–2003. Am. J. Med. Genet. A. 146A (8), 984–991. 10.1002/ajmg.a.32208 (2008).18344207 · doi ↗ · pubmed ↗
- 3Jin H. Chromosomal microarray analysis in the prenatal diagnosis of orofacial clefts: Experience from a single medical center in mainland China. Med. (Baltim). 97 (34), e 12057. 10.1097/MD.0000000000012057 (2018). · doi ↗
- 4Yan S. Exome sequencing improves genetic diagnosis of congenital orofacial clefts. Front. Genet. 14, 1252823. 10.3389/fgene.2023.1252823 (2023).37745857 PMC 10512413 · doi ↗ · pubmed ↗
- 5Wilkie A. O. M., Johnson D. & Wall S. A. Clinical genetics of craniosynostosis. Curr. Opin. Pediatr. 29 (6), 622–628. 10.1097/MOP.0000000000000542 (2017).28914635 PMC 5681249 · doi ↗ · pubmed ↗
- 6Yu Y. Polygenic risk impacts PDGFRA mutation penetrance in non-syndromic cleft lip and palate. Hum. Mol. Genet. 31 (14), 2348. 10.1093/hmg/ddac 037 (2022).35147171 PMC 9307317 · doi ↗ · pubmed ↗
- 7Howe L. J. Investigating the shared genetics of non-syndromic cleft lip/palate and facial morphology. PLOS Genet. 14 (8), e 1007501. 10.1371/journal.pgen.1007501 (2018).30067744 PMC 6089455 · doi ↗ · pubmed ↗
- 8Roden D. M. Development of a large-scale de-identified DNA biobank to enable personalized medicine. Clin. Pharmacol. Ther. 84 (3), 362–369. 10.1038/clpt.2008.89 (2008).18500243 PMC 3763939 · doi ↗ · pubmed ↗
