A Categorical ANCOVA Approach to Severity Endophenotype-Specific Genome-Wide Association Studies in Childhood Asthma
Shraddha Piparia, Parham Hadikhani, John Ziniti, Julian Hecker, Alvin T. Kho, Rinku Sharma, Juan C. Celedón, Michael J. McGeachie, Scott T. Weiss, Kelan G. Tantisira

TL;DR
This study shows that analyzing all asthma severity types together uncovers more genetic links than traditional methods, improving understanding of asthma subtypes.
Contribution
A new statistical method using ANCOVA improves discovery and replication of genetic associations in asthma severity subtypes.
Findings
ANCOVA identified 244 genome-wide significant SNPs in CAMP, with six loci replicated in GACRS.
Logistic regression found fewer significant associations and only one replication in GACRS.
Modeling all subtypes together reveals biologically meaningful signals missed by pairwise approaches.
Abstract
Objective: Asthma is a complex and heterogeneous syndrome, making it hard to predict disease progression and suitable treatments. One strategy for reducing this uncertainty is to define genetic subtypes, or endophenotypes, that capture shared biological mechanisms. Most genome-wide studies, however, compare one subgroup against all others within a single cohort and rarely replicate their findings. We aimed to determine whether simultaneously modeling all asthma endophenotypes improves the discovery and replication of genetic associations compared with the standard one-versus-rest approach. Methods: We analyzed common single-nucleotide polymorphisms (SNPs) in the Childhood Asthma Management Program (CAMP) using an analysis of covariance (ANCOVA) across all severity-related endophenotypes, adjusting for age, sex, and ancestry principal components. SNPs showing genome-wide significance…
Figure 1
Figure 2
Figure 3
Funding: National Institutes of Health, United States
Taxonomy
Topics: Asthma and respiratory diseases · Genetic Associations and Epidemiology · Genomics and Rare Diseases
1. Introduction
Asthma is a complex condition that affects about 30 million Americans and about 300 million people worldwide [1]. Asthma is a multifactorial syndrome arising from diverse combinations of genetic variations and environmental exposures and comprising unique molecular mechanisms that result in marked disease heterogeneity [2,3,4]. Such heterogeneity is evidenced by substantial differences in disease triggers, progression, and treatment responses among affected individuals [5,6], underscoring the need to define and understand underlying endotypic differences. Conventional genome-wide association studies (GWAS) usually include all asthma patients in a single case group [7], implicitly assuming a shared genetic architecture that likely misses loci that contribute to specific subtypes of asthma. Recognizing these clinically meaningful subgroups has prompted efforts to incorporate endotypic information into genetic and genomic studies [8,9,10], highlighting the need for stratified or subtype-aware analytical methods.
Several efforts have been made to classify asthma patients into distinct severity subtypes to uncover the biological mechanisms underlying severity [11,12,13]. However, GWAS of asthma severity endotypes remain largely unexplored, and existing studies usually collapse endotypes back into a case-control framework rather than treating them as true categorical outcomes. Because endotypes are discrete categories rather than a continuous trait, linear or binary comparisons are limited, particularly when more than one group shares the same genetic influences. Existing GWAS of endotypes have typically taken a one-versus-rest, case-control approach, testing each endotype separately against all other subjects with a simple logistic regression model [14,15,16,17,18,19]. This strategy multiplies the number of hypotheses, inflating the multiple-testing burden and making it harder to discover subtype-specific loci. Multivariate or multi-level association models have been increasingly recognized as more powerful alternatives to pairwise approaches [20,21,22]. Analytical approaches such as analysis of variance (ANOVA) and its covariate-adjusted form, analysis of covariance (ANCOVA), which can model all endophenotypes simultaneously and better capture shared or overlapping genetic influences, have yet to be fully explored in this context.
In this study, we investigate genetic variants that drive biological differences between clinically defined asthma endophenotypes. The endophenotypes used in this analysis were previously defined using principal component analysis (PCA) of baseline clinical features across three independent pediatric asthma cohorts (CAMP, PACT, and GACRS) [23]. Subjects were grouped into five ordinal categories (Q1–Q5) based on quintiles of their PC1 scores, which captured major variation in asthma severity and atopy. Across cohorts, PC1 loadings were dominated by markers of atopic status (IgE, positive skin test, etc.), lung function (FEV1/FVC ratio, peak expiratory flow, etc.), and related demographic factors (age, sex, age of onset, etc.), together explaining the majority of variance in baseline clinical presentation. Q1 represented children with milder disease and higher lung function, whereas Q5 included those with greater atopic burden, lower lung function, and other markers of more severe asthma.
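The quintile grouping described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `assign_quintiles`, the toy scores, and the assumption that higher PC1 scores correspond to higher-order (more severe) categories are all ours.

```python
from statistics import quantiles

def assign_quintiles(pc1_scores):
    """Map each PC1 score to an ordinal label Q1..Q5 using sample quintile cut points."""
    # quantiles(..., n=5) returns the four cut points that separate five bins.
    cuts = quantiles(pc1_scores, n=5, method="inclusive")
    labels = []
    for score in pc1_scores:
        # Count how many cut points the score exceeds: 0 -> Q1, 4 -> Q5.
        bin_index = sum(score > c for c in cuts)
        labels.append(f"Q{bin_index + 1}")
    return labels

# Hypothetical PC1 scores for ten subjects (illustrative values only).
scores = [-2.1, -1.4, -0.9, -0.3, 0.0, 0.2, 0.7, 1.1, 1.8, 2.5]
labels = assign_quintiles(scores)
print(list(zip(scores, labels)))
```

In a real analysis the cut points would be computed per cohort from the PC1 scores of all subjects, so each category holds roughly one fifth of the sample.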
This PCA-based framework provided a reproducible, quantitative definition of asthma endophenotypes that remained consistent across independent pediatric cohorts and was shown to predict corticosteroid treatment response. Similar dimension-reduction and clustering frameworks have previously been used to delineate asthma phenotypes and inflammatory endotypes, demonstrating that latent clinical structure can reveal biologically meaningful subgroups [24,25]. To identify which clinical endophenotypes are enriched or depleted for the risk allele, we apply ANCOVA. In genetic analysis, ANCOVA evaluates whether allele frequencies differ across multiple endophenotypes while adjusting for covariates such as ancestry or clinical factors. Unlike traditional one-versus-rest or pairwise logistic models, which compare each contrast separately, ANCOVA tests all groups simultaneously. This framework allows us to capture multi-cluster heterogeneity and directly evaluate subtype-specific differences in a statistically efficient way. ANCOVA not only reduces within-group error variance and increases statistical power, but also allows adjustment for key clinical and demographic covariates that could confound genetic associations. By avoiding the inflation inherent to multiple pairwise tests and instead evaluating all endophenotypes in a unified framework, ANCOVA facilitates the identification of both subtype-specific and shared loci, providing a biologically coherent understanding of asthma heterogeneity.
2. Methods
2.1. Study Populations
The analysis was first conducted in the Childhood Asthma Management Program (CAMP, discovery cohort) and then replicated in the Genetics of Asthma in Costa Rica Study (GACRS, replication cohort). CAMP [26,27] was a multicenter, randomized clinical trial of inhaled corticosteroids to prevent severe asthma exacerbations in 1041 children aged 5 to 12 years with mild to moderate persistent asthma. GACRS [28] was an observational cross-sectional study of 1165 Costa Rican children aged 6 to 14 years with physician-diagnosed asthma and at least two respiratory symptoms or a history of asthma attacks in the previous year. Written parental consent and/or the subject's assent were obtained for each study protocol and ancillary genetic testing. Study protocols were approved by local Institutional Review Boards at each recruitment site for both studies, and by the Institutional Review Board of Brigham and Women's Hospital.
2.2. Genotyping and Quality Control
SNPs were assayed on high-density Illumina arrays (Illumina Inc., San Diego, CA, USA). CAMP subjects were genotyped on 550K v3 and 610 Quad BeadChips, while GACRS participants were genotyped on OmniExpress and Omni2.5 BeadChips. Array-specific quality control (QC) was conducted in PLINK v1.9 [29]. Samples were excluded for call rate < 95%, absolute heterozygosity deviations > 0.20 from the cohort mean, sex discrepancies, Mendel error rates in pedigrees, or excess relatedness identified by pairwise identity-by-state sharing. Variants were removed if they were monomorphic, had a call rate < 95%, had a minor-allele frequency (MAF) < 5%, or violated Hardy-Weinberg equilibrium ( ). Cleaned datasets from each chip were merged using PLINK v1.9, then phased and imputed on the Michigan Imputation Server to the Haplotype Reference Consortium (HRC) panel [30]. Post-imputation QC was performed in PLINK v2.0 [29]. We excluded variants with a call rate < 95% or a MAF < 5%, and removed samples with genotype missingness > 5%. After QC, the analysis included 792 subjects from CAMP and 1030 subjects from GACRS, with 3,384,590 SNPs common to both cohorts.
2.3. Statistical Analysis
Five endophenotypes of asthma were defined using multivariate clinical characteristics [23]. Briefly, principal component analysis (PCA) was applied to baseline clinical features, and subjects were grouped into five ordinal categories (Q1–Q5) according to quintiles of their first principal component (PC1) scores. In prior work, this PC1 axis captured major variation in asthma severity and atopy and demonstrated reproducible endotype structure across three pediatric cohorts, including CAMP and GACRS [23]. For reproducibility, PCA loadings from CAMP are provided in Supplementary Table S1. Researchers may standardize their clinical variables and multiply them by this loading matrix to reproduce PC scores and assign endophenotypes. We treat the five endophenotypes as the five levels of a single categorical factor and test them simultaneously within one model. This framework accounts for between-cluster distinctions that are not apparent in binary splits and maintains the latent continuum captured by the PC1-derived endophenotype bins, while also reducing the need for multiple pairwise tests and the associated multiple-testing burden. We tested whether allele dosages differed across endophenotype groups using an ANCOVA model:
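One way to write a model consistent with this description is shown below. The notation is ours and is only a sketch: we assume the additive allele dosage $g_i$ of subject $i$ is the response, the endophenotype $E_i$ enters as a five-level factor with Q1 as the reference level, and the covariates enter linearly.

```latex
\[
g_i = \beta_0 + \sum_{k=2}^{5} \alpha_k \,\mathbb{1}[E_i = Q_k]
      + \gamma_1\,\mathrm{age}_i + \gamma_2\,\mathrm{sex}_i
      + \sum_{j=1}^{10} \delta_j\,\mathrm{PC}_{ij} + \varepsilon_i,
\qquad
H_0:\ \alpha_2 = \alpha_3 = \alpha_4 = \alpha_5 = 0 .
\]
```

Under this parameterization a single F test of $H_0$ asks whether the mean dosage differs in any endophenotype after covariate adjustment, which is what replaces the separate one-versus-rest tests.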
Covariates such as age, sex, and ancestry principal components (PC1–PC10) were included to account for population stratification and to avoid confounding by demographic factors. Model fit was summarized using the F statistic, the ratio of between-group variance to within-group variance. To identify which endophenotypes were driving overall associations, we conducted post-hoc pairwise group comparisons using Tukey's Honestly Significant Difference (HSD) test. Genome-wide significance thresholds were applied to ANCOVA results, and Tukey's contrasts were used to report endophenotype-specific allele frequency differences. The top replicated ANCOVA SNPs were then subjected to one-vs.-rest logistic regression:
run separately for each of the five endophenotypes. To further evaluate whether the categorical ANCOVA signals reflected an underlying continuous severity axis, we applied ordinal regression using severity scores to test for monotonic trends in risk-allele frequency across endotypes. All statistical analyses were conducted in R (version 4.4.2) using the aov(), lm(), and glm() functions. F statistics, odds ratios, and p-values are reported.
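The covariate-adjusted F test for the endophenotype factor can be illustrated with a nested-model comparison. The sketch below is a hand-rolled, standard-library illustration under our own assumptions, not the authors' R code: the function names, the 0-based group coding (group 0 as reference), and the toy data are hypothetical. It compares a covariates-only model with a model that also includes the group indicators, which corresponds to the F test for the factor when it is entered after the covariates.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def rss(design, y):
    """Residual sum of squares of an ordinary least-squares fit."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    beta = solve_linear(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, yi in zip(design, y))

def ancova_f(dosage, group, covariates, n_groups=5):
    """F statistic for the group factor after adjusting for covariates."""
    # Reduced model: intercept plus covariates only.
    reduced = [[1.0] + list(cov) for cov in covariates]
    # Full model: reduced model plus indicators for groups 1..n_groups-1.
    full = [r + [1.0 if g == k else 0.0 for k in range(1, n_groups)]
            for r, g in zip(reduced, group)]
    rss_reduced, rss_full = rss(reduced, dosage), rss(full, dosage)
    df_between = n_groups - 1
    df_within = len(dosage) - len(full[0])
    return ((rss_reduced - rss_full) / df_between) / (rss_full / df_within)

# Hypothetical toy data: ten subjects, five groups, one covariate (age).
group = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
age = [[5.0], [7.0], [6.0], [9.0], [8.0], [10.0], [7.0], [11.0], [9.0], [12.0]]
dosage = [0, 0, 0, 1, 1, 0, 1, 2, 2, 1]
print(ancova_f(dosage, group, age))
```

With no covariates the same function reduces to the classical one-way ANOVA F statistic, which is a convenient sanity check. A genome-wide scan would repeat this test per SNP and convert F to a p-value using the F distribution, which the standard library does not provide.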
2.4. Machine Learning Prediction
To evaluate predictive performance, we trained both Elastic Net and XGBoost classifiers using SNPs that passed a significance threshold of in CAMP (1976 variants), followed by LD clumping (windows of 250 SNPs with a step size of 50 variants, and retained variants with ) to yield 247 independent SNPs. Classifiers were trained in CAMP with cross-validation and assessed using one-vs.-rest receiver operating characteristic (ROC) across the five endophenotypes. For both models, hyperparameters were tuned using five-fold cross-validation within the CAMP cohort to optimize for AUC. The elastic net model was tuned for a range of regularization strengths (C) and L1 ratios, while the XGBoost model was tuned over key parameters including learning rate, tree depth, and regularization terms (L1/L2). Performance was then evaluated in the independent GACRS cohort.
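The one-vs.-rest ROC evaluation can be sketched as below. This is an illustration under our own assumptions: the function names, the three-class toy example (the study uses five endophenotypes), and the predicted probabilities are hypothetical, and AUC is computed through its rank interpretation rather than with any particular library.

```python
def binary_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count one half)."""
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def one_vs_rest_auc(true_classes, prob_rows, classes):
    """Per-class AUC: class k versus all others, scored by the predicted probability of k."""
    return {
        k: binary_auc([t == k for t in true_classes], [row[i] for row in prob_rows])
        for i, k in enumerate(classes)
    }

# Hypothetical predicted class probabilities for six subjects and three classes.
classes = ["Q1", "Q2", "Q3"]
truth = ["Q1", "Q1", "Q2", "Q2", "Q3", "Q3"]
probs = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.3, 0.6, 0.1],
    [0.5, 0.3, 0.2],
    [0.1, 0.2, 0.7],
    [0.2, 0.3, 0.5],
]
per_class = one_vs_rest_auc(truth, probs, classes)
average_auc = sum(per_class.values()) / len(per_class)
print(per_class, average_auc)
```

An AUC near 0.5 for every class, as reported for the external cohort, means the predicted probabilities rank members of a class no better than chance.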
3. Results
Table 1 summarizes the baseline characteristics of the CAMP and GACRS asthma cohorts, stratified by endophenotype. In both cohorts, mean age increased significantly and progressively with endophenotype order, whereas pre-bronchodilator FEV1 percent predicted (preBDFEV1PP) decreased significantly and progressively, consistent with increasing asthma severity. There was no significant difference in the participants' sex across endophenotypes in either CAMP or GACRS. In CAMP, non-Hispanic white participants were more likely to be in endophenotype 1, while non-Hispanic Black participants were more likely to be classified in endophenotype 5.
The multivariable ANCOVA analysis was adjusted for age, sex, and the top ten genetic ancestry PCs in the CAMP cohort. Figure 1 shows the Manhattan plot from the ANCOVA models, and Table 2 summarizes the six LD-independent SNPs that reached genome-wide significance in CAMP and their replication statistics in GACRS. In the discovery cohort, CAMP, 244 SNPs were found to meet the genome-wide significance threshold of . After LD clumping, six unique SNPs remained significant (rs10964536, rs28892326, rs2823880, rs10086065, rs12448208, rs2754324) with ANCOVA F values 10.3–12.0 ( ). The tables also show the F score measuring overall heterogeneity, the odds ratio, and post hoc comparisons revealing significant group differences. Applying the identical ANCOVA model to GACRS confirmed a nominal association ( ) for all six loci.
In CAMP (Table 2), the top signal was rs10964536, located on chromosome 9 (F value = 12.03). Endophenotype-specific contrasts indicated that allele dosage differed significantly for endophenotypes 4 vs. 1, 5 vs. 1, and 4 vs. 3, indicating that carriers are enriched in the higher-order endophenotypes. In CAMP and GACRS, these higher-order endophenotypes correspond to PC1 quintiles Q4–Q5, which were characterized by lower baseline lung function, higher IgE and eosinophil counts, and greater atopic burden compared with Q1–Q2 [23]. The same marker replicated in GACRS with F = 3.23 (p = 0.0121) and a significant 5 vs. 1 contrast. For rs28892326, rs2823880, and rs10086065, ANCOVA in CAMP yielded F values of 11.4–11.9 with significant 4 vs. 1 and 5 vs. 1 comparisons that survived Tukey correction (p ). Each locus replicated nominally in GACRS ( ); rs2823880 showed the strongest replication (F = 5.12, ) with multiple post-hoc contrasts (4 vs. 1, 4 vs. 2, 4 vs. 3). rs12448208 also separated 2 vs. 1 and 4 vs. 1 in CAMP and displayed nominal replication in GACRS. Similarly, rs2754324 distinguished endophenotype 4 from 1 and 3 in CAMP, with nominal replication for 5 vs. 3 in GACRS.
Across the six top SNPs (six SNPs × five endophenotypes = 30 possible one-vs.-rest tests), logistic regression (Table 3) showed that 12 of 30 contrasts (40%) reached nominal significance (p < 0.05) before multiple-test correction. After Bonferroni adjustment (alpha = 0.05/30 ≈ 0.0017), only 4 contrasts (13%) remained significant. For logistic regression after Bonferroni correction in GACRS for 30 contrasts ( ), only rs2823880 remained statistically significant for endophenotype 4 vs. rest. These results underscore that conventional one-vs.-rest logistic regression captures only a fraction of the associations detected by ANCOVA, which tests all endophenotypes simultaneously and therefore retains greater power to identify multi-group differences.
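The multiple-testing arithmetic for the one-vs.-rest tests can be checked with a few lines. This is only a worked calculation; the helper name `bonferroni_significant` and the example p-values are hypothetical and are not results from the study.

```python
n_snps, n_endophenotypes = 6, 5
n_tests = n_snps * n_endophenotypes   # one-vs.-rest tests across all top SNPs
alpha = 0.05
threshold = alpha / n_tests           # Bonferroni-adjusted per-test threshold

def bonferroni_significant(p_values, alpha=0.05):
    """Return the p-values that survive Bonferroni correction for len(p_values) tests."""
    cutoff = alpha / len(p_values)
    return [p for p in p_values if p < cutoff]

def share(count, total):
    """Percentage of contrasts, rounded to one decimal place."""
    return round(100 * count / total, 1)

print(n_tests, threshold)
print(share(12, n_tests), share(4, n_tests))

# Hypothetical p-values, for illustration only.
example = [0.0004, 0.03, 0.2, 0.011]
print(bonferroni_significant(example))
```

The threshold is about 0.0017, and 12 and 4 of 30 contrasts correspond to 40% and roughly 13%, respectively.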
The minor-allele frequencies differed substantially across the endophenotypes and reflected the post-hoc contrasts identified by ANCOVA (Figure 2). For all six loci in CAMP, the minor-allele dosage was significantly higher in high-severity endophenotypes. Notably, rs10964536 showed strong enrichment in endophenotypes 4 and 5 compared with 1, as well as for 4 vs. 3, with delta MAF up to 12%. Similarly, rs28892326 (delta MAF ∼8–9%), rs2823880 (delta MAF ∼9%), and rs10086065 (delta MAF 7.5–10%) showed enrichment in endophenotypes 4 and 5 vs. 1. rs12448208 (delta MAF ∼9%) showed enrichment earlier along the severity axis, in endophenotypes 2 and 4 compared with 1. rs2754324 (delta MAF 10–11%) showed the strongest variation in endophenotype 4 compared with 1 and 3. In GACRS, five of the six loci showed the same directional trend toward higher minor-allele frequency in more severe endophenotypes; the only exception was rs12448208, which did not replicate this trend and showed the opposite pattern.
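The delta MAF between two endophenotypes follows directly from allele dosages, since each diploid subject carries two alleles. The sketch below uses hypothetical dosages and our own function names; it is not derived from the study data.

```python
def minor_allele_frequency(dosages):
    """Allele frequency from additive dosages (0 to 2 copies per diploid subject)."""
    return sum(dosages) / (2 * len(dosages))

def delta_maf(dosages_by_group, group_a, group_b):
    """Difference in minor-allele frequency between two endophenotype groups."""
    return (minor_allele_frequency(dosages_by_group[group_a])
            - minor_allele_frequency(dosages_by_group[group_b]))

# Hypothetical dosages for five subjects per group (illustrative only).
dosages_by_group = {
    "Q1": [0, 0, 1, 0, 0],   # 1 minor allele among 10 alleles
    "Q4": [0, 1, 1, 0, 0],   # 2 minor alleles among 10 alleles
}
difference = delta_maf(dosages_by_group, "Q4", "Q1")
print(f"delta MAF (Q4 vs. Q1): {100 * difference:.1f}%")
```

With imputed data the dosages are fractional expected allele counts, and the same formula applies.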
To further verify that the categorical ANCOVA signals follow a continuous severity axis, we tested the six genome-wide discovery loci with an ordinal trend model using PC1 scores. Each SNP's risk-allele frequency increased in the same direction as asthma severity in both cohorts, with modest p-values (range 0.001–0.1), supporting a severity trend and the increased power from discrete endophenotype grouping. In CAMP, the Elastic Net classifier achieved good discrimination across severity endophenotypes, with per-class AUCs ranging from 0.73 to 0.87 and an average AUC of 0.81 (Figure 3). XGBoost performed comparably for some classes but was less stable overall (AUCs – ). In contrast, external evaluation in GACRS showed no generalization, with both Elastic Net and XGBoost models performing at chance level across all classes (AUCs – ).
4. Discussion
In this study, we evaluated the genetics of asthma severity endophenotypes using a novel ANCOVA approach. Specifically, we employed ANCOVA to test whether allele frequencies differ across clinically defined severity endophenotypes and contrasted this with a more conventional one-versus-rest logistic regression framework. Applying these methods in two independent pediatric asthma cohorts (CAMP and GACRS), we found that ANCOVA detected 244 genome-wide significant SNPs in CAMP, which LD clumping reduced to six independent loci, all of which replicated in GACRS. By comparison, the one-versus-rest logistic models identified fewer significant contrasts, highlighting the improved sensitivity and power of ANCOVA for capturing genetic determinants across multiple severity endophenotypes simultaneously. When the six loci were each subjected to five one-vs.-rest logistic regressions, only four contrasts were significant, whereas post-hoc group differences from ANCOVA revealed 13 significant contrasts in CAMP. In GACRS, ANCOVA identified eight contrasts, whereas only one logistic regression contrast survived multiple-testing correction. These results suggest that ANCOVA outperforms multiple one-vs.-rest logistic tests in power and cross-cohort consistency because it requires fewer tests, encodes all contrasts within one model, and makes better use of within-cohort heterogeneity. Every significant ANCOVA contrast corresponded to a delta MAF of ≥7.5% in CAMP and ≥4% in GACRS. Moreover, five of the six loci showed the same upward MAF gradient toward severe endophenotypes in both cohorts, the exception being rs12448208. Risk-allele frequencies increase with asthma severity, and the MAF patterns reinforce the genetic findings across endophenotypes.
The clinically defined severity endophenotypes were used to identify allele-frequency differences using ANCOVA with Tukey post-hoc group contrasts. The F statistic captures any heterogeneity across endophenotypes, and the Tukey contrasts then identify which endophenotypes drive it. One-vs.-rest logistic regressions do not reveal group contrasts and carry a heavier multiple-testing burden. ANCOVA with categorical endophenotypes increases discovery power and reveals the varying contrasts. Notably, a previous attempt to address dimensionality and power [31] uses logit-transformed allele frequencies and models their interaction to explain cluster differences via simulations. While this method offers advantages in scalability, it lacks statistical inference and biological interpretability. By contrast, our approach employs ANCOVA to explicitly test varying allele frequencies across endophenotypes, producing biologically interpretable inferences and enabling cross-cohort validation. This suggests that ANCOVA can detect allele-frequency differences across groups and, by reducing the number of tests, may mitigate some of the sample size limitations inherent to endophenotype analyses.
Across the six loci, five map within or near transcribed genes or open reading frames, while one SNP (rs10964536) lies in an intergenic region with no clear functional annotation. Among these, rs28892326 is located within DGKI, which regulates airway smooth muscle proliferation and remodeling [32]. The remaining loci fall within genes or non-coding regions with limited or indirect links to airway or immune biology and therefore require further study to clarify their relevance. For example, rs2823880 is located within MIR99AHG, the host gene of the miR-99a/let-7c/miR-125b-2 cluster [33,34], and rs12448208 is located near SNX20 [35]. Overall, these patterns suggest that a subset of loci has plausible biological relevance but requires additional validation to determine its role in asthma.
Beyond association testing, we explored whether the top SNPs could stratify clinical endophenotypes using machine-learning classifiers. In CAMP, Elastic Net models trained on 247 LD-pruned SNPs achieved good within-cohort discrimination (average AUC = 0.81), while XGBoost performed less consistently (Figure 3). However, when applied to the external GACRS cohort, neither model generalized, with AUCs close to 0.5 for all endophenotypes. Notably, preselecting SNPs based on marginal association ( ) may contribute to model overfitting, because feature selection was informed by the same dataset used for model training. Although cross-validation was applied, this practice can inflate apparent predictive performance within the discovery cohort and limit generalizability across independent samples.
Our study had limited statistical power due to the small sample size of the endophenotype groupings, and larger, diverse asthma cohorts will therefore be required both to replicate these associations and to determine the true clinical relevance of the implicated loci across populations. Further, severity endophenotypes were derived using baseline clinical features, and longitudinal reassignment may change genotype-severity endophenotype mapping. Our results indicate that risk alleles are more common in severe asthma endophenotypes, and a severity-weighted polygenic risk score incorporating these endophenotype-specific variants may enhance the prediction of exacerbation risk and corticosteroid response. Additionally, SNP feature selection was performed on the full CAMP dataset prior to model training, which introduces the possibility of information leakage and may inflate within-cohort cross-validated performance. Although we attempted external validation in GACRS, the lack of model generalization highlights the limited transferability of CAMP-derived predictors and underscores the need for larger cohorts and nested feature-selection frameworks in future work.
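The nested feature-selection idea mentioned above can be sketched as follows: SNPs are screened inside each cross-validation fold using the training subjects only, so held-out subjects never inform which features are kept. Everything here is a hypothetical illustration under our own assumptions (function names, the crude covariance-based screen, the interleaved fold assignment, and the toy data); it is not the pipeline used in the study.

```python
from statistics import mean

def association_score(feature, outcome):
    """Absolute sum of cross-products between one SNP and the outcome (a crude marginal screen)."""
    fm, om = mean(feature), mean(outcome)
    return abs(sum((f - fm) * (o - om) for f, o in zip(feature, outcome)))

def select_top_snps(rows, outcome, k):
    """Indices of the k SNP columns with the strongest marginal association."""
    n_snps = len(rows[0])
    scores = [association_score([r[j] for r in rows], outcome) for j in range(n_snps)]
    ranked = sorted(range(n_snps), key=lambda j: -scores[j])
    return sorted(ranked[:k])

def nested_selection_folds(rows, outcome, n_folds, k):
    """Yield (train_idx, test_idx, selected_snps); selection uses training subjects only."""
    n = len(rows)
    for fold in range(n_folds):
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        selected = select_top_snps([rows[i] for i in train_idx],
                                   [outcome[i] for i in train_idx], k)
        yield train_idx, test_idx, selected

# Hypothetical dosages (six subjects x three SNPs) and severity scores.
rows = [[0, 1, 2], [1, 0, 2], [2, 1, 0], [0, 2, 1], [1, 1, 1], [2, 0, 0]]
outcome = [1, 2, 5, 1, 3, 4]
folds = list(nested_selection_folds(rows, outcome, n_folds=3, k=2))
for train_idx, test_idx, selected in folds:
    # A classifier would be fit on rows[train_idx] restricted to `selected`
    # and scored on rows[test_idx]; that step is omitted here.
    print(test_idx, selected)
```

Selecting SNPs once on the full discovery cohort and then cross-validating, by contrast, lets the held-out folds influence the feature set, which is the leakage described above.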
In summary, our study indicates that ANCOVA applied to clinically defined asthma severity endophenotypes provides a complementary approach to logistic regression, enabling detection of group-level allele-frequency differences in settings where sample size is limited. By testing allele-frequency differences across categorical endophenotypes, ANCOVA identified genome-wide associations that replicated across cohorts and showed gradients consistent with asthma severity. Several loci were located in genes and regulatory regions related to inflammation, airway remodeling, and immune function. Elastic Net captured within-cohort variation but did not generalize across cohorts, underscoring limitations in transferability. Larger and more diverse studies, together with functional analyses, will be needed to confirm these findings and clarify their potential relevance for risk prediction and treatment.
References
- 1. CDC.gov. CDC - Asthma - Data and Surveillance - Asthma Surveillance Data. 2024. Available online: https://www.cdc.gov/asthma-data/about/most-recent-asthma-data.html (accessed on 24 June 2025).
- 2. Kuruvilla M.E., Lee F.E.H., Lee G.B. Understanding asthma phenotypes, endotypes, and mechanisms of disease. Clin. Rev. Allergy Immunol. 2019, 56, 219–233. doi:10.1007/s12016-018-8712-1. PMID: 30206782. PMCID: PMC6411459.
- 3. Fainardi V., Esposito S., Chetta A., Pisi G. Asthma phenotypes and endotypes in childhood. Minerva Medica 2021, 113, 94–105. doi:10.23736/S0026-4806.21.07332-8. PMID: 33576199.
- 4. Kaur R., Chupp G. Phenotypes and endotypes of adult asthma: Moving toward precision medicine. J. Allergy Clin. Immunol. 2019, 144, 1–12. doi:10.1016/j.jaci.2019.05.031. PMID: 31277742.
- 5. Haldar P., Pavord I.D., Shaw D.E., Berry M.A., Thomas M., Brightling C.E., Wardlaw A.J., Green R.H. Cluster analysis and clinical asthma phenotypes. Am. J. Respir. Crit. Care Med. 2008, 178, 218–224. doi:10.1164/rccm.200711-1754OC. PMID: 18480428. PMCID: PMC3992366.
- 6. Torgerson D.G., Ampleford E.J., Chiu G.Y., Gauderman W.J., Gignoux C.R., Graves P.E., Himes B.E., Levin A.M., Mathias R.A., Hancock D.B. Meta-analysis of genome-wide association studies of asthma in ethnically diverse North American populations. Nat. Genet. 2011, 43, 887–892. doi:10.1038/ng.888. PMID: 21804549. PMCID: PMC3445408.
- 7. García-Sánchez A., Isidoro-García M., García-Solaesa V., Sanz C., Hernández-Hernández L., Padrón-Morales J., Lorente-Toledano F., Dávila I. Genome-wide association studies (GWAS) and their importance in asthma. Allergol. Immunopathol. 2015, 43, 601–608. doi:10.1016/j.aller.2014.07.004. PMID: 25433770.
- 8. Conrad L.A., Cabana M.D., Rastogi D. Defining pediatric asthma: Phenotypes to endotypes and beyond. Pediatr. Res. 2021, 90, 45–51. doi:10.1038/s41390-020-01231-6. PMID: 33173175. PMCID: PMC8107196.
