# Machine learning models incorporating genotype and ancestry improve severe asthma risk prediction

**Authors:** Nahian Tahmin, Lokesh K Chinthala, Franco Leonel Marsico, Silvia Buonaiuto, Akram Mohammed, Annette Carlisle, Yadu Gautam, Vincenza Colonna, Tesfaye B. Mersha, Robert L Davis, Anahita Khojandi

PMC · DOI: 10.1038/s41598-025-24080-x · Scientific Reports · 2025-11-17

## TL;DR

This study shows that stacking machine learning pipelines built on SNP genotypes and inferred local ancestry improves prediction of inhaled corticosteroid (ICS) response in a cohort of African American children with severe asthma.

## Contribution

A novel machine learning stacking technique integrating SNPs and local ancestry data is proposed to enhance clinical outcome prediction.

## Key findings

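- Stacked pipelines integrating SNP and LA data achieved an AUC of 0.729 ± 0.048 (paired t-test p = 0.005), compared with 0.693 ± 0.066 for the stacked SNP pipeline and 0.625 ± 0.103 for the stacked LA pipeline.
- Pipelines using LA data alone performed comparably to those using SNP data alone, but the most important features differ between the two data types, indicating that they capture distinct sources of variation.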
- The integration of SNP and LA data provides complementary insights for predicting ICS response in severe asthma.
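
A paired t-test of the kind reported above can be sketched as follows, assuming the pairing is over the ten cross-validation folds (this summary does not say what is paired). The per-fold AUC values are hypothetical numbers invented for illustration and are not taken from the paper; the function name `paired_t_statistic` is also an assumption.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for the paired differences b - a."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # t = mean difference divided by its standard error (sample SD / sqrt(n)).
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# HYPOTHETICAL per-fold AUCs, invented for illustration only (not study data).
auc_snp_only = [0.60, 0.72, 0.65, 0.70, 0.68, 0.75, 0.62, 0.71, 0.69, 0.66]
auc_combined = [0.66, 0.74, 0.70, 0.73, 0.70, 0.78, 0.69, 0.72, 0.75, 0.71]

t_stat, dof = paired_t_statistic(auc_snp_only, auc_combined)

# Two-sided critical value of Student's t at alpha = 0.05 with 9 degrees of freedom.
T_CRIT_DF9 = 2.262
significant = abs(t_stat) > T_CRIT_DF9
print(f"t = {t_stat:.3f}, df = {dof}, significant at 0.05: {significant}")
```

Pairing by fold compares the two pipelines on the same held-out samples, so fold-to-fold difficulty cancels out of the differences. Fold-wise scores from one cross-validation run are not fully independent, so a test like this is best read as indicative.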

## Abstract

This study proposes a novel machine learning (ML)-based stacking technique that integrates single nucleotide polymorphisms (SNPs) and inferred local ancestry (LA) to improve predictive accuracy for clinical outcomes. Asthma, particularly severe asthma (SA) with poor response to inhaled corticosteroids (ICS), serves as the case study to illustrate this approach. Using data from the Biorepository and Integrative Genomics (BIG) Initiative, which includes whole-exome sequencing data from a self-reported African American pediatric cohort (N=248), we develop an ML framework to predict ICS response. After SNP data preprocessing and LA estimation, we employ stratified 10-fold cross-validation, creating base pipelines for SNP and LA data, which are then combined in stacked pipelines to assess the effectiveness of integrating these distinct data types. The stacked SNP pipeline yields an AUC of 0.693 ± 0.066 and the stacked LA pipeline yields an AUC of 0.625 ± 0.103. Integrating LA with SNP data significantly improves predictive performance, raising the AUC to 0.729 ± 0.048 (paired t-test p-value = 0.005). Pipelines using LA data alone show comparable performance to those using SNP data alone. However, the most important contributing features differ between LA and SNP data, demonstrating that these data types capture distinct sources of variation and could provide complementary insights. This study highlights the potential of stacking ML pipelines, built on feature selection techniques together with logistic regression and random forest predictive models, to integrate SNP and LA data. Such a holistic approach has the promise to improve prediction of medication response in complex conditions like SA. This approach has broader implications for advancing personalized medicine through the effective use of multifactorial data.
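
The stacking idea described in the abstract can be sketched with scikit-learn as below. This is a minimal illustration on synthetic data, not the authors' implementation: the feature counts, the `SelectKBest`/`f_classif` selector, the number of selected features, the hyperparameters, and the helper names `base_pipeline` and `make_stack` are all assumptions. The abstract only states that base pipelines use feature selection with logistic regression and random forest models and are evaluated with stratified 10-fold cross-validation.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_samples, n_snp, n_la = 248, 100, 50  # cohort size from the abstract; feature counts assumed

# SYNTHETIC stand-ins: SNP genotype dosages and local-ancestry copy counts, both in {0, 1, 2}.
X_snp = rng.integers(0, 3, size=(n_samples, n_snp)).astype(float)
X_la = rng.integers(0, 3, size=(n_samples, n_la)).astype(float)
signal = 0.8 * X_snp[:, 0] - 0.8 * X_la[:, 0] + rng.normal(0.0, 1.0, n_samples)
y = (signal > np.median(signal)).astype(int)  # synthetic binary "response" label

X = np.hstack([X_snp, X_la])
snp_cols = list(range(n_snp))
la_cols = list(range(n_snp, n_snp + n_la))


def base_pipeline(columns, model, k=20):
    """Base pipeline: keep one data block, select k features, fit a model."""
    block = ColumnTransformer([("block", "passthrough", columns)])
    return make_pipeline(block, SelectKBest(f_classif, k=k), model)


def make_stack(blocks):
    """Stack a logistic regression and a random forest pipeline per data block."""
    estimators = []
    for name, cols in blocks.items():
        estimators.append((f"{name}_lr", base_pipeline(cols, LogisticRegression(max_iter=1000))))
        estimators.append((f"{name}_rf", base_pipeline(
            cols, RandomForestClassifier(n_estimators=50, random_state=0))))
    # The meta-learner is trained on out-of-fold predictions of the base pipelines.
    return StackingClassifier(estimators=estimators, final_estimator=LogisticRegression(), cv=5)


outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
stacks = {
    "snp": make_stack({"snp": snp_cols}),
    "la": make_stack({"la": la_cols}),
    "snp+la": make_stack({"snp": snp_cols, "la": la_cols}),
}
auc_by_stack = {
    name: cross_val_score(model, X, y, cv=outer_cv, scoring="roc_auc")
    for name, model in stacks.items()
}
for name, aucs in auc_by_stack.items():
    print(f"{name}: AUC = {aucs.mean():.3f} ± {aucs.std():.3f}")
```

Feature selection sits inside each pipeline so that it is refit on the training portion of every fold; selecting features on the full dataset before cross-validation would leak label information into the held-out folds. The printed AUCs describe the synthetic data only and say nothing about the study's results.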

## Linked entities

- **Diseases:** asthma (MONDO:0004979)

## Full-text entities

- **Diseases:** SA (MESH:D045169), Asthma (MESH:D001249)
- **Chemicals:** ICS (-)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12623762/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12623762/full.md

## References

7 references — full list in the complete paper: https://tomesphere.com/paper/PMC12623762/full.md

---
Source: https://tomesphere.com/paper/PMC12623762