# Coronary heart disease risk prediction based on GAIN imputation and interpretable machine learning

**Authors:** Shulin Zhao, Baoyun Nan, Jun Guo, Wenkai Xu, Zhen Li

PMC · DOI: 10.3389/fgene.2025.1752811 · Frontiers in Genetics · 2026-01-21

## TL;DR

This retrospective study develops an interpretable XGBoost model that predicts coronary heart disease (CHD) risk from GAIN-imputed demographic data, blood biomarkers, and vital signs.

## Contribution

A novel approach that combines GAIN imputation of missing values with an XGBoost classifier explained by SHAP values, aimed at interpretable and generalizable CHD risk prediction.

## Key findings

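- XGBoost outperformed the other evaluated algorithms, reaching a test-set AUC of 0.9053.
- Key predictors included mean respiratory rate during hospitalization, age, high-sensitivity troponin I (hs-cTnI), and hypertension.
- The authors consider the model well suited to dual deployment: automated screening within hospital EHRs and self-monitoring via mobile health applications.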

## Abstract

Coronary atherosclerotic heart disease (CHD) is a leading cause of morbidity and mortality worldwide, making timely identification critical for improving patient prognosis. However, traditional imaging examinations are limited by high costs and patient selection bias, while existing prediction models often lack interpretability and generalization ability. This study aimed to develop a robust, interpretable machine learning approach to address these challenges.

This retrospective study analyzed hospitalized patients at Quzhou People’s Hospital from July 2021 to March 2025. Patients diagnosed with CHD were categorized as positive samples, while those without cardiovascular disease served as negative controls. The dataset integrated demographic data, blood biomarkers, and vital signs. A Generative Adversarial Imputation Network (GAIN) was utilized to handle missing values, and multiple machine learning models were constructed and compared for prediction performance.
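GAIN trains a generator to propose values for missing entries, and the final table keeps every observed value while taking generator output only where data were missing. The sketch below illustrates only that mask-and-combine step. It is not the authors' implementation: `placeholder_generator` is a hypothetical stand-in that returns observed column means instead of a trained network, and the toy records and their columns are invented for illustration.

```python
def build_mask(rows):
    """Return 1 where a value is observed and 0 where it is missing (None)."""
    return [[0 if v is None else 1 for v in row] for row in rows]


def placeholder_generator(rows, mask):
    """Stand-in for a trained GAIN generator (assumption, not the real network).

    Proposes the observed column mean for every cell in that column.
    """
    n_rows, n_cols = len(rows), len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [rows[i][j] for i in range(n_rows) if mask[i][j]]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [list(means) for _ in range(n_rows)]


def combine(rows, mask, generated):
    """Keep observed values; use generated values only for missing cells."""
    n_cols = len(rows[0])
    return [
        [rows[i][j] if mask[i][j] else generated[i][j] for j in range(n_cols)]
        for i in range(len(rows))
    ]


# Hypothetical toy records: [age, mean respiratory rate, hs-cTnI]
records = [
    [63.0, 18.0, None],
    [71.0, None, 0.04],
    [55.0, 20.0, 0.02],
]

mask = build_mask(records)
imputed = combine(records, mask, placeholder_generator(records, mask))
```

In an actual GAIN setup the generator would be trained adversarially against a discriminator that tries to tell observed cells from imputed ones; only the final combination step is shown here.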

Among the evaluated algorithms, the XGBoost model achieved superior performance on the test set with an Area Under the Curve (AUC) of 0.9053. To enhance clinical utility, the integration of SHAP (SHapley Additive exPlanations) values enabled both global and local interpretation of model decisions. Key predictive factors identified included mean respiratory rate during hospitalization, age, high-sensitivity troponin I (hs-cTnI), and hypertension.
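For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. The following is a minimal sketch of that pairwise definition; the labels and scores are made up for illustration and are unrelated to the study's data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical predictions: 1 = CHD, 0 = control
labels = [1, 1, 1, 0, 0, 0]
scores = [0.91, 0.78, 0.40, 0.55, 0.30, 0.12]
auc = roc_auc(labels, scores)
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; library implementations typically use a sort-based computation instead.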

The developed model demonstrates robust prediction performance combined with high clinical interpretability. Unlike traditional “black box” models, this approach clarifies the contribution of specific risk factors. Crucially, the tool is well-suited for dual deployment: serving as an automated screening tool integrated into hospital electronic health records (EHRs) and as a self-monitoring aid for individuals with underlying health conditions via mobile health applications.
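To make the idea of a per-factor contribution concrete, the sketch below computes exact Shapley values for a tiny hand-written risk function. It is a simplification under stated assumptions: `toy_risk` and its coefficients are hypothetical, "absent" features are replaced by a single baseline patient, and every feature subset is enumerated, whereas SHAP applied to tree models uses a dedicated algorithm and a background dataset.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley attribution of model(x) - model(baseline) to each feature."""
    n = len(x)

    def value(subset):
        # Features in `subset` take the patient's value; the rest take the baseline.
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_risk(features):
    """Hypothetical risk score with one interaction term (not the paper's model)."""
    age, resp_rate, hypertension = features
    return 0.01 * age + 0.02 * resp_rate + 0.3 * hypertension + 0.005 * age * hypertension


patient = [70.0, 22.0, 1.0]   # hypothetical patient
baseline = [50.0, 16.0, 0.0]  # hypothetical reference patient
phi = shapley_values(toy_risk, patient, baseline)
```

The attributions sum to the difference between the patient's score and the baseline score, which is the additive property that lets a clinician read each value as that factor's share of the prediction.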

## Linked entities

- **Diseases:** cardiovascular disease (MONDO:0004995)

## Full-text entities

- **Genes:** TNNI3 (troponin I3, cardiac type) [NCBI Gene 7137] {aka CMD1FF, CMD2A, CMH7, RCM1, TNNC1, cTnI}
- **Diseases:** cardiovascular disease (MESH:D002318), CHD (MESH:D003327), hypertension (MESH:D006973)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12867338/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12867338/full.md

## References

36 references — full list in the complete paper: https://tomesphere.com/paper/PMC12867338/full.md

---
Source: https://tomesphere.com/paper/PMC12867338