# Development and validation of a machine learning model for on-site prediction of coronary heart disease in high-risk adults using clinical data

**Authors:** Liwen Mo, Hua Lin, Chengxuan Li, Lifei Yu, Decheng Lu

PMC · DOI: 10.1371/journal.pone.0334881 · 2025-11-13

## TL;DR

A machine learning model was developed and validated on clinical data to predict coronary heart disease risk on site. It was more accurate than the pooled cohort equations (accuracy 0.79 versus 0.59).

## Contribution

A two-layer machine learning model is developed and validated on real clinical data for on-site CHD prediction, and it is more accurate than the pooled cohort equations.

## Key findings

- The two-layer machine learning model achieved higher accuracy (0.79) than the pooled cohort equations (0.59) for CHD prediction.
- Age, diabetes, and hypertension were among the most important predictors of CHD risk.
- A simplified model with 20 predictors achieved 0.73 accuracy, balancing practicality and performance.

## Abstract

The risk of coronary heart disease (CHD) over a given number of years can be assessed with scores from models such as the pooled cohort equations (PCEs) and the Framingham Risk Score. However, few studies have addressed quantitative on-site estimation of CHD risk, with a calculated score serving as an auxiliary diagnostic. Researchers have recently introduced new technologies, such as machine learning, as effective CHD risk prediction models, but these models still need to be validated on real clinical data before their use is promoted in clinical settings.

This study aims to predict CHD risk in a high-risk population using only clinical data: vital traits, laboratory measurements, diagnoses, medical device tests, and medications. The prediction model can serve as an on-site quantitative indicator of CHD risk for potential patients before diagnosis by coronary arteriography.

This retrospective study uses a hospital-based cohort (The Second Affiliated Hospital of Guangxi Medical University) comprising 20,821 patients with CHD and 9,796 controls from 2017 to 2024. A two-layer machine learning model (TLML) is built on the predictions of a random forest and a gradient boosting decision tree to combine the merits of both models. The models were trained and validated on the clinical data in the cohort.
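The two-layer design described above resembles stacked generalization, and a minimal sketch of that idea follows. It assumes scikit-learn is available, uses synthetic data in place of the clinical cohort, and picks logistic regression as the second-layer learner. The abstract does not state which second-layer model, hyperparameters, or validation scheme the authors used, so all of those choices here are illustrative assumptions.

```python
# Stacking sketch: random forest + GBDT feeding a second-layer model.
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary data (assumption: a stand-in, not the paper's cohort).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Layer 1: the two base learners named in the abstract.
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("gbdt", GradientBoostingClassifier(n_estimators=50, random_state=0)),
]

# Layer 2: a meta-learner trained on out-of-fold base predictions.
# Logistic regression is an assumption; the abstract does not name it.
tlml = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(),
    cv=5,
)
tlml.fit(X_train, y_train)

test_accuracy = tlml.score(X_test, y_test)
print(f"held-out accuracy on synthetic data: {test_accuracy:.2f}")
```

Training the second layer on out-of-fold predictions (`cv=5`) keeps it from simply memorizing base-learner outputs on data those learners have already seen.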

The TLML achieved good accuracy (0.79, 95% CI 0.79–0.80), sensitivity (0.79, 95% CI 0.79–0.80) and specificity (0.79, 95% CI 0.79–0.79) for on-site CHD prediction, markedly better than the PCEs (accuracy = 0.59, sensitivity = 0.58, specificity = 0.60). Predictor importance analysis shows that age, diabetes, antihypertensive medications, total bilirubin, hypertension, obstructive sleep apnea-hypopnea syndrome, red cell count, hemoglobin, cystatin C, retinol-binding protein, gender and low-density lipoprotein cholesterol level are the most important variables for on-site CHD prediction. Previous studies have reported an association between each of these features and CHD. A reduced-complexity model that uses only 20 predictors is also presented to improve practicality; it achieves a prediction accuracy of 0.73.
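For readers less familiar with the reported metrics, the sketch below shows how accuracy, sensitivity, and specificity follow from confusion-matrix counts, each with a 95% confidence interval. The counts are hypothetical and chosen only so that each metric equals 0.79. The normal-approximation (Wald) interval is an assumption, since the abstract does not say how the authors computed their intervals, and the function names are illustrative.

```python
import math


def proportion_ci(successes, total, z=1.96):
    """Return (estimate, lower, upper) using a Wald interval clipped to [0, 1]."""
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


def binary_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    return {
        # Correct predictions over all cases.
        "accuracy": proportion_ci(tp + tn, tp + fn + tn + fp),
        # True positives over all actual CHD cases.
        "sensitivity": proportion_ci(tp, tp + fn),
        # True negatives over all actual controls.
        "specificity": proportion_ci(tn, tn + fp),
    }


# Hypothetical counts (not taken from the paper).
metrics = binary_metrics(tp=790, fn=210, tn=395, fp=105)
for name, (estimate, lower, upper) in metrics.items():
    print(f"{name}: {estimate:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

The interval narrows as the number of cases grows, which is consistent with the tight intervals reported for a cohort of this size.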

The machine learning models presented in this study have the potential to become auxiliary on-site diagnostic tools for CHD because they predict accurately and all the required prediction variables are readily available.

## Linked entities

- **Diseases:** coronary heart disease (MONDO:0005010), diabetes (MONDO:0005015)

## Full-text entities

- **Genes:** CST3 (cystatin C) [NCBI Gene 1471] {aka ADLDWA, ARMD11, HEL-S-2}
- **Diseases:** obstructive sleep apnea-hypopnea syndrome (MESH:D020181), hypertension (MESH:D006973), diabetes (MESH:D003920), CHD (MESH:D003327)
- **Chemicals:** bilirubin (MESH:D001663)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12614581/full.md

---
Source: https://tomesphere.com/paper/PMC12614581