# Machine learning and Mendelian randomization identify key lifestyle factors in coronary heart disease: A NHANES-Based study

**Authors:** Yang yang Cui, Yonghong Zhang, Lang Zeng, Shikang Li, Xue Mei, Xiangmei Yang, Peng Zhou, Lijuan Xiong, Yijuan Huang, Jing Luo, Fenglin Wu, Rongchuan Yue

PMC · DOI: 10.1016/j.ijcrp.2025.200536 · International Journal of Cardiology. Cardiovascular Risk and Prevention · 2025-10-26

## TL;DR

This study uses machine learning and genetic evidence to identify lifestyle factors that both predict and cause coronary heart disease.

## Contribution

Combines machine learning with Mendelian randomization to identify causal lifestyle factors for coronary heart disease.

## Key findings

- SVM model achieved 83.4% accuracy and 0.909 AUC for CHD prediction.
- Mendelian randomization confirmed causal links for BMI, cholesterol intake, sleep, blood pressure, and smoking.
- Recursive feature elimination identified age, sex, BMI, and other lifestyle factors as top predictors.

## Abstract

This study aims to bridge the gap between predictive modeling and causal inference by utilizing lifestyle data from the National Health and Nutrition Examination Survey (NHANES) database to compare the predictive performance of multiple machine learning models for coronary heart disease (CHD). By incorporating Mendelian randomization, the study seeks to validate and identify the lifestyle variables with both predictive power and causal impact on CHD.

We extracted variables related to demographic characteristics and lifestyle from the NHANES database (2013–2018; n = 29,400). Recursive feature elimination (RFE) was employed to rank variable importance and determine the optimal feature subset. Subsequently, eight machine learning models were developed for CHD prediction: Support Vector Machine (SVM), Neural Network (NN), Naive Bayes (NB), Extreme Gradient Boosting (XGBoost), Multilayer Perceptron (MLP), Generalized Linear Model (GLM), Adaptive Boosting (AdaBoost), and Decision Tree (DT). Model performance was evaluated using accuracy, precision, sensitivity, specificity, recall, F1-score, and the Receiver Operating Characteristic (ROC) curve, and variable contributions were visualized using Shapley Additive Explanations (SHAP). Additionally, Mendelian randomization (MR) was applied to distinguish associative from causal relationships, validating the top predictors with genetic instruments derived from Genome-Wide Association Studies (GWAS).
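The evaluation metrics named above can be illustrated on a toy example. The following is a minimal sketch, not the authors' pipeline: the labels, scores, and the 0.5 decision threshold are hypothetical, and the function names `classification_metrics` and `roc_auc` are introduced here for illustration only. It computes the confusion-matrix metrics and the AUC as the probability that a randomly chosen CHD case receives a higher score than a randomly chosen non-case.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Confusion-matrix metrics for binary labels coded 1 = CHD, 0 = no CHD."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num: int, den: int) -> float:
        # Return 0.0 instead of dividing by zero when a class is absent.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    sensitivity = ratio(tp, tp + fn)  # sensitivity and recall are the same quantity
    specificity = ratio(tn, tn + fp)
    f1_den = precision + sensitivity
    f1 = 2 * precision * sensitivity / f1_den if f1_den else 0.0
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC via pairwise comparison of positive and negative scores (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and model scores, purely for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics)
print(f"AUC = {auc:.3f}")
```

The pairwise AUC is quadratic in sample size, which is fine for a toy example; at the scale of the NHANES cohort, a rank-based routine from an established library would be the more practical choice.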

RFE identified age, sex, fasting blood glucose, body mass index (BMI), total cholesterol (TC) intake, sleep duration, diastolic blood pressure, and smoking as the most significant predictors of CHD. Among the models, SVM outperformed DT, AdaBoost, XGBoost, NN, MLP, NB, and GLM. The SVM model achieved the highest performance with an accuracy of 83.4% and an AUC value of 0.909, demonstrating clinically actionable predictive power. MR confirmed causal associations for five variables: BMI (OR: 1.01, P < 0.001), TC (OR: 1.01, P < 0.001), insomnia (OR: 1.03, P < 0.001), diastolic blood pressure (OR: 1.20, P < 0.001), and smoking (OR: 1.03, P < 0.001), while fasting glucose showed null causality (P > 0.05).
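To show how an MR odds ratio and P value of the kind reported above can arise from GWAS summary statistics, here is a minimal sketch of a fixed-effect inverse-variance weighted (IVW) estimate. This summary does not state which MR estimator the authors used, so IVW is an assumption, chosen because it is a common default. The three SNP effect sizes are made up, and the function names `ivw_estimate` and `summarize` are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


def ivw_estimate(
    beta_exposure: Sequence[float],
    beta_outcome: Sequence[float],
    se_outcome: Sequence[float],
) -> Tuple[float, float]:
    """Fixed-effect IVW estimate of the causal log odds ratio and its standard error."""
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("all inputs must have one entry per SNP")
    if not beta_exposure:
        raise ValueError("at least one SNP is required")

    weighted_sum = 0.0
    weight_total = 0.0
    for bx, by, se_y in zip(beta_exposure, beta_outcome, se_outcome):
        wald_ratio = by / bx          # per-SNP causal estimate
        ratio_se = se_y / abs(bx)     # first-order SE, ignoring uncertainty in bx
        weight = 1.0 / ratio_se ** 2  # inverse-variance weight
        weighted_sum += weight * wald_ratio
        weight_total += weight

    beta = weighted_sum / weight_total
    se = math.sqrt(1.0 / weight_total)
    return beta, se


def summarize(beta: float, se: float) -> Dict[str, float]:
    """Convert a log odds ratio into an OR, a 95% CI, and a two-sided normal P value."""
    z = beta / se
    return {
        "or": math.exp(beta),
        "ci_low": math.exp(beta - 1.96 * se),
        "ci_high": math.exp(beta + 1.96 * se),
        "p_value": math.erfc(abs(z) / math.sqrt(2)),
    }


# Hypothetical summary statistics for three SNPs (not taken from the paper).
snp_beta_exposure = [0.10, 0.20, 0.05]  # SNP effect on the exposure
snp_beta_outcome = [0.02, 0.04, 0.01]   # SNP effect on CHD (log odds)
snp_se_outcome = [0.01, 0.01, 0.01]     # SE of the SNP effect on CHD

beta, se = ivw_estimate(snp_beta_exposure, snp_beta_outcome, snp_se_outcome)
result = summarize(beta, se)
print(f"OR = {result['or']:.3f} "
      f"(95% CI {result['ci_low']:.3f} to {result['ci_high']:.3f}), "
      f"P = {result['p_value']:.2e}")
```

In this toy input every SNP implies the same Wald ratio, so the pooled estimate equals that ratio. Real analyses typically add sensitivity checks for pleiotropy, which this sketch omits.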

The SVM machine learning model, based on NHANES data, enables faster and more efficient prediction of CHD. The study identified age, sex, BMI, TC intake, sleep duration, diastolic blood pressure, and smoking as the lifestyle variables with the greatest impact on CHD. This dual approach advances precision prevention by combining predictive accuracy with genetic evidence.

- This study integrates machine learning and Mendelian randomization to improve coronary heart disease (CHD) risk prediction and causal inference using NHANES lifestyle data.
- Mendelian randomization confirmed causal effects for five lifestyle factors, strengthening the evidence beyond association for BMI, cholesterol intake, sleep duration, diastolic blood pressure, and smoking.
- This integrated approach enhances precision prevention of CHD by combining robust predictive modeling with genetic validation of causal risk factors.

## Linked entities

- **Diseases:** coronary heart disease (MONDO:0005010)

## Full-text entities

- **Diseases:** CHD (MESH:D003327), insomnia (MESH:D007319)
- **Chemicals:** cholesterol (MESH:D002784), glucose (MESH:D005947), TC (-)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12637256/full.md

## Figures

10 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12637256/full.md

## References

50 references — full list in the complete paper: https://tomesphere.com/paper/PMC12637256/full.md

---
Source: https://tomesphere.com/paper/PMC12637256