# Explainable machine learning for stroke risk prediction: a comparative study with SHAP-based interpretation

**Authors:** Xiaoyu Tang, Min Tang, Wu Liu, Shaoyang Cui

PMC · DOI: 10.3389/fneur.2025.1716984 · Frontiers in Neurology · 2026-01-12

## TL;DR

This paper compares machine learning models for predicting stroke risk and uses SHAP to explain their predictions, identifying hypertension, average blood glucose level, and age as the key factors.

## Contribution

The study compares logistic regression, random forest, XGBoost, CatBoost, an MLP neural network, and ensemble models for stroke risk prediction. It pairs the performance comparison with SHAP-based feature interpretation, confusion matrices, PR curves, and training-time measurements.

## Key findings

- The ensemble model and the MLP neural network outperformed traditional algorithms in overall predictive ability, with the MLP performing particularly well on recall.
- SHAP analysis identified hypertension, average blood glucose level, and age as key predictors of stroke risk.
- Confusion matrices and PR curves showed that the models differed in how well they recognized the positive class (stroke patients).
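
To make the positive-class evaluation in the last finding concrete, here is a minimal sketch of how confusion-matrix counts and precision-recall points are derived from predicted scores. The labels and scores are invented toy values, not data from the paper, and the helper names (`confusion_counts`, `precision_recall`, `pr_curve`) are hypothetical.

```python
from typing import List, Tuple


def confusion_counts(y_true: List[int], y_pred: List[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn), treating label 1 (stroke) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall(y_true: List[int], scores: List[float], threshold: float) -> Tuple[float, float]:
    """Precision and recall for the positive class at one decision threshold."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    # Convention (an assumption): precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def pr_curve(y_true: List[int], scores: List[float]) -> List[Tuple[float, float, float]]:
    """(threshold, precision, recall) for each distinct score, highest threshold first."""
    thresholds = sorted(set(scores), reverse=True)
    return [(t, *precision_recall(y_true, scores, t)) for t in thresholds]


# Toy, imbalanced example: 8 negatives and 2 positives (values are illustrative only).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.45, 0.90]

y_pred = [1 if s >= 0.5 else 0 for s in scores]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision, recall = precision_recall(y_true, scores, 0.5)
curve = pr_curve(y_true, scores)
```

In this toy case accuracy looks acceptable while one of the two positives is missed, which is why positive-class metrics such as recall and the PR curve are more informative than accuracy on imbalanced data.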

## Abstract

Stroke is one of the leading causes of death and disability worldwide, making early screening and risk prediction crucial. Traditional methods have limitations in handling nonlinear relationships between variables, class imbalance, and model interpretability.

Logistic regression (LR), random forest (RF), extreme gradient boosting (XGBoost), categorical boosting (CatBoost), multi-layer perceptron (MLP) neural network, and ensemble models were constructed and compared. Their performance in stroke risk prediction was systematically evaluated, and feature contributions were interpreted using SHapley Additive exPlanations (SHAP). Confusion matrices and Precision-Recall (PR) curves were used to compare the differences in recognition of the positive class (stroke patients) among the models, and training time was calculated to quantify resource consumption.

The ensemble model and neural network demonstrated stronger overall predictive ability than traditional algorithms, with the MLP performing particularly well in terms of recall. SHAP results revealed that “hypertension,” “average blood glucose level,” and “age” were key influencing factors. Confusion matrices and PR curves indicated differences in positive classification among the models. Training time analysis provided a basis for resource assessment for subsequent deployment.

Machine learning methods have advantages in stroke risk prediction. Incorporating interpretability analysis can enhance the clinical credibility of the models, providing a data and methodological reference for stroke risk stratification management and early warning.
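
As an illustration of the idea behind the SHAP attribution described in the abstract, the sketch below computes exact Shapley values for a single prediction by enumerating feature subsets. It rests on several assumptions: `toy_risk` is an invented scoring function with made-up coefficients, "missing" features are replaced by a single baseline patient, and only three features are used so that exhaustive enumeration stays cheap. It is not the paper's model, and the SHAP library uses background data and faster estimators rather than this brute-force approach.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict

Features = Dict[str, float]


def shapley_values(model: Callable[[Features], float],
                   instance: Features,
                   baseline: Features) -> Dict[str, float]:
    """Exact Shapley values of `model` at `instance`, relative to `baseline`."""
    names = list(instance)
    n = len(names)

    def value(subset: set) -> float:
        # Features in `subset` take the instance value, the rest the baseline value.
        mixed = {f: (instance[f] if f in subset else baseline[f]) for f in names}
        return model(mixed)

    phi = {}
    for f in names:
        others = [g for g in names if g != f]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size: |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


def toy_risk(x: Features) -> float:
    """Hypothetical risk score with an age-hypertension interaction (invented weights)."""
    return (0.03 * x["age"]
            + 0.01 * x["avg_glucose_level"]
            + 0.5 * x["hypertension"]
            + 0.01 * x["age"] * x["hypertension"])


# Illustrative patient and reference point; neither comes from the paper's dataset.
patient = {"age": 70.0, "avg_glucose_level": 180.0, "hypertension": 1.0}
reference = {"age": 40.0, "avg_glucose_level": 100.0, "hypertension": 0.0}

phi = shapley_values(toy_risk, patient, reference)
gap = toy_risk(patient) - toy_risk(reference)
```

The contributions in `phi` sum to `gap`, the difference between the patient's score and the reference score. This additivity property is what allows a SHAP-style analysis to split one prediction into per-feature contributions.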

## Linked entities

- **Diseases:** stroke (MONDO:0005098)

## Full-text entities

- **Diseases:** hypertension (MESH:D006973), Stroke (MESH:D020521), death (MESH:D003643)
- **Chemicals:** blood glucose (MESH:D001786)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12832496/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12832496/full.md

## References

44 references, with the full list in the complete paper: https://tomesphere.com/paper/PMC12832496/full.md

---
Source: https://tomesphere.com/paper/PMC12832496