# Construction and application of a model for predicting athletes’ injury risk based on machine learning

**Authors:** Zhenhua Xu, WeiYa Sun, Haonan Qian, MengJin Yao

PMC · DOI: 10.1186/s12911-025-03331-x · BMC Medical Informatics and Decision Making · 2025-12-25

## TL;DR

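This paper trains machine learning models to predict injury risk in 300 male professional football players from daily training, recovery, and wellness data. A random forest performed best (accuracy 85.6%, AUC 90.5%), and SHAP and LIME were used to explain individual predictions.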

## Contribution

The paper pairs an ensemble model (random forest) with SHAP and LIME explanations for injury prediction, evaluated with 10-fold cross-validation on two seasons of data from professional football players.

## Key findings

- Random forests outperformed logistic regression and decision trees, reaching 85.6 ± 2.1% accuracy and an AUC of 90.5 ± 1.6% under 10-fold cross-validation.
- Prior injury, training intensity, and recovery time were identified as key predictors.
- Explainable AI techniques (SHAP and LIME) attributed each prediction to its input features, supporting individualized risk assessment.
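
To make the idea behind SHAP-style attributions concrete, here is a minimal sketch that computes exact Shapley values for a toy risk function. Everything in it is an assumption for illustration: the function `toy_risk`, the feature names, and all numeric weights are hypothetical and are not taken from the paper, whose model is a random forest explained with the SHAP and LIME tooling. The sketch only shows how a prediction can be split into per-feature contributions that sum to the difference between the full prediction and the baseline.

```python
from itertools import permutations


def shapley_values(value_fn, features):
    """Exact Shapley values: average each feature's marginal
    contribution over every possible ordering of the features."""
    contributions = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = set()
        for f in order:
            before = value_fn(frozenset(present))
            present.add(f)
            after = value_fn(frozenset(present))
            contributions[f] += after - before
    return {f: total / len(orderings) for f, total in contributions.items()}


def toy_risk(present):
    """Hypothetical injury-risk score; all weights are invented."""
    risk = 0.10  # baseline risk with no risk factors present
    if "prior_injury" in present:
        risk += 0.20
    if "training_intensity" in present:
        risk += 0.10
    if "recovery_time" in present:
        risk += 0.05
    # Invented interaction: high intensity combined with short recovery.
    if "training_intensity" in present and "recovery_time" in present:
        risk += 0.06
    return risk


features = ["prior_injury", "training_intensity", "recovery_time"]
phi = shapley_values(toy_risk, features)
for name, value in phi.items():
    print(f"{name}: {value:+.3f}")
```

In this toy setup the interaction term is shared equally between the two features involved, and the contributions add up to the gap between the full score and the baseline. Exact enumeration scales factorially with the number of features, which is why practical tools rely on approximations or model-specific algorithms.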

## Abstract

Accurate prediction of sports-related injuries is essential for optimizing athlete health and performance. This study evaluated machine learning (ML) models for injury risk in 300 male professional football players (ages 18–28) monitored over two competitive seasons (2021–2022). Injuries were defined as musculoskeletal conditions causing at least one missed training session or match, confirmed via ICD-10 diagnoses. Daily data on training workload, recovery, wellness, heart-rate variability, cumulative minutes played, and injury history were collected. Features were preprocessed with normalization, one-hot encoding, and selected via LASSO regression and recursive feature elimination. Missing data (< 3%) were imputed using multiple imputation by chained equations, and class imbalance was addressed with SMOTE and weighting. Logistic regression, decision tree, and random forest models were trained using 10-fold cross-validation and evaluated for accuracy, precision, recall, F1-score, and AUC. Random forests outperformed other models, achieving accuracy 85.6 ± 2.1%, precision 82.1 ± 1.9%, recall 80.3 ± 2.4%, F1-score 81.2 ± 2.2%, and AUC 90.5 ± 1.6%. Explainable AI techniques, including SHAP and LIME, identified prior injury, training intensity, and recovery time as the strongest predictors, enabling individualized risk assessment. These findings demonstrate that ensemble ML methods provide robust, interpretable, and actionable insights for injury prevention, supporting data-driven strategies to optimize training and reduce injury incidence. Future work should expand validation across multiple sports and integrate additional physiological and genetic factors to enhance predictive accuracy and generalizability.
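
The threshold-based metrics reported in the abstract can be sketched from confusion-matrix counts as follows. The counts below are hypothetical and are not the paper's data; the function names are illustrative. The last step checks that the reported mean precision and recall are consistent with the reported F1-score, with the caveat that an average of per-fold F1 values need not equal the harmonic mean of the averaged precision and recall. AUC is not covered here because it requires predicted scores rather than counts.

```python
def harmonic_mean(p, r):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": harmonic_mean(precision, recall),
    }


# Hypothetical counts for one validation fold (not from the paper).
metrics = classification_metrics(tp=40, fp=10, fn=10, tn=140)
print(metrics)

# Consistency check on the reported means (precision 82.1%, recall 80.3%).
reported_f1 = harmonic_mean(82.1, 80.3)
print(f"F1 implied by reported precision and recall: {reported_f1:.2f}%")
```

The implied value rounds to the 81.2% F1-score given in the abstract.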

## Full-text entities

- **Diseases:** musculoskeletal conditions (MESH:D009140), Injuries (MESH:D014947)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12849558/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12849558/full.md

## References

2 references — full list in the complete paper: https://tomesphere.com/paper/PMC12849558/full.md

---
Source: https://tomesphere.com/paper/PMC12849558