# Machine learning approaches to predict hip fracture incidence: insights from the CHARLS dataset

**Authors:** Yuexin Li, Yihua Shi

PMC · DOI: 10.3389/fpubh.2025.1624843 · 2026-01-13

## TL;DR

This study uses machine learning to predict hip fracture risk in older adults using data from China, identifying key factors like age and lifestyle.

## Contribution

A novel machine learning model using CHARLS data to predict hip fracture risk with high accuracy and interpretability.

## Key findings

- Random Forest achieved the highest AUC of 0.93 in predicting hip fractures.
- Key predictors include MET, age, cognitive function, and lifestyle factors.
- The model performed well in both internal and external validation cohorts.

## Abstract

Hip fractures are a major health concern among older adults, severely impacting patients’ quality of life and straining healthcare systems. With China’s aging population, their incidence is projected to increase. Developing effective prediction models to identify high-risk individuals is therefore essential for prevention.

This study aimed to develop and validate a reliable, accurate machine learning-based model of hip fracture incidence in order to improve hip fracture risk prediction among community residents.

Data were obtained from the China Health and Retirement Longitudinal Study (CHARLS), encompassing 21,095 individuals aged 45 years and older, of whom 616 reported hip fractures. Baseline data from these participants were used to examine 34 variables, including demographic characteristics, lifestyle, health status, and mental health and cognitive functioning scores. Ten machine learning algorithms, including Random Forest (RF), Adaptive Boosting (AdaBoost), and Decision Tree (DT), were compared to determine the optimal model. Model performance was evaluated by area under the receiver operating characteristic curve (AUC), sensitivity, specificity, and F1 score. SHapley Additive exPlanations (SHAP) were used to identify the key influencing factors, and individual heterogeneity was explained through instance-level analysis.
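
The four evaluation metrics named above can be illustrated with a minimal, standard-library sketch. The labels, scores, and 0.5 decision threshold below are made-up toy values, not CHARLS data, and the helper names are hypothetical; the abstract does not state which software or threshold the authors used.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 = hip fracture."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def sensitivity_specificity_f1(y_true, y_pred):
    """Compute sensitivity (recall), specificity, and F1 from hard predictions."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return sensitivity, specificity, f1


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive case outscores a
    random negative case, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (illustrative values only).
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.05, 0.9]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

sens, spec, f1 = sensitivity_specificity_f1(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} F1={f1:.3f} AUC={auc:.3f}")
```

AUC is computed from the continuous scores and does not depend on the threshold, whereas sensitivity, specificity, and F1 all change with the cut-off chosen. That distinction matters for a rare outcome such as hip fracture.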

After addressing class imbalance with a class-weighting technique, the Random Forest model performed best, with an AUC of 0.93 and high sensitivity, specificity, and F1 score. The key predictors were Metabolic Equivalent of Task (MET), age, falls, drinking, cognitive functioning score, duration of the nap after lunch, residence, total sleep duration, and marital status. The model demonstrated favorable predictive performance in both the internal and external validation cohorts, supporting its selection as the optimal model.
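
To make the class imbalance concrete, the sketch below computes inverse-frequency class weights from the cohort counts reported in this abstract (616 hip fractures among 21,095 participants). The formula is the common "balanced" heuristic, `n_samples / (n_classes * class_count)`, which is what scikit-learn applies for `class_weight="balanced"`. The abstract does not specify the authors' exact weighting scheme, so treat this as an assumption; the function name is hypothetical.

```python
def balanced_class_weights(class_counts):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    Assumed heuristic for illustration; not confirmed as the authors' method.
    """
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Counts taken from the abstract: 21,095 participants, 616 with hip fracture.
counts = {"no_fracture": 21095 - 616, "hip_fracture": 616}
weights = balanced_class_weights(counts)

for label, weight in weights.items():
    print(f"{label}: count={counts[label]}, weight={weight:.3f}")
```

Under this heuristic each fracture case carries roughly 33 times the weight of a non-fracture case, so both classes contribute equally to the training loss in aggregate.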

The machine learning-based predictive model developed in this study demonstrated strong predictive performance for incident hip fractures over a 7-year period. By incorporating readily available, modifiable lifestyle factors, the model serves as a promising tool for identifying individuals at high risk. It provides a scientific basis for developing early intervention strategies, but requires further prospective validation before clinical implementation.

Overview of the machine learning approach to predicting hip fracture incidence with the CHARLS dataset. Predictive models were developed for middle-aged and older people using data from 18,879 individuals in 2020, 616 of whom had hip fractures. ROC curves compare the models, with Random Forest achieving the highest AUC of 0.93. MET, age, and cognitive function scores influence fracture risk, and SHAP values visualize their impact.

## Linked entities

- **Diseases:** hip fracture (MONDO:0005327)

## Full-text entities

- **Diseases:** Hip fractures (MESH:D006620)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12834802/full.md

---
Source: https://tomesphere.com/paper/PMC12834802