# Machine learning-based prediction of metabolic dysfunction-associated steatotic liver disease using National Health and Nutrition Examination Survey (NHANES) data

**Authors:** Yong Zhang, Xiang Liu, Xingqiang Zhang, Yangfan Fei, Xiaoxu Li, Aleksandra Klisic

PMC · DOI: 10.1371/journal.pone.0335656 · PLOS One · 2025-11-12

## TL;DR

This study uses machine learning to predict metabolic dysfunction-associated steatotic liver disease (MASLD) using health survey data, aiming to improve early diagnosis and intervention.

## Contribution

An XGBoost-based prediction model for MASLD built on NHANES data, reaching an AUC of 0.8740 on the testing set and identifying key risk factors through SHAP analysis.

## Key findings

- XGBoost outperformed other algorithms with an AUC of 0.8740 in predicting MASLD.
- Waist circumference and BMI were identified as pivotal risk factors through SHAP analysis.
- Recursive Feature Elimination selected 12 key features for the model.
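
For readers less familiar with the AUC metric quoted above, the following is a minimal sketch of how it can be computed: the AUC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name `roc_auc` and the example labels and scores are hypothetical and are not taken from the paper.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = MASLD) and model scores, for illustration only.
example_labels = [1, 1, 1, 0, 0, 0]
example_scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
example_auc = roc_auc(example_labels, example_scores)
```

This pairwise form is quadratic in the number of cases and is meant for clarity only; a rank-based implementation would be preferable on a full testing set.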

## Abstract

With the global increase in obesity rates and lifestyle changes, metabolic dysfunction-associated steatotic liver disease (MASLD) has become a prevalent chronic liver disorder, affecting approximately 25% of the global population. This disease can progress to cirrhosis and liver cancer, posing a significant threat to public health. To facilitate early diagnosis and intervention, this study aims to develop an efficient and reliable prediction model for MASLD using machine learning algorithms.

This study initially considered 9,232 participants aged 20 years and older from the 2017–2020 National Health and Nutrition Examination Survey (NHANES). After excluding individuals with frequent alcohol consumption or hepatitis B/C infection, those lacking liver ultrasound examinations, and samples with missing data, a total of 2,460 subjects were ultimately included. The dataset was split into training and testing sets in an 80:20 ratio. Five machine learning algorithms, including XGBoost, Random Forest (RF), and Logistic Regression (LR), were used to build prediction models, and Recursive Feature Elimination (RFE) was employed to identify key predictive factors.
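
The RFE step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: a small hand-written logistic regression stands in for the estimator (the summary does not say which estimator drove RFE), feature importance is taken as the absolute standardized coefficient, and the data are synthetic. All function names are hypothetical. In the paper the equivalent of `n_keep` would be 12.

```python
import math
import random


def standardize(rows):
    """Scale each column to zero mean and unit variance so weights are comparable."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def fit_logistic(rows, labels, steps=200, lr=0.5):
    """Batch gradient descent; returns one weight per column (bias is fitted, not returned)."""
    n, d = len(rows), len(rows[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, y in zip(rows, labels):
            z = bias + sum(w * v for w, v in zip(weights, row))
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
            err = 1.0 / (1.0 + math.exp(-z)) - y
            grad_b += err
            for j, v in enumerate(row):
                grad_w[j] += err * v
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights


def recursive_feature_elimination(rows, labels, n_keep):
    """Refit, drop the feature with the smallest |weight|, repeat until n_keep remain."""
    remaining = list(range(len(rows[0])))
    while len(remaining) > n_keep:
        subset = [[row[j] for j in remaining] for row in rows]
        weights = fit_logistic(standardize(subset), labels)
        weakest = min(range(len(remaining)), key=lambda k: abs(weights[k]))
        del remaining[weakest]
    return remaining


# Synthetic data (assumption): only columns 0 and 1 carry signal, the rest are noise.
rng = random.Random(0)
rows = [[rng.random() for _ in range(5)] for _ in range(300)]
labels = [1 if row[0] + row[1] > 1.0 else 0 for row in rows]
selected = recursive_feature_elimination(rows, labels, n_keep=2)
```

The key design point is that the model is refitted after every removal, so a feature's importance is always judged relative to the features still in play rather than from a single initial ranking.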

Comparison of the five algorithms revealed that XGBoost performed best. Twelve key features were selected through Recursive Feature Elimination (RFE), and the model achieved an AUC of 0.8740 on the testing set, demonstrating excellent predictive accuracy and discriminative ability. SHAP plot analysis of the model showed that waist circumference, BMI, and other factors played a pivotal role in the prediction of MASLD.
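
To illustrate the idea behind the SHAP attributions mentioned above, here is a minimal sketch of exact Shapley values computed by brute-force enumeration of feature coalitions. It is not the tree-specific algorithm that SHAP tooling typically applies to XGBoost models; it assumes that an "absent" feature is replaced by a fixed reference value, and the risk function `toy_risk`, its coefficients, and the example inputs are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley attribution of predict(x) - predict(baseline) to each feature."""
    d = len(x)

    def value(coalition):
        # Features in the coalition take their real value; the rest take the baseline.
        return predict([x[j] if j in coalition else baseline[j] for j in range(d)])

    phis = []
    for i in range(d):
        others = [j for j in range(d) if j != i]
        phi = 0.0
        for size in range(d):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(d - size - 1) / factorial(d)
            for subset in combinations(others, size):
                members = set(subset)
                phi += weight * (value(members | {i}) - value(members))
        phis.append(phi)
    return phis


# Hypothetical risk score with an interaction term (not the paper's model).
def toy_risk(features):
    waist_cm, bmi, age = features
    return 0.02 * waist_cm + 0.05 * bmi + 0.001 * waist_cm * bmi + 0.0 * age


person = [110.0, 32.0, 50.0]      # illustrative individual
reference = [90.0, 25.0, 50.0]    # illustrative reference point
contributions = shapley_values(toy_risk, person, reference)
```

The contributions sum to the difference between the prediction for `person` and the prediction for `reference`, which is the property that makes SHAP plots readable as a decomposition of a single prediction. Enumeration scales exponentially with the number of features, so it is only practical for small illustrations like this one.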

The prediction model developed using the XGBoost algorithm and the 12 selected features demonstrates high efficiency and stability in assessing MASLD risk. This model offers innovative technical solutions and data-driven support for the clinical early identification of high-risk populations, with the potential to optimize and refine MASLD prevention and control strategies.

## Linked entities

- **Diseases:** metabolic dysfunction-associated steatotic liver disease (MONDO:0013209), cirrhosis (MONDO:0005155), liver cancer (MONDO:0002691)

## Full-text entities

- **Diseases:** MASLD (MESH:D008107), hepatitis B/C infection (MESH:D006509), metabolic dysfunction (MESH:D008659), chronic liver disorder (MESH:D058625), cirrhosis (MESH:D005355), liver cancer (MESH:D006528), obesity (MESH:D009765)
- **Chemicals:** alcohol (MESH:D000438)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12611120/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12611120/full.md

## References

33 references — full list in the complete paper: https://tomesphere.com/paper/PMC12611120/full.md

---
Source: https://tomesphere.com/paper/PMC12611120