# Evaluating Machine Learning Models for Classifying Diabetes Using Demographic, Clinical, Lifestyle, Anthropometric, and Environmental Exposure Factors

**Authors:** Rifa Tasnia, Emmanuel Obeng-Gyasi

PMC · DOI: 10.3390/toxics14010076 · Toxics · 2026-01-14

## TL;DR

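This study uses machine learning to classify physician-diagnosed diabetes in NHANES 2017–2018 adults by combining demographic, clinical, lifestyle, anthropometric, and environmental exposure features. Of the eight classifiers compared, Random Forest and XGBoost performed best, with ROC AUC values of 0.891 and 0.885 after imputation and oversampling.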

## Contribution

The study integrates environmental exposure biomarkers with demographic, anthropometric, clinical, and behavioral features for machine learning-based diabetes classification, whereas most current models rely on traditional biomedical indicators alone.

## Key findings

- Random Forest and XGBoost achieved the highest ROC AUC values (0.891 and 0.885, respectively) in stratified ten-fold cross-validation on the dataset after MICE imputation and SMOTE oversampling.
- Age, household income, and waist circumference were the most important features for diabetes classification.
- In the independent 80/20 hold-out evaluation, XGBoost achieved the highest accuracy and F1-score, while Random Forest had the highest sensitivity.
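
The findings above are reported in terms of accuracy, sensitivity, F1-score, and ROC AUC. As a reminder of how these relate to each other, here is a minimal pure-Python sketch. The function names and the toy labels and scores in the usage example are illustrative assumptions, not data or code from the paper, and the study itself presumably relied on a standard library implementation.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of all cases classified correctly."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


def sensitivity(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of true positives that were detected (also called recall)."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and sensitivity."""
    tp, fp, _, fn = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data (hypothetical): 3 positive and 5 negative cases with predicted scores.
    y_true = [1, 1, 1, 0, 0, 0, 0, 0]
    scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.4, 0.05]
    y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 threshold is an assumption
    print(accuracy(y_true, y_pred), sensitivity(y_true, y_pred),
          f1_score(y_true, y_pred), roc_auc(y_true, scores))
```

Accuracy, sensitivity, and F1-score depend on the decision threshold, whereas ROC AUC is computed from the ranking of scores alone. This is one reason different models can lead on different metrics.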

## Abstract

Diabetes develops through a mix of clinical, metabolic, lifestyle, demographic, and environmental factors. Most current classification models focus on traditional biomedical indicators and do not include environmental exposure biomarkers. In this study, we develop and evaluate a supervised machine learning classification framework that integrates heterogeneous demographic, anthropometric, clinical, behavioral, and environmental exposure features to classify physician-diagnosed diabetes using data from the National Health and Nutrition Examination Survey (NHANES). We analyzed NHANES 2017–2018 data for adults aged ≥18 years, addressed missingness using Multiple Imputation by Chained Equations, and corrected class imbalance via the Synthetic Minority Oversampling Technique. Model performance was evaluated using stratified ten-fold cross-validation across eight supervised classifiers: logistic regression, random forest, XGBoost, support vector machine, multilayer perceptron neural network (artificial neural network), k-nearest neighbors, naïve Bayes, and classification tree. Random Forest and XGBoost performed best on the balanced dataset, with ROC AUC values of 0.891 and 0.885, respectively, after imputation and oversampling. Feature importance analysis indicated that age, household income, and waist circumference contributed most strongly to diabetes classification. To assess out-of-sample generalization, we conducted an independent 80/20 hold-out evaluation. XGBoost achieved the highest overall accuracy and F1-score, whereas random forest attained the greatest sensitivity, demonstrating stable performance beyond cross-validation. These results indicate that incorporating environmental exposure biomarkers alongside clinical and metabolic features yields improved classification performance for physician-diagnosed diabetes. 
The findings support the inclusion of chemical exposure variables in population-level diabetes classification and underscore the value of integrating heterogeneous feature sets in machine learning-based risk stratification.
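
Two of the steps named in the abstract, SMOTE oversampling and stratified ten-fold splitting, can be illustrated with a short pure-Python sketch. This is a simplified illustration under stated assumptions, not the authors' implementation: the function names, default parameters, and toy data are hypothetical, and the sketch does not reproduce the order in which the study applied imputation, oversampling, and cross-validation.

```python
import math
import random
from typing import List, Sequence

Vector = List[float]


def smote_oversample(minority: Sequence[Vector], n_new: int,
                     k: int = 5, seed: int = 0) -> List[Vector]:
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic: List[Vector] = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base sample.
        others = sorted((j for j in range(len(minority)) if j != i),
                        key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def stratified_folds(labels: Sequence[int], n_folds: int = 10,
                     seed: int = 0) -> List[List[int]]:
    """Split sample indices into n_folds groups that keep class proportions
    roughly equal, by shuffling each class and dealing it out round-robin."""
    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)
    return folds


if __name__ == "__main__":
    # Toy data (hypothetical): 20 positive and 80 negative labels.
    labels = [1] * 20 + [0] * 80
    folds = stratified_folds(labels, n_folds=10)
    print([sum(labels[i] for i in fold) for fold in folds])  # positives per fold

    minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(smote_oversample(minority, n_new=3, k=2))
```

Each synthetic point lies on the line segment between two existing minority samples, so oversampling adds no information beyond the original minority cases. In practice, a maintained library implementation would be the more robust choice.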

## Linked entities

- **Diseases:** diabetes (MONDO:0005015)

## Full-text entities

- **Diseases:** Diabetes (MESH:D003920)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12846289/full.md

## Figures

10 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12846289/full.md

## References

44 references — full list in the complete paper: https://tomesphere.com/paper/PMC12846289/full.md

---
Source: https://tomesphere.com/paper/PMC12846289