# Improving Type 2 Diabetes Prediction: Comparative Evaluation of Machine Learning Classifiers Using Balanced Data from the AWI-Gen Cohort

**Authors:** Richmond Balinia Adda

PMC · DOI: 10.21203/rs.3.rs-8019155/v1 · Research Square · 2025-11-04

## TL;DR

This study evaluates eight machine learning classifiers for predicting type 2 diabetes in a cohort of 2,010 participants from northern Ghana. Using SMOTE-balanced data, it finds that models built on routinely collected clinical and lifestyle data can be effective in low-resource settings.

## Contribution

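The study presents an ML framework for T2DM prediction in an African cohort that explicitly addresses two methodological problems: target leakage, handled by excluding diagnostic biomarkers such as glucose, and class imbalance, handled with SMOTE.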

## Key findings

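- An optimised XGBoost model achieved an AUC of 0.845 (95% CI: 0.812–0.878) and a sensitivity of 78.2% on unseen data.
- Including glucose as a predictor increased performance by 11.5%; the authors exclude it to avoid target leakage and biased evaluation.
- Models using only anthropometric and lifestyle variables reached an AUC of 0.783, with waist circumference, physical activity, and BMI the most stable predictors.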

## Abstract

Type 2 diabetes mellitus (T2DM) is an escalating public health concern across Africa, but regionally tailored predictive models are scarce. Advances in machine learning (ML) offer potential for early identification, though previous research has been constrained by methodological issues such as data leakage, class imbalance, and overfitting, limiting clinical deployment, especially in digital health contexts.

This study analysed data from 2,010 participants in the H3Africa AWI-Gen cohort in northern Ghana to develop and evaluate ML-based prediction models tailored to African settings. Rigorous preprocessing steps, including handling class imbalance with SMOTE and excluding diagnostic biomarkers prone to target leakage, were applied. Eight ML classifiers underwent robust Bayesian hyperparameter optimisation. Model performance was assessed via stratified 5-fold cross-validation and confirmed through extensive sensitivity and calibration analyses.
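The combination of SMOTE and stratified 5-fold cross-validation described above can be sketched as follows. The abstract does not state where in the pipeline oversampling was applied, so this sketch assumes the leakage-safe ordering: split first, then oversample only the training portion of each fold. The function names (`stratified_folds`, `smote_like`, `balanced_training_folds`) are hypothetical, the code assumes binary labels, and the interpolation is a simplified standard-library stand-in for SMOTE rather than the authors' implementation.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Assign each sample index to one of k folds, preserving class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal indices out round-robin
    return folds

def smote_like(X, y, minority=1, n_neighbors=3, seed=0):
    """Oversample the minority class by interpolating between a minority
    point and one of its nearest minority neighbours (the core SMOTE idea)."""
    rng = random.Random(seed)
    minority_rows = [x for x, label in zip(X, y) if label == minority]
    # Binary labels assumed: majority count minus minority count.
    n_needed = (len(y) - len(minority_rows)) - len(minority_rows)
    if len(minority_rows) < 2 or n_needed <= 0:
        return list(X), list(y)
    synthetic = []
    for _ in range(n_needed):
        base = rng.choice(minority_rows)
        # Nearest minority neighbours by squared Euclidean distance.
        neighbours = sorted(
            (row for row in minority_rows if row is not base),
            key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)),
        )[:n_neighbors]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> other
        synthetic.append([a + t * (b - a) for a, b in zip(base, other)])
    return list(X) + synthetic, list(y) + [minority] * n_needed

def balanced_training_folds(X, y, k=5, seed=0):
    """Yield (X_train, y_train, X_test, y_test) per fold. Only the training
    part is oversampled; the held-out fold keeps its original class ratio."""
    folds = stratified_folds(y, k, seed)
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        X_bal, y_bal = smote_like(
            [X[j] for j in train_idx], [y[j] for j in train_idx], seed=seed + i
        )
        yield X_bal, y_bal, [X[j] for j in test_idx], [y[j] for j in test_idx]
```

Oversampling before splitting would let synthetic points derived from held-out samples enter the training data, which tends to inflate cross-validated performance; keeping the test fold untouched avoids that.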

The optimised XGBoost model yielded an AUC of 0.845 (95% CI: 0.812–0.878) and a sensitivity of 78.2% on unseen data. Including glucose as a predictor increased performance by 11.5%, underscoring the necessity of its exclusion to avoid biased evaluation. Models using only anthropometric and lifestyle variables (AUC = 0.783) demonstrated robust predictive capacity, with waist circumference, physical activity, and BMI standing out as the most stable predictors across analyses.
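To make the reported metric concrete, here is a minimal sketch of how an AUC and an accompanying 95% confidence interval can be computed from held-out labels and predicted scores. The abstract does not say how the interval was derived; the percentile bootstrap used here is an assumption, and `roc_auc` and `bootstrap_auc_ci` are hypothetical helper names written against the Python standard library only.

```python
import random

def roc_auc(y_true, scores):
    """AUC as the probability that a random positive is scored above a
    random negative, counting ties as one half (Mann-Whitney formulation)."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Return (point estimate, lower, upper) using a percentile bootstrap."""
    point = roc_auc(y_true, scores)  # also validates that both classes exist
    rng = random.Random(seed)
    n = len(y_true)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        ys = [y_true[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample lacked one class; draw again
        aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    lower = aucs[int((alpha / 2) * n_boot)]
    upper = aucs[int((1 - alpha / 2) * n_boot) - 1]
    return point, lower, upper

# Toy usage: labels and scores here are illustrative, not study data.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 3 of 4 pairs ranked correctly
```

The pairwise formulation is quadratic in sample size, which is acceptable for a cohort of a few thousand participants but would be replaced by a rank-based computation for larger data.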

Our findings demonstrate that ML models constructed from routinely collected clinical and lifestyle data can attain clinically meaningful diabetes prediction suitable for digital health applications in low-resource African contexts. This study addresses prior methodological gaps and offers a data-driven framework that is both robust and clinically plausible for early T2DM detection, with potential implications for public health policy and digital screening programmes in similar populations.

## Linked entities

- **Diseases:** Type 2 diabetes mellitus (MONDO:0005148), T2DM (MONDO:0005148)

## Full-text entities

- **Diseases:** T2DM (MESH:D003924), diabetes (MESH:D003920)
- **Chemicals:** glucose (MESH:D005947)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12637827/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC12637827/full.md

## References

31 references — full list in the complete paper: https://tomesphere.com/paper/PMC12637827/full.md

---
Source: https://tomesphere.com/paper/PMC12637827