# Machine Learning Framework for HbA1c Prediction: Data Enrichment, Cost Optimization, and Interpretability Through Stratified Regression and Multi-Stage Feature Selection

**Authors:** Mohamed Ezz, Majed Abdullah Alrowaily, Menwa Alshammeri, Alshaimaa A. Tantawy, Azzah Allahim, Ayman Mohamed Mostafa

PMC · DOI: 10.3390/diagnostics16040607 · 2026-02-19

## TL;DR

This paper presents a machine learning model that predicts continuous HbA1c levels from 40 routinely collected clinical, biochemical, and demographic features (selected from 224 candidates), offering a cost-effective and interpretable solution for large-scale health assessments.

## Contribution

The study introduces a unified framework for continuous HbA1c prediction that integrates cost-efficient feature selection, stratified regression, and model explainability.

## Key findings

- The optimal model achieved R² = 0.7161 using only 40 selected features from 224 candidates.
- Interpretability analysis showed clinically coherent relationships aligned with physiological expectations.
- The framework reduces feature dependency and enables cost-efficient HbA1c estimation in resource-limited settings.
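
For readers less familiar with the reported scores, the following is a minimal sketch of how the coefficient of determination (R²) and the error metrics named in the abstract (MAE, MSE, MAPE) are conventionally computed. The function name `regression_metrics` and the sample values are illustrative assumptions, not taken from the paper.

```python
def regression_metrics(y_true, y_pred):
    """Return R², MAE, MSE and MAPE (in percent) for paired observations."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when a true value is zero")

    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mape = 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n

    # R² compares the residual sum of squares with the variance around the mean.
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined when all true values are identical")
    r2 = 1.0 - ss_res / ss_tot

    return {"r2": r2, "mae": mae, "mse": mse, "mape": mape}


# Made-up HbA1c values (in %) purely to exercise the function.
example_true = [5.0, 6.0, 7.0, 8.0]
example_pred = [5.5, 6.0, 6.5, 8.0]
metrics = regression_metrics(example_true, example_pred)
print({name: round(value, 4) for name, value in metrics.items()})
# MAE 0.25, MSE 0.125, MAPE about 4.29 %, R² 0.9
```

Because MAPE divides by the true value, it is well defined for HbA1c, which is always positive, but the same function would fail for targets that can be zero.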

## Abstract

Background: Measuring glycated hemoglobin (HbA1c) is essential for assessing long-term glycemic control, yet direct testing remains expensive and underutilized in many large-scale health surveys and resource-constrained settings. This study aims to (i) deliver a highly accurate and interpretable machine learning (ML) model for predicting HbA1c from routinely collected clinical, biochemical, and demographic data, (ii) reduce dependency on extensive laboratory panels by identifying a compact, cost-efficient subset of key predictors, and (iii) establish a transferable, explainable modeling framework applicable across chronic disease biomarkers. Unlike prior HbA1c prediction studies that focus primarily on classification or accuracy-driven models, this work introduces a unified framework for continuous HbA1c regression that jointly integrates cost-oriented feature parsimony, stratified regression validation, and explainability by design.

Methods: We aggregated data from the National Health and Nutrition Examination Survey (NHANES) cycles 2007–2020, encompassing 66,148 records and 224 candidate features. We implemented a two-stage feature selection pipeline: Incremental Correlation Selection (ICS) to narrow the variable space, followed by Recursive Feature Elimination with Cross-Validation (RFECV) to isolate the most informative features. Model interpretability was assessed using partial dependence plots and feature importance analysis.

Results: The optimal model, LightGBMRegressor with most-frequent imputation, achieved R² = 0.7161, MAE = 0.334, MSE = 0.304, and MAPE = 5.56%, while using only 40 selected features. Interpretability analysis revealed clinically coherent relationships that align with physiological expectations.

Discussion: The proposed framework maintains robust predictive performance while substantially reducing the number of required input features, enabling cost-efficient HbA1c estimation together with transparent, physiologically coherent model insights. It consolidates continuous HbA1c prediction, cost-aware feature selection, stratified evaluation, and explainability within a single pipeline.

Conclusions: This study advances beyond existing approaches and offers a practical blueprint for scalable biomarker estimation. Its explainable, efficient, and generalizable design positions it as a strong candidate for clinical decision-support and population-health applications.
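
The two-stage selection described in the Methods can be illustrated with a rough sketch. Everything here is an assumption made for illustration: the data are synthetic (not NHANES), the first stage is a plain top-k filter on absolute Pearson correlation standing in for ICS (whose exact procedure the abstract does not spell out), scikit-learn's `GradientBoostingRegressor` stands in for the LightGBM regressor, and stratifying the split on target quartiles is only one possible reading of "stratified regression validation".

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import RFECV
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in table: 600 rows, 30 candidate features, of which only
# the first 5 drive the target. Values are rounded so a mode exists.
n_rows, n_features = 600, 30
X = np.round(rng.normal(size=(n_rows, n_features)), 1)
weights = np.array([0.6, -0.4, 0.3, 0.25, -0.2])
y = 5.5 + X[:, :5] @ weights + rng.normal(scale=0.2, size=n_rows)

# Blank out roughly 5% of entries to mimic missing survey values.
X[rng.random(X.shape) < 0.05] = np.nan

# Assumed stratification scheme: split so each target quartile is represented.
quartile_bin = np.digitize(y, np.quantile(y, [0.25, 0.5, 0.75]))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=quartile_bin, random_state=0
)

# Most-frequent imputation, fitted on the training split only.
imputer = SimpleImputer(strategy="most_frequent")
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)


def top_k_by_correlation(features, target, k):
    """Stage 1 stand-in: keep the k columns with the largest |Pearson r|."""
    corr = np.array(
        [abs(np.corrcoef(features[:, j], target)[0, 1]) for j in range(features.shape[1])]
    )
    return np.sort(np.argsort(corr)[::-1][:k])


stage1_idx = top_k_by_correlation(X_train_imp, y_train, k=15)

# Stage 2: recursive feature elimination with cross-validation.
rfecv = RFECV(
    estimator=GradientBoostingRegressor(n_estimators=100, random_state=0),
    step=1,
    cv=3,
    scoring="r2",
    min_features_to_select=3,
)
rfecv.fit(X_train_imp[:, stage1_idx], y_train)
selected = stage1_idx[rfecv.support_]

# RFECV refits the estimator on the surviving features, so it can predict directly.
y_pred = rfecv.predict(X_test_imp[:, stage1_idx])
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

print("features after stage 1:", len(stage1_idx))
print("features after stage 2:", len(selected))
print("held-out R2:", round(r2, 3), "MAE:", round(mae, 3))
```

The cheap correlation filter runs first so that the costlier recursive elimination only has to refit the model on a reduced candidate set, which mirrors the narrowing-then-refining order given in the abstract.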

## Full-text entities

- **Genes:** SHROOM4 (shroom family member 4) [NCBI Gene 57477] {aka MRXSSDS, SHAP, shrm4}, PDP1 (pyruvate dehydrogenase phosphatase catalytic subunit 1) [NCBI Gene 54704] {aka PDH, PDP, PDPC, PDPC 1, PPM2A, PPM2C}, ALB (albumin) [NCBI Gene 213] {aka FDAHT, HSA, PRO0883, PRO0903, PRO1341}
- **Diseases:** obese (MESH:D009765), overweight (MESH:D050177), metabolic syndrome (MESH:D024821), injury to (MESH:D014947), inflammation (MESH:D007249), diabetes (MESH:D003920), prediabetes (MESH:D011236), adiposity (MESH:D018205), type 1 or type 2 diabetes (MESH:D003924), hypertension (MESH:D006973), heart attack (MESH:D009203), cardiovascular disease (MESH:D002318), insulin-resistance (MESH:D007333)
- **Chemicals:** blood glucose (MESH:D001786), insulin (MESH:D007328), lipid (MESH:D008055), Glucose (MESH:D005947), creatinine (MESH:D003404)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

13 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12939838/full.md

---
Source: https://tomesphere.com/paper/PMC12939838