Machine Learning Framework for HbA1c Prediction: Data Enrichment, Cost Optimization, and Interpretability Through Stratified Regression and Multi-Stage Feature Selection

Mohamed Ezz; Majed Abdullah Alrowaily; Menwa Alshammeri; Alshaimaa A. Tantawy; Azzah Allahim; Ayman Mohamed Mostafa

PMC · DOI: 10.3390/diagnostics16040607 · February 19, 2026

Open Access

TL;DR

This paper presents a machine learning model that predicts HbA1c levels using a small set of clinical features, offering a cost-effective and interpretable solution for large-scale health assessments.

Contribution

The study introduces a unified framework for continuous HbA1c prediction that integrates cost-efficient feature selection, stratified regression, and model explainability.
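As a rough illustration of what stratified validation can look like for a continuous target, the sketch below assigns samples to cross-validation folds so that each fold covers the full range of HbA1c values. The quantile-binning scheme, the function name `stratified_regression_folds`, and the example values are assumptions made for illustration; this summary does not state how the authors stratify.

```python
import random


def stratified_regression_folds(y, n_folds=5, n_bins=4, seed=0):
    """Return a fold index for each sample, balanced across target quantiles.

    Assumed scheme (not taken from the paper): rank samples by target value,
    cut the ranking into n_bins roughly equal-size bins, then deal the
    members of each bin across folds in shuffled round-robin order.
    """
    if not y:
        raise ValueError("y must be non-empty")
    if n_folds < 2 or n_bins < 1:
        raise ValueError("need n_folds >= 2 and n_bins >= 1")

    rng = random.Random(seed)
    order = sorted(range(len(y)), key=lambda i: y[i])  # indices by target value
    bin_size = -(-len(y) // n_bins)  # ceiling division
    folds = [None] * len(y)
    offset = 0  # running count keeps overall fold sizes balanced

    for start in range(0, len(order), bin_size):
        members = order[start:start + bin_size]
        rng.shuffle(members)  # randomize which fold each bin member lands in
        for k, idx in enumerate(members):
            folds[idx] = (offset + k) % n_folds
        offset += len(members)
    return folds


# Hypothetical HbA1c values (%), invented purely for illustration.
hba1c = [4.5 + 0.25 * i for i in range(20)]
folds = stratified_regression_folds(hba1c, n_folds=5, n_bins=4)
```

With 20 samples, 4 bins, and 5 folds, each fold receives exactly one sample from each quantile bin, so no fold is dominated by low or high HbA1c values.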

Findings

01. The optimal model achieved R² = 0.7161 using only 40 selected features from 224 candidates.

02. Interpretability analysis showed clinically coherent relationships aligned with physiological expectations.

03. The framework reduces feature dependency and enables cost-efficient HbA1c estimation in resource-limited settings.
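For readers unfamiliar with the metric reported in the first finding, the coefficient of determination is defined as 1 - SS_res / SS_tot: the share of variance in the measured values that the predictions account for. The sketch below is a minimal plain-Python version; the function name and the sample values are hypothetical and are not data from the paper.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after prediction.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


# Hypothetical HbA1c values (%), invented purely for illustration.
measured = [5.0, 6.0, 7.0, 8.0]
predicted = [5.5, 5.5, 7.5, 7.5]
score = r2_score(measured, predicted)  # SS_res = 1.0, SS_tot = 5.0
```

A score of 1.0 means perfect prediction and 0.0 means no better than predicting the mean, so the reported 0.7161 corresponds to roughly 72% of the variance in HbA1c being explained.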

Abstract

Background: Measuring glycated hemoglobin (HbA1c) is essential for assessing long-term glycemic control, yet direct testing remains expensive and underutilized in many large-scale health surveys and resource-constrained settings. This study aims to (i) deliver a highly accurate and interpretable ML model for predicting HbA1c from routinely collected clinical, biochemical, and demographic data, (ii) reduce dependency on extensive laboratory panels by identifying a compact, cost-efficient subset of key predictors, and (iii) establish a transferable, explainable modeling framework applicable across chronic disease biomarkers. Unlike prior HbA1c prediction studies that focus primarily on classification or accuracy-driven models, this work introduces a unified framework for continuous HbA1c regression that jointly integrates cost-oriented feature parsimony, stratified regression validation,…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Genes (3)

SHROOM4, PDP1, ALB

Proteins (3)

Species (1)

Homo sapiens (human · species)

Chemicals (6)

blood glucose, insulin, lipid, Glucose, creatinine, ICS

Diseases (13)

obese overweight metabolic syndrome injury to inflammation diabetes prediabetes adiposity type 1 or type 2 diabetes hypertension heart attack cardiovascular disease insulin-resistance

Taxonomy

Topics: Artificial Intelligence in Healthcare · Imbalanced Data Classification Techniques · Machine Learning in Healthcare