# Stratification of Pro-Atherogenic Phenotypes in Prediabetes Using Machine Learning

**Authors:** Liana Signorini, Waldemar Volanski, Ademir Luiz do Prado, Glaucio Valdameri, Mauren Isfer Anghebem, Vivian Rotuno Moure, Marcel Henrique Marcondes Sari, Geraldo Picheth, Fabiane Gomes de Moraes Rego

PMC · DOI: 10.3390/biomedicines14030651 · Biomedicines · 2026-03-13

## TL;DR

This study applies k-means clustering to routine lipid and glucose tests to split prediabetic individuals into a pro-atherogenic group and a less-atherogenic group with respect to cardiovascular disease.

## Contribution

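The paper presents a machine learning approach (k-means clustering followed by binomial logistic regression) that stratifies prediabetic individuals into pro-atherogenic and less-atherogenic subgroups using routine laboratory biomarkers.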

## Key findings

- Triglycerides (AUC 0.977), AIP (AUC 0.978), and the TyG index (AUC 0.974) separated the two clusters with sensitivity and specificity above 90%.
- Binomial logistic regression on AIP and LDL-C/HDL-C reproduced the k-means cluster assignments with an AUC of 0.984 and accuracy above 93%.
- The k-means algorithm split the 3024 prediabetic individuals into a pro-atherogenic cluster (P-AC; n = 1113) and a less-atherogenic cluster (L-AC; n = 1911).
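The abstract reports these findings as areas under the ROC curve together with sensitivity and specificity. The following is a minimal sketch of how such figures can be computed for a single biomarker against binary cluster labels, using the rank-based (Mann-Whitney) definition of AUC. The function names, the toy AIP values, the labels, and the 0.09 cut-off are all hypothetical and are not taken from the paper.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a positive case outscores a negative one.

    Ties count as half a win (Mann-Whitney formulation).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(scores, labels, threshold):
    """Classify score >= threshold as positive and return (sens, spec)."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data: 1 = pro-atherogenic cluster, 0 = less-atherogenic.
aip_values = [0.45, 0.30, 0.22, 0.08, 0.10, 0.05, -0.10, -0.20]
cluster_labels = [1, 1, 1, 1, 0, 0, 0, 0]

auc = roc_auc(aip_values, cluster_labels)
sens, spec = sensitivity_specificity(aip_values, cluster_labels, threshold=0.09)
print(f"AUC={auc:.4f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

The pairwise comparison is quadratic in sample size, which is fine for illustration; a sort-based rank computation would be the usual choice for a cohort of several thousand records.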

## Abstract

Background/Objectives: Prediabetes is a metabolic condition involving various phenotypes of glucose metabolism. Prediabetes increases the risk of heart disease, among other conditions. Hence, we employed machine learning tools to characterize phenotypes associated with cardiovascular disease using routine laboratory biomarkers. Methods: We processed laboratory records of over 1,000,000 de-identified individuals, resulting in a sample of 3024 individuals classified as prediabetic (fasting blood glucose 100–125 mg/dL combined with HbA1c 5.7–6.4%). Lipid profile parameters (total cholesterol [TC], HDL-C, LDL-C, and triglycerides [TG]) and associated indices (atherogenic index of plasma [AIP], defined as log10(TG/HDL-C); triglyceride–glucose index [TyG]; TC/HDL-C; and LDL-C/HDL-C, among others) were analyzed using the k-means algorithm. Two groups emerged based on biomarker concentrations: a pro-atherogenic cluster (P-AC; n = 1113) and a less-atherogenic cluster (L-AC; n = 1911) for cardiovascular disease. Results: We assessed the performance of biomarkers in the P-AC and L-AC clusters using a receiver operating characteristic curve. Triglycerides (area under the curve [AUC] 0.977), AIP (AUC 0.978), and TyG (AUC 0.974) showed sensitivity and specificity >90%. The TC/HDL-C (AUC 0.903) and LDL-C/HDL-C (AUC 0.865) indices also performed well, with sensitivity and specificity of 80%. Binomial logistic regression applied to the groups generated by k-means using the biomarkers AIP and LDL-C/HDL-C showed an AUC of 0.984 and accuracy above 93%. Conclusions: The k-means algorithm enabled the identification of a P-AC for cardiovascular disease among prediabetics using cost-effective laboratory biomarkers that are widely accessible in laboratories. Individuals classified as P-AC may benefit from differentiated treatment to minimize this factor.
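The indices named in the abstract are simple functions of the routine lipid panel and fasting glucose. The sketch below shows one common way to compute them. Several points are assumptions rather than details confirmed by this summary: AIP is computed on molar concentrations (mg/dL converted to mmol/L with the usual factors 88.57 for triglycerides and 38.67 for cholesterol), TyG uses the widely cited form ln(TG × glucose / 2) with both values in mg/dL, and the function names and sample values are hypothetical.

```python
import math

# Conventional mg/dL-per-mmol/L conversion factors (assumed, not from the paper).
TG_MGDL_PER_MMOLL = 88.57
CHOL_MGDL_PER_MMOLL = 38.67


def aip(tg_mgdl, hdl_mgdl):
    """Atherogenic index of plasma: log10(TG / HDL-C) on mmol/L values."""
    tg_mmol = tg_mgdl / TG_MGDL_PER_MMOLL
    hdl_mmol = hdl_mgdl / CHOL_MGDL_PER_MMOLL
    return math.log10(tg_mmol / hdl_mmol)


def tyg(tg_mgdl, glucose_mgdl):
    """Triglyceride-glucose index: ln(TG * fasting glucose / 2), mg/dL inputs."""
    return math.log(tg_mgdl * glucose_mgdl / 2)


def lipid_indices(tc, hdl, ldl, tg, glucose):
    """Return the indices named in the abstract for one record (mg/dL inputs)."""
    return {
        "AIP": aip(tg, hdl),
        "TyG": tyg(tg, glucose),
        "TC/HDL-C": tc / hdl,
        "LDL-C/HDL-C": ldl / hdl,
    }


# Hypothetical record, values in mg/dL.
record = lipid_indices(tc=200, hdl=50, ldl=120, tg=150, glucose=110)
for name, value in record.items():
    print(f"{name}: {value:.3f}")
```

If the source data were already in mmol/L, the conversion step in `aip` would be dropped; the two cholesterol ratios are unit-independent as long as numerator and denominator share the same unit.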

## Linked entities

- **Diseases:** prediabetes (MONDO:0006920), heart disease (MONDO:0005267)

## Full-text entities

- **Genes:** AIP (AHR interacting HSP90 co-chaperone) [NCBI Gene 9049] {aka ARA9, FKBP16, FKBP37, PITA1, SMTPHN, XAP-2}
- **Diseases:** Prediabetes (MESH:D011236), heart disease (MESH:D006331), cardiovascular disease (MESH:D002318), Atherogenic (MESH:D050197)
- **Chemicals:** TG (MESH:D013866), LDL-C (-), cholesterol (MESH:D002784), Triglycerides (MESH:D014280), Lipid (MESH:D008055), TC (MESH:D013667), glucose (MESH:D005947)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC13024570/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13024570/full.md

## References

60 references — full list in the complete paper: https://tomesphere.com/paper/PMC13024570/full.md

---
Source: https://tomesphere.com/paper/PMC13024570