Literature‐informed ensemble machine learning for three‐year diabetic kidney disease risk prediction in type 2 diabetes: Development, validation, and deployment of the PSMMC NephraRisk model

Ayla M. Tourkmani; Turki J. Al‐Harbi; Ahmad Abdullah Alghamdi; Ibrahim M. Youzghadli; Faris Saad Alosaimi; Ahmed Y. Azzam

PMC · DOI:10.1111/dom.70385·December 15, 2025


Open Access

TL;DR

This paper presents a machine learning model to predict the risk of diabetic kidney disease in type 2 diabetes patients, incorporating social and demographic factors, and demonstrates its accuracy and fairness.

Contribution

The novel contribution is a literature-informed ensemble model for DKD/DN risk prediction that includes social determinants and is ready for deployment.

Findings

1. The model achieved an AUROC of 0.852 (95% CI 0.847–0.857) with near-perfect calibration (slope 0.98, intercept −0.012).

2. It showed higher net benefit than a treat-all strategy at risk thresholds from 5% to 25%, and subgroup AUROC differences stayed within the stated fairness threshold (maximum |Δ‐AUROC| 0.027, threshold ≤0.03).

3. The model was deployed as an interactive web-based application for practical use.

Abstract

Diabetic kidney disease (DKD) and diabetic nephropathy (DN) affect around 40% of patients with diabetes, yet accurate risk prediction tools that account for social determinants and demographic complexity are lacking. We developed and validated an ensemble machine learning model for three‐year DKD/DN risk prediction with deployment readiness. We analysed 18 742 eligible adults with type 2 diabetes from the Prince Sultan Military Medical City (PSMMC) registry in Riyadh, Saudi Arabia, between 2019 and 2024. Using temporal patient‐level splitting, we developed a stacked ensemble model (LightGBM + CoxBoost) whose features include several literature‐informed variables (family history, non‐steroidal anti‐inflammatory drug (NSAID) use, socioeconomic deprivation, diabetic retinopathy severity, and antihypertensive medications), imputed via Bayesian multiple imputation by chained…

Figures (7)

Model calibration evaluation for predicted versus observed risk. The calibration plot displays predicted versus observed 36‐month DKD/DN risk. Dashed diagonal line: Perfect calibration (slope = 1.0, intercept = 0.0); Solid blue line: Observed calibration of the final clinical model (slope = 0.98, intercept = −0.012); Blue dots with error bars: Observed event rates in deciles of predicted risk with 95% confidence intervals from Kaplan–Meier estimates; Grey shaded region: 95% confidence band for the calibration line from 1000 bootstrap resamples. The close alignment between the solid line and perfect calibration diagonal demonstrates excellent model calibration. The dots represent empirical validation of predicted risk in patient subgroups, with error bars indicating statistical uncertainty in observed event rates. Y‐axis represents observed 36‐month event rates calculated using Kaplan–Meier estimates in deciles of predicted risk. X‐axis represents mean predicted 36‐month risk within each decile.
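
The decile construction described in this caption can be illustrated with a minimal Python sketch. It makes two simplifying assumptions that differ from the paper: observed event rates are plain proportions of fully observed binary outcomes (not Kaplan–Meier estimates), and the calibration line is an ordinary least-squares fit through the grouped points. The function names `calibration_by_group` and `fit_line` are hypothetical.

```python
from statistics import mean


def calibration_by_group(predicted, observed, n_groups=10):
    """Group patients by predicted risk; return (mean predicted, observed rate) pairs.

    Simplification (not the paper's method): observed rates are plain
    proportions of binary outcomes, not Kaplan-Meier estimates.
    """
    pairs = sorted(zip(predicted, observed))  # order patients by predicted risk
    n = len(pairs)
    rows = []
    for g in range(n_groups):
        chunk = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not chunk:
            continue  # skip empty groups when n < n_groups
        rows.append((mean(p for p, _ in chunk), mean(o for _, o in chunk)))
    return rows


def fit_line(points):
    """Least-squares slope and intercept of observed rate on mean predicted risk."""
    xs, ys = zip(*points)
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar
```

With perfectly calibrated predictions the fitted slope is 1 and the intercept is 0, which corresponds to the dashed diagonal in the plot.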

Model development progression for temporal performance metrics.

Tables (6)

TABLE 1. Baseline demographics and characteristics of registry cohort patients.

Characteristic	Value	Missing, n visits (%)
Total cohort
Total number of patients	18 742	—
Total number of recorded visits	42 143	—
Demographics
Age, years		0 (0.0%)
Mean ± SD	58.8 ± 11.4
Median (IQR)	59 (51–66)
Gender, n (%)		0 (0.0%)
Female	23 935 (56.8%)
Male	18 208 (43.2%)
Nationality, n (%)		0 (0.0%)
Saudi	41 978 (99.6%)
Non‐Saudi	165 (0.4%)
Laboratory parameters
eGFR, mL min⁻¹ 1.73 m⁻²		3 (0.0%)
Mean ± SD	90.0 ± 56.9
Median (IQR)	92 (78–102)
ACR, mg g⁻¹		3972 (9.4%)
Mean ± SD	92.2 ± 420.2
Median (IQR)	17 (8–32)
HbA1c, %		16 016 (38.0%)
Mean ± SD	8.1 ± 1.6
Median (IQR)	8.0 (7.0–9.0)
Serum phosphorus, mg dL⁻¹		1686 (4.0%)
Mean ± SD	3.8 ± 0.6
Median (IQR)	3.7 (3.4–4.1)
FGF‐23, pg mL⁻¹		6321 (15.0%)
Mean ± SD	68.3 ± 45.2
Median (IQR)	54.7 (38.2–82.5)
Anthropometric measurements
BMI, kg m⁻²		1257 (3.0%)
Mean ± SD	32.2 ± 6.6
Median (IQR)	32 (28–36)
Waist circumference, cm ^a		18 453 (43.8%)
Mean ± SD	103.4 ± 14.2
Median (IQR)	102 (94–112)
Clinical parameters
Systolic blood pressure, mmHg		211 (0.5%)
Mean ± SD	136.8 ± 18.4
Median (IQR)	135 (124–148)
Diastolic blood pressure, mmHg		211 (0.5%)
Mean ± SD	78.3 ± 11.2
Median (IQR)	78 (70–86)
Diabetes duration, years		337 (0.8%)
Mean ± SD	11.2 ± 7.8
Median (IQR)	10 (5–16)
Comorbidities, n (%)
Hypertension	28 280 (67.2%)	0 (0.0%)
Cardiovascular disease	4214 (10.0%)	126 (0.3%)
Current smoker	5057 (12.0%)	843 (2.0%)
Former smoker	3371 (8.0%)	843 (2.0%)
Baseline CKD stage, n (%)
No CKD (eGFR ≥90)	23 277 (55.2%)	3 (0.0%)
Stage 1–2 (eGFR 60–89)	14 560 (34.6%)
Stage 3a (eGFR 45–59)	2953 (7.0%)
Stage 3b (eGFR 30–44)	970 (2.3%)
Stage 4 (eGFR 15–29)	380 (0.9%)
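
As a small illustration, the baseline categories at the end of Table 1 can be written as a lookup on eGFR. The cut-offs and labels below are copied from the table; the function name is hypothetical, and the handling of values below 15 is an assumption because Table 1 has no row for them.

```python
def baseline_ckd_stage(egfr):
    """Map eGFR (mL/min/1.73 m^2) to the baseline categories listed in Table 1."""
    if egfr >= 90:
        return "No CKD"
    if egfr >= 60:
        return "Stage 1-2"
    if egfr >= 45:
        return "Stage 3a"
    if egfr >= 30:
        return "Stage 3b"
    if egfr >= 15:
        return "Stage 4"
    # Assumption: eGFR <15 has no row in Table 1, so it is flagged separately
    return "Below Table 1 range"
```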

TABLE 2. External studies characteristics for synthetic variable priors.

Study name	Data source	Sample size	Design	Mean age ± SD (years)	Female (%)	Ethnicity	Diabetes (%)	Follow‐up	Primary outcome	Prior contribution	Effect size used
Qu et al. 2024 ⁴⁶	UK Biobank	33 441	Prospective cohort	NR	NR	Predominantly White	23.8	Median 12.3 years	Incident DKD/DR/DN	Mediterranean diet effects	HR 0.64–0.79 for AMED score
Castillo‐García et al. 2024 ⁴⁷	UK Biobank	517 917	Prospective cohort	56.6 ± 8.1	55.0	94% White	5.9	2.7 years	Prevalent & incident CKD	Socioeconomic deprivation	Townsend index Q5 = 19.9%
Weldegiorgis et al. 2024 ⁴⁸	CPRD (UK primary care)	1 397 573	Population cohort	48.6 ± 15.7	58.2	Predominantly White	7.2	7.5 years (IQR 5.2–10.2)	Stage 4–5 CKD/ESKD	Socioeconomic deprivation	IMD index, uniform quintiles
Zhou et al. 2023 ⁴⁹	Chinese T2D cohort	19 858	Prospective cohort	NR	NR	Han Chinese	100	Mean 1.6 years	Incident DKD	Statin protective effects	HR 0.72 (0.62–0.83)
Borrelli et al. 2023 ⁵⁰	Italian CKD registry	906	Prospective cohort	NR	NR	Italian	NR	Median 7.8 years	DKD Progression/CV Events	Nocturnal BP patterns	HR 1.82–2.40 for nondipping
Filippatos et al. 2021 ⁵¹	FIDELIO‐DKD (Global RCT)	5674	Randomised controlled trial	66.6 ± 9.1	29.0	72% White	100	2.6 years	Kidney composite (40% eGFR↓ or ESRD)	Antihypertensive medications	All on ACEi/ARB
Li et al. 2021 ⁵²	Meta‐analysis (10 cohorts)	635	Systematic review	46.3–59 (range)	41–94 (range)	Mixed populations	100	5–13 years	Biopsy‐proven diabetic nephropathy	DR severity ladder	HRs [2.9, 5.8, 10.2, 16.6]
Hsing et al. 2021 ⁵³	Taiwan tertiary centre	841	Cross‐sectional study	68.2 ± 13.8	67.7	Han Chinese	100	Cross‐sectional	DR grade vs. CKD stage	DR prevalence distribution	None 50%, Mild 20%, Mod 15%, Severe 10%, PDR 5%
Zhao et al. 2021 ⁵⁴	Meta‐analysis (China)	13 743	Meta‐analysis	NR	NR	Chinese populations	100	Cross‐sectional	Association of obesity with DKD	Abdominal obesity effects	SMD 0.17–0.27 for WC/VFA
Heerspink et al. 2020 ⁵⁵	DAPA‐CKD RCT	4304	Randomised controlled trial	NR	NR	Multinational	67.5	Median 2.4 years	DKD Progression	SGLT2i protective effects	HR 0.56 (0.45–0.68)
Perkovic et al. 2019 ⁵⁶	CREDENCE RCT	4401	Randomised controlled trial	NR	NR	Multinational	100	Median 2.62 years	DKD Progression/CV Events	SGLT2i protective effects	HR 0.66 (0.53–0.81)
Liao et al. 2019 ⁵⁷	Meta‐analysis	203 337	Systematic review	NR	NR	Mixed populations	Mixed T1D and T2D (we used only T2D data and findings)	≥1 year	Incident DKD	Smoking effects on DKD risk	HR 1.38–1.63 by pack‐years
Yamanouchi et al. 2019 ⁵⁸	Japanese T2D cohort	232	Prospective cohort	NR	NR	Japanese	100	Median 5.7 years	ESRD	DR severity effects	HR 3.03–3.43 by DR grade
Rosenstock et al. 2018 ⁵⁹	CARMELINA RCT	6979	Randomised controlled trial	NR	NR	Multinational	100	Median 2.2 years	CV Events/DKD Progression	DPP‐4i safety profile	HR 1.04 (0.89–1.22)
Zhang et al. 2018 ⁶⁰	Chinese T2D cohort	141	Prospective cohort	NR	NR	Han Chinese	100	≥1 year	ESRD	DR as DKD predictor	HR 2.58 (1.22–5.47)
Kuwata et al. 2016 ⁶¹	J‐DREAMS registry	3454	Prospective registry	65.1 ± 8.9	39.6	Japanese	100	1.36 years	≥30% eGFR decline	T2D population characteristics	Background cohort data
Kramer et al. 2016 ⁶²	US population cohort	26 960	Prospective cohort	NR	NR	US population	Mixed	Median 6.3 years	Incident ESRD	Waist circumference effects	HR 3.79 (2.10–6.86) highest vs. lowest
Hsu et al. 2015 ⁶³	NHIRD Taiwan	31 976	Propensity‐matched cohort	57.4 ± 13.3	52.1	Han Chinese	27.6	4 years	New‐onset CKD	NSAID exposure ≥90 days/year	HR 1.32, 30% chronic exposure
Da et al. 2015 ⁶⁴	Meta‐analysis	25 546	Systematic review	NR	NR	Mixed populations	Mixed	Varies by cohort	DKD Progression/Mortality	Serum phosphorus effects	HR 1.36 (1.20–1.55) per mg/dL
Grunwald et al. 2014 ⁶⁵	US CKD cohort	1852	Prospective cohort	NR	NR	US population	Mixed	Median 2.3 years	DKD Progression	Retinopathy‐nephropathy link	Established association
McClellan et al. 2012 ⁶⁶	REGARDS cohort	19 409	Prospective cohort	63.9 ± 9.7	62.2	39.9% African‐American	19.9	To August 2009	Incident ESRD	Family history of CKD	HR 2.04, 21.8% prevalence
Gansevoort et al. 2011 ⁶⁷	Meta‐analysis	1 019 017	Meta‐analysis	NR	NR	Multinational	Mixed	Varies by cohort	DKD Progression	eGFR/ACR risk stratification	HR 9.6–573 by eGFR stage, HR 12.0–72.1 by ACR
Isakova et al. 2011 ⁶⁸	CRIC cohort	3879	Prospective cohort	NR	NR	US population	Mixed	Median 3.5 years	ESRD/Mortality	FGF‐23 biomarker effects	HR 1.3–1.7 by eGFR stratum
Brenner et al. 2001 ⁶⁹	RENAAL RCT	1513	Randomised controlled trial	NR	NR	Multinational	100	Mean 3.4 years	DKD Progression	ARB protective effects	HR 0.72 (0.54–0.97)
Lewis et al. 2001 ⁷⁰	IDNT RCT	1715	Randomised controlled trial	NR	NR	Multinational	100	Mean 2.6 years	DKD Progression	ARB vs. CCB comparison	HR 0.77 (0.57–1.03)
Parving et al. 2001 ⁷¹	IRMA‐2 RCT	590	Randomised controlled trial	NR	NR	Multinational	100	2 years	DKD Progression	ARB in microalbuminuria	HR 0.30 for nephropathy onset
Stratton et al. 2000 ⁷²	UKPDS	3642	Long‐term diabetes cohort	53 ± 8	40.0	83% White	100	10 years	Microvascular & macrovascular events	Antihypertensive medication classes	β‐blocker 35%, CCB 12%, diuretic 9%
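
Where several trials inform the same prior, one common way to combine them is fixed-effect inverse-variance pooling on the log hazard ratio scale. The sketch below applies that technique to the two SGLT2 inhibitor rows of Table 2. The source does not state how its priors were pooled, so this is an assumption for illustration, and the function names are hypothetical.

```python
import math


def se_from_ci(lower, upper, z=1.96):
    """Standard error of log(HR) recovered from a 95% confidence interval."""
    return (math.log(upper) - math.log(lower)) / (2 * z)


def pool_hazard_ratios(estimates):
    """Fixed-effect inverse-variance pooling of (HR, lower, upper) tuples."""
    weights = [1 / se_from_ci(lo, hi) ** 2 for _, lo, hi in estimates]
    log_hrs = [math.log(hr) for hr, _, _ in estimates]
    pooled_log = sum(w * b for w, b in zip(weights, log_hrs)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return (
        math.exp(pooled_log),
        math.exp(pooled_log - 1.96 * pooled_se),
        math.exp(pooled_log + 1.96 * pooled_se),
    )


# SGLT2 inhibitor rows from Table 2: DAPA-CKD and CREDENCE
sglt2i = [(0.56, 0.45, 0.68), (0.66, 0.53, 0.81)]
hr, lower, upper = pool_hazard_ratios(sglt2i)
print(f"pooled HR {hr:.2f} ({lower:.2f}-{upper:.2f})")
```

Under these assumptions the pooled point estimate comes out close to 0.61. The pooled interval depends on the pooling method and need not match any interval reported elsewhere in the paper.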

TABLE 3. Model specifications, literature‐informed priors, and clinical validation.

Component	Variable/parameter	Missing (%)	Method/range	Final value	Evidence source/quality
Observed variables—Data handling
Laboratory	eGFR (CKD‐EPI 2021)	0.0	Real‐time calculation	User input required	KDIGO 2024 Guidelines
	UACR (mg/g)	9.4	Direct measurement	User input required	Laboratory standard
	HbA1c (%)	38.0	NGSP standardised	User input required	ADA Standards 2025
	Serum phosphorus (mg/dL)	4.0	Direct measurement	User input required	Laboratory standard
	FGF‐23 (pg/mL)	15.0	Direct measurement	User input required	Laboratory standard
Anthropometric	BMI (kg/m²)	3.0	Calculated: weight/height²	User input required	Clinical measurement
Anthropometric	Waist circumference (cm)	43.8	Direct measurement	Median imputation	Clinical measurement
Cardiovascular	Systolic BP (mmHg)	0.5	Manual/automated	User input required	Clinical standard
Cardiovascular	Diastolic BP (mmHg)	0.5	Manual/automated	User input required	Clinical standard
Clinical history	Diabetes duration (years)	0.8	Self‐report + records	User input required	Medical records
Clinical history	Smoking status	2.0	Self‐report	User input required	Clinical assessment
Literature‐informed coefficient sources
Demographics	Age per decade	N/A	Literature meta‐analysis	HR 1.16 (1.13–1.19)	NEJM 2019, High quality
	Male gender	N/A	Literature pooling	HR 1.19 (1.11–1.27)	Lancet 2020, High quality
	Ethnicity effects	N/A	Population studies	HR 1.24–1.48	Multiple cohorts, Medium quality
Body composition	BMI per 5 units	N/A	Diabetes Care 2024	HR 1.09 (1.06–1.12)	Large cohort, High quality
	eGFR per 10 mL decrease	N/A	KDIGO 2024 Guidelines	HR 1.24 (1.20–1.28)	Meta‐analysis, High quality
	ACR log₂ transformation	N/A	Multiple RCTs	HR 1.30 (1.26–1.34)	CREDENCE/DAPA‐CKD, High quality
Glycaemic control	HbA1c per 1%	N/A	Diabetes Care 2023	HR 1.13 (1.10–1.16)	Systematic review, High quality
Clinical history	Diabetes duration per 5y	N/A	UKPDS + meta‐analyses	HR 1.10 (1.07–1.13)	Long‐term cohorts, High quality
Cardiovascular	Systolic BP per 10 mmHg	N/A	Hypertension studies	HR 1.07 (1.05–1.09)	Multiple cohorts, High quality
Lifestyle	Current smoking	N/A	Meta‐analysis	HR 1.35 (1.28–1.42)	Prospective cohorts, High quality
Missing data handling strategy
Clinical history	Family history CKD	100.0	Population prevalence	21.8% prevalence	REGARDS study, HR 2.04
Medication history	NSAID chronic use	100.0	Literature prevalence	30% exposure rate	Taiwan NHIRD, HR 1.32
Socioeconomic	Deprivation index	100.0	Population distribution	Uniform quintiles	UK CPRD studies
Ophthalmologic	Retinopathy severity	100.0	Clinical prevalence	Severity‐stratified HRs	Asian cohorts + meta‐analysis
Medication	SGLT2 inhibitor use	100.0	Prescription patterns	Treatment effect	CREDENCE/DAPA‐CKD, HR 0.61
	ACE/ARB therapy	100.0	Prescription patterns	Treatment effect	RENAAL/IDNT, HR 0.77
	Statin therapy	100.0	Prescription patterns	Treatment effect	Chinese cohort, HR 0.88
	GLP‐1 RA therapy	100.0	Prescription patterns	Treatment effect	FLOW trial, HR 0.79
	Finerenone (MRA)	100.0	Prescription patterns	Treatment effect	FIDELIO‐DKD, HR 0.82
Administrative censoring	Patients censored <36 months	10.1	Excluded from M1‐M4; retained in M5‐M6	n = 1895 patients	Low rate supports binary model validity; temporal design limitation
Base hazard rates (monthly)
CKD staging	No CKD baseline	N/A	Literature calibration	0.00048 monthly	Large cohort studies
	Stage 1–2 baseline	N/A	Literature calibration	0.0013 monthly	Enhanced detection
	Stage 3a baseline	N/A	Literature calibration	0.0030 monthly	Progression‐adjusted
	Stage 3b baseline	N/A	Literature calibration	0.0070 monthly	Enhanced detection
	Stage 4 baseline	N/A	Literature calibration	0.016 monthly	High risk group
Model validation metrics
Discrimination	C‐statistic	N/A	Validation cohort	0.852 (0.847–0.857)	Excellent discrimination
Calibration	Calibration slope	N/A	Validation cohort	0.98	Well‐calibrated
Calibration	Calibration intercept	N/A	Validation cohort	−0.012	Near‐perfect
Overall performance	Brier score	N/A	Validation cohort	0.085	Better calibration
Stability	Bootstrap optimism	N/A	1000 replicates	0.005	Minimal overfitting
Validation sample	Total patients	N/A	Multi‐cohort	18 742	ACCORD + UKPDS + ADVANCE + CANVAS
Risk calculation framework
Time horizon	Prediction period	N/A	Clinical relevance	36 months	Actionable timeframe
Risk categories	KDIGO‐aligned thresholds	N/A	Clinical guidelines	<5%, 5%–15%, 15%–30%, >30%	Evidence‐based cutpoints
Confidence intervals	Bootstrap methodology	N/A	Statistical robustness	95% CI	1000‐replicate bootstrap
Model uncertainty	Feature‐based calculation	N/A	Uncertainty quantification	0.05 + (n_features × 0.01)	Complexity‐adjusted
Protective medication effects
SGLT2 inhibitors	Renal protection	N/A	CREDENCE/DAPA‐CKD trials	HR 0.61 (0.55–0.67)	Class 1A evidence
ACE/ARB therapy	RAAS blockade	N/A	RENAAL/IDNT trials	HR 0.77 (0.71–0.83)	Class 1A evidence
GLP‐1 RA	Multi‐benefit therapy	N/A	FLOW trial	HR 0.79 (0.73–0.85)	Recent RCT evidence
Finerenone	MRA therapy	N/A	FIDELIO‐DKD trial	HR 0.82 (0.76–0.88)	Novel evidence
Statin therapy	Lipid management	N/A	Multiple studies	HR 0.88 (0.84–0.92)	Established evidence
Clinical decision support features
Input validation	Clinical plausibility	N/A	Range checking	eGFR 5–150, HbA1c 4%–20%	Safety bounds
Feature importance	Real‐time calculation	N/A	Coefficient‐based	β × feature_value	SHAP values
Recommendations	Evidence‐based guidance	N/A	Guideline‐aligned	Risk‐stratified actions	KDIGO/ADA guidelines
Export functionality	Clinical reporting	N/A	Structured format	PDF/text reports	Clinical workflow
Model governance
Regulatory status	Research designation	N/A	Compliance framework	Research Use Only	Not FDA approved
Version control	Model versioning	N/A	Systematic tracking	v2.1.0	Calibrated 2025‐01‐15
Performance monitoring	Continuous assessment	N/A	Quality metrics	Quarterly recalibration	Evidence updates
Literature updates	Evidence incorporation	N/A	Systematic review	Latest trial evidence	Ongoing process
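
Table 3 lists monthly base hazards, hazard ratios, a 36-month horizon, risk bands, and input bounds, but this section does not show how they are combined. The sketch below assumes a constant-hazard proportional-hazards form, risk = 1 − exp(−h₀ × 36 × ∏HR), which is one plausible reading and not the confirmed deployment formula. It omits the isotonic calibration step, and all function names are hypothetical.

```python
import math

# Monthly base hazards by baseline CKD stage, as listed in Table 3
BASE_HAZARD = {
    "No CKD": 0.00048,
    "Stage 1-2": 0.0013,
    "Stage 3a": 0.0030,
    "Stage 3b": 0.0070,
    "Stage 4": 0.016,
}


def validate_inputs(egfr, hba1c):
    """Apply the plausibility bounds from Table 3 (eGFR 5-150, HbA1c 4-20%)."""
    if not 5 <= egfr <= 150:
        raise ValueError(f"eGFR {egfr} outside 5-150")
    if not 4 <= hba1c <= 20:
        raise ValueError(f"HbA1c {hba1c} outside 4-20%")


def risk_36_months(stage, hazard_ratios, months=36):
    """Assumed form: constant monthly hazard scaled by the product of hazard ratios."""
    hazard = BASE_HAZARD[stage] * math.prod(hazard_ratios)
    return 1 - math.exp(-hazard * months)


def risk_category(risk):
    """Bands from Table 3: <5%, 5%-15%, 15%-30%, >30%.

    Assumption: a value exactly on a shared boundary (5% or 15%)
    is assigned to the higher band; the table does not say.
    """
    if risk < 0.05:
        return "<5%"
    if risk < 0.15:
        return "5%-15%"
    if risk <= 0.30:
        return "15%-30%"
    return ">30%"


# Hypothetical patient: Stage 3a, current smoker (HR 1.35), on an SGLT2 inhibitor (HR 0.61)
validate_inputs(egfr=55, hba1c=8.1)
risk = risk_36_months("Stage 3a", [1.35, 0.61])
print(f"{risk:.1%} -> {risk_category(risk)}")
```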

TABLE 4. Model architecture performance comparison and development tracking.

Model	Algorithm configuration	Features	AUROC (95% CI)	AUPRC	Brier score	Calibration slope
Baseline models (observed variables only)
M‐1	Elastic‐net logistic regression	17 observed variables	0.803	0.425	0.074	0.96
	Clinical variables only	Single imputation
Literature‐informed models (observed + imputed variables)
M‐2	Elastic‐net logistic regression	17 observed + 8 lit‐informed	0.804	0.433	0.072	0.98
	With literature priors	20 imputations	(+0.001) p = 0.68
M‐3	LightGBM gradient boosting	17 observed variables	0.842	0.470	0.069	1.04
	Tree‐based ensemble	Single imputation	(+0.039)*** p < 0.001
M‐4	LightGBM gradient boosting	17 observed + 8 lit‐informed	0.862	0.511	0.067	1.05
	Full feature set	20 imputations	(+0.020)*** p < 0.001
M‐5	CoxBoost survival model	17 observed + 8 lit‐informed	0.849	0.493	0.068	N/A ^a
	Time‐to‐event formulation	20 imputations	(+0.007)** p = 0.003
Ensemble model
M‐6	Stacked ensemble	LightGBM + CoxBoost	0.866 (0.862–0.870)	0.522	0.067	1.03
	Meta‐learner: Logistic (L2)	25 features total	(+0.004)* p = 0.009
	Weights: LightGBM 0.64, Cox 0.36	20 imputations
Clinical deployment model
Final model	Literature‐informed formula	Population‐averaged coefficients	0.852 (0.847–0.857)	0.515	0.085	0.98
	Isotonic calibrated	Evidence‐based HRs
	Real‐time calculation	25 features
Development phase tracking
Pre‐tuning baseline	Initial ensemble	25 features, 20 imputations	0.844	0.457	0.072	0.89
Post‐Optuna tuning	Hyperparameter optimised	600 evaluations (30 × 20)	0.862	0.509	0.068	1.06
Final production	Isotonic calibrated	Multi‐trial validated	0.866	0.522	0.067	1.03
Literature‐informed progression
Literature extraction	Coefficient derivation	Base evidence coefficients	0.841	0.488	0.089	0.94
Enhanced sensitivity	Adjusted base hazards	Early detection focus	0.848	0.502	0.087	0.96
Final calibration	Multi‐trial validation	ACCORD + UKPDS + ADVANCE + CANVAS	0.852	0.515	0.085	0.98
Cross‐validation performance
M‐6 Internal CV (Training)	5‐fold stratified	Training set only	0.933 ± 0.004	—	—	—
M‐6 Internal CV (Validation)	5‐fold stratified	Validation set	0.872 ± 0.006	—	—	—
M‐6 Bootstrap optimism	1000 BCa resamples	Bias‐corrected, Optimism: 0.005	0.866	—	—	—
Multi‐trial external validation
ACCORD Trial	T2D intensive therapy	External cohort	0.849	—	—	0.97
UKPDS Cohort	Long‐term T2D outcomes		0.845	—	—	0.96
ADVANCE Trial	Intensive glucose control		0.851	—	—	0.99
CANVAS Trial	SGLT2i cardiovascular		0.856	—	—	0.99
Pooled validation	Combined trials	Meta‐analysis	0.852 ± 0.005	—	—	0.98
Clinical utility metrics
Net benefit at 10% threshold	Decision curve analysis	Vs. treat‐all strategy	+22 per 1000	—	—	—
Net benefit at 15% threshold			+18 per 1000	—	—	—
Net benefit at 20% threshold			+12 per 1000	—	—	—
Overfitting diagnostics
Generalisation gap	Train‐Validation AUROC	M‐6 performance	0.061	—	—	—
SHAP top‐10 dominance	Feature importance concentration	Stability assessment	81%	—	—	—
Cook's distance outliers	Influential observations	>4/n threshold	0.7% of rows	—	—	—
Sensitivity/specificity analysis
At 10% threshold	Clinical validation	Final model	Sens: 91%	Spec: 79%	—	—
At 15% threshold			Sens: 84%	Spec: 87%	—	—
At 20% threshold			Sens: 76%	Spec: 92%	—	—
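
The net benefit figures in Table 4 come from decision curve analysis. The standard definition is net benefit = TP/n − (FP/n) × pt/(1 − pt) at risk threshold pt, and the treat-all strategy is the special case in which every patient is classified positive. The sketch below implements that definition. The counts in the usage example are invented for illustration and are not taken from the study.

```python
def net_benefit(tp, fp, n, threshold):
    """Net benefit = TP/n - FP/n * pt / (1 - pt) at risk threshold pt."""
    odds = threshold / (1 - threshold)
    return tp / n - fp / n * odds


def net_benefit_treat_all(events, n, threshold):
    """Treat-all: every event is a true positive, every non-event a false positive."""
    return net_benefit(events, n - events, n, threshold)


def delta_per_1000(tp, fp, events, n, threshold):
    """Difference vs. treat-all, expressed per 1000 patients as in Table 4."""
    delta = net_benefit(tp, fp, n, threshold) - net_benefit_treat_all(events, n, threshold)
    return 1000 * delta


# Hypothetical counts: 1000 patients, 100 events, model flags 90 events and 200 non-events
print(round(delta_per_1000(tp=90, fp=200, events=100, n=1000, threshold=0.10), 1))
```

On this scale, a difference of +0.022 in net benefit corresponds to +22 per 1000 patients.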

TABLE 5. Clinical factor importance and algorithmic fairness assessment.

Rank/subgroup	Feature/comparison	Type/category	N or clinical gain (%)	AUROC or HR (95% CI)	Δ‐AUROC or unit	Calibration slope	Interpretation
Clinical factor importance (top contributors)
1	eGFR	Observed	28.4 ± 1.2	HR 1.24 (1.20–1.28)	Per 10 mL/min↓	—	Strongest predictor; ↓eGFR → ↑risk
2	Albumin‐creatinine ratio		19.7 ± 0.8	HR 1.30 (1.26–1.34)	Per log₂ unit	—	Proteinuria severity marker
3	Age		12.1 ± 0.4	HR 1.16 (1.13–1.19)	Per decade	—	Non‐linear acceleration >65 years
4	Diabetic retinopathy severity	Lit‐informed	8.9 ± 0.3	HR 3.03 (2.63–3.43)	Per grade (0–4)	—	Microvascular disease marker
5	HbA1c	Observed	7.2 ± 0.4	HR 1.13 (1.10–1.16)	Per 1%	—	Glycaemic control target
6	Diabetes duration		6.5 ± 0.2	HR 1.10 (1.07–1.13)	Per 5 years	—	Progressive disease burden
7	Body mass index		5.8 ± 0.2	HR 1.09 (1.06–1.12)	Per 5 kg/m²	—	Metabolic burden/obesity
8	Systolic blood pressure		4.3 ± 0.2	HR 1.07 (1.05–1.09)	Per 10 mmHg	—	Hypertensive nephrosclerosis
9	Current smoking		3.1 ± 0.1	HR 1.35 (1.28–1.42)	Yes vs. No	—	Vascular damage pathway
10	SGLT2 inhibitor use	Lit‐informed	2.9 ± 0.1	HR 0.61 (0.55–0.67)	Use vs. non‐use	—	Strong renal protection (Class 1A)
11	ACE/ARB therapy	Lit‐informed	2.4 ± 0.1	HR 0.77 (0.71–0.83)	Use vs. non‐use	—	RAAS blockade (Class 1A)
12	Medication compliance	Observed	2.1 ± 0.1	HR 1.30 (1.24–1.36)	Per 2‐pt decrease	—	Adherence impacts outcomes
13	GLP‐1 RA therapy	Lit‐informed	1.4 ± 0.1	HR 0.79 (0.73–0.85)	Use vs. non‐use	—	Multi‐benefit (renal + CV)
14	Sex (Male)	Observed	1.8 ± 0.2	HR 1.19 (1.11–1.27)	Male vs. Female	—	Hormonal/anatomical factors
15	Waist circumference	Lit‐informed	1.2 ± 0.1	HR 1.08 (1.04–1.12)	Per 10 cm	—	Abdominal obesity marker
—	Remaining 10 factors	Mixed	10.6 ± 0.5	—	—	—	Statin, family history, biomarkers, etc.
—	Total Model	25 features	100.0	—	—	—	Comprehensive risk assessment
Grouped contributions
—	Primary kidney markers (eGFR + ACR)	—	48.1 ± 1.5	Combined effect	—	—	Foundation of DKD assessment
—	Glycaemic burden (DR + HbA1c + duration)	—	22.6 ± 0.6		—	—	Diabetes control and progression
—	CV/metabolic risk (BMI + BP + smoking)	—	13.2 ± 0.4		—	—	Modifiable lifestyle factors
—	Protective therapies (SGLT2i + ACE/ARB + GLP‐1)	—	6.7 ± 0.2	Combined HR 0.66	—	—	Guideline‐directed therapy
Algorithmic fairness assessment
Overall cohort	Reference performance	All patients	2811	0.852 (0.847–0.857)	Reference	0.98	Baseline model performance
Sex: Male	—	Male subset	1215	0.854	Reference	1.02	Excellent performance
Sex: Female	vs. Male	Female subset	1596	0.850	−0.004	0.97	Fair—minimal difference
Age: ≥65 years	—	Older patients	876	0.857	Reference	1.01	Higher risk, better discrimination
Age: <65 years	vs. ≥65 years	Younger patients	1935	0.847	−0.010	0.96	Fair—within threshold
CKD: Stage 3	—	Moderate CKD	616	0.859	Reference	1.00	Established kidney disease
CKD: No CKD	vs. Stage 3	Normal kidney	1350	0.832	−0.027	0.93	Fair—lower baseline risk
CKD: Stage 4	vs. Stage 3	Advanced CKD	168	0.871	+0.012	1.02	Fair—high‐risk excellent
HbA1c: 7–9%	—	Moderate control	1567	0.852	Reference	0.98	Most common category
HbA1c: <7%	vs. 7%–9%	Well‐controlled	623	0.838	−0.014	0.95	Fair—fewer events
HbA1c: ≥9%	vs. 7%–9%	Poor control	621	0.863	+0.011	1.01	Fair—higher risk
Rx: On SGLT2i	—	Protected	845	0.848	Reference	0.97	Treatment group
Rx: No SGLT2i	vs. On SGLT2i	Unprotected	1966	0.854	+0.006	0.99	Fair—no treatment bias
Fairness summary
Maximum |Δ‐AUROC|	All comparisons	—	—	—	0.027	0.93–1.02 range	All subgroups fair (threshold ≤0.03)
Calibration equity	All subgroups	—	—	—	—	All 0.93–1.02	Excellent (acceptable 0.80–1.20)
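
The fairness summary can be reproduced directly from the subgroup rows of Table 5. The sketch below copies the reported AUROC values, computes each Δ‐AUROC against its reference subgroup, and checks the largest absolute difference against the stated threshold of 0.03. The dictionary layout and function name are hypothetical.

```python
# AUROC values copied from the fairness rows of Table 5: (subgroup, reference)
COMPARISONS = {
    "Female vs. Male": (0.850, 0.854),
    "Age <65 vs. >=65": (0.847, 0.857),
    "No CKD vs. Stage 3": (0.832, 0.859),
    "Stage 4 vs. Stage 3": (0.871, 0.859),
    "HbA1c <7% vs. 7-9%": (0.838, 0.852),
    "HbA1c >=9% vs. 7-9%": (0.863, 0.852),
    "No SGLT2i vs. On SGLT2i": (0.854, 0.848),
}
FAIRNESS_THRESHOLD = 0.03  # threshold stated in the fairness summary


def fairness_summary(comparisons, threshold=FAIRNESS_THRESHOLD):
    """Return per-comparison deltas, the worst comparison, and a pass/fail flag."""
    deltas = {name: round(sub - ref, 3) for name, (sub, ref) in comparisons.items()}
    worst = max(deltas, key=lambda name: abs(deltas[name]))
    return deltas, worst, abs(deltas[worst]) <= threshold


deltas, worst, passed = fairness_summary(COMPARISONS)
print(worst, deltas[worst], passed)
```

The largest gap is the No CKD versus Stage 3 comparison at −0.027, which matches the maximum reported in the fairness summary.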

TABLE 6. Model validation, clinical utility, and sensitivity analysis.

Validation component	Scenario/metric	N or threshold	Value/result	95% CI or range	Clinical interpretation
Primary discrimination and calibration
Multi‐trial validation	Pooled C‐statistic	18 742	0.852	0.847–0.857	Excellent discrimination
	ACCORD trial	18 742	0.849	—	T2D intensive therapy cohort
	UKPDS cohort	18 742	0.845	—	Long‐term diabetes outcomes
	ADVANCE trial	18 742	0.851	—	Global intensive glucose control
	CANVAS trial	18 742	0.856	—	Contemporary SGLT2i therapy
Calibration performance	Slope (ideal = 1.0)	18 742	0.98	—	Near‐perfect calibration
	Intercept (ideal = 0.0)	18 742	−0.012	—	Minimal systematic bias
	Brier score	18 742	0.085	—	Excellent overall accuracy
Overfitting assessment	Bootstrap optimism	1000 replicates	0.005	—	Minimal overfitting detected
Overfitting assessment	Optimism‐corrected AUROC	18 742	0.852	0.847–0.857	Stable discrimination maintained
Decision curve analysis (net benefit vs. treat‐all)
5% risk threshold	Net benefit (Δ vs. treat‐all)	2811	+0.015	—	Modest utility at low threshold
5% risk threshold	Events prevented per 1000	—	11	—	11 additional cases detected
10% risk threshold	Net benefit (Δ vs. treat‐all)	2811	+0.022	—	Best utility at moderate threshold
	Events prevented per 1000	—	22	—	Peak clinical benefit
	Sensitivity/specificity	—	91%/79%	—	High sensitivity for screening
15% risk threshold	Net benefit (Δ vs. treat‐all)	2811	+0.024	—	Optimal balanced threshold
	Events prevented per 1000	—	18	—	Maximum net benefit achieved
	Sensitivity/specificity	—	84%/87%	—	Balanced clinical utility
20% risk threshold	Net benefit (Δ vs. treat‐all)	2811	+0.024	—	Sustained high utility
	Events prevented per 1000	—	12	—	Efficient for treatment decisions
	Sensitivity/specificity	—	76%/92%	—	High specificity for intervention
25% risk threshold	Net benefit (Δ vs. treat‐all)	2811	+0.023	—	Maintained benefit at high threshold
25% risk threshold	Events prevented per 1000	—	8	—	Conservative screening approach
Literature prior sensitivity analysis
Base literature priors	100% of literature effect sizes	2811	0.852	0.847–0.857	Reference performance
Conservative priors	50% shrinkage toward null	2811	0.847	0.842–0.852	Minimal degradation (Δ −0.005)
Optimistic priors	150% amplification	2811	0.857	0.852–0.862	Slight improvement (Δ +0.005)
Flat non‐informative priors	No literature information	2811	0.844	0.839–0.849	Maximum variation (Δ −0.008)
Asian‐specific priors only	7 Asian population studies	2811	0.850	0.845–0.855	Ethnic robustness (Δ −0.002)
White‐specific priors only	16 White population studies	2811	0.849	0.844–0.854	Ethnic robustness (Δ −0.003)
Prior sensitivity conclusion	AUROC range across scenarios	—	0.844–0.857	Variation ≤0.008	Low dependence; significant validity
Missing data impact assessment
Complete case analysis	Only observed variables	1740	0.849	0.843–0.855	Minimal impact (Δ −0.003)
HbA1c complete cases only	Patients with observed HbA1c	1740	0.849	0.843–0.855	Imputation validated (Δ −0.003)
Alternative MICE for HbA1c	Different imputation method	2811	0.851	0.846–0.856	Method choice minimal (Δ −0.001)
Exclude high‐missingness vars	Remove variables >30% missing	2811	0.846	0.841–0.851	These variables contribute (Δ −0.006)
Exclude socioeconomic (IMD)	Remove deprivation index	2811	0.849	0.844–0.854	Minimal contribution (Δ −0.003)
Exclude waist circumference	Remove WC from model	2811	0.850	0.845–0.855	Minimal unique effect (Δ −0.002)
All observed variables only	No literature‐informed variables	2811	0.842	0.837–0.847	Lit‐informed add value (Δ −0.010)
Censoring sensitivity analysis
Administrative censoring rate	Patients with <36 months follow‐up	18 742	10.1% (n = 1895)	—	Low censoring supports binary model use
Survival‐only model (M‐5)	CoxBoost without binary component	2811	0.849	0.844–0.854	Binary adds discrimination (Δ +0.017)
Complete‐case only (no censored)	Only patients with 36 months follow‐up	16 847	0.859	0.854–0.864	Low censoring impact (Δ +0.007)
High‐censoring simulation (20%)	Simulated increased censoring	2811	0.851	0.846–0.856	Ensemble robust to moderate censoring
High‐censoring simulation (30%)	Simulated high censoring	2811	0.847	0.842–0.852	Survival‐only preferred if censoring >25%
Binary‐only vs. survival‐only	M‐4 (LightGBM) vs. M‐5 (CoxBoost)	2811	0.862 vs. 0.849	p < 0.001	Binary superior discrimination; survival better calibration
Temporal and architectural variations
Alternative temporal split	Different train/val/test dates	2811	0.848	0.843–0.853	Robust to split choice (Δ −0.004)
Random split (non‐temporal)	70–15–15 random split	2811	0.867	0.862–0.872	Temporal appropriately conservative
LightGBM only (no ensemble)	Single ML model	2811	0.862	0.857–0.867	Ensemble improves calibration
CoxBoost only (no ensemble)	Single survival model	2811	0.849	0.844–0.854	Ensemble optimal (Δ −0.003)
Logistic regression only	No machine learning	2811	0.804	0.799–0.809	ML provides benefit (Δ −0.048)
No calibration (raw predictions)	Uncalibrated ensemble	2811	0.866	0.861–0.871	Calibration essential (slope 1.06 → 0.98)
Outcome definition sensitivity
Stricter DKD definition	eGFR <60 + ≥40% decline	2811	0.869	0.864–0.874	Better for severe outcomes (Δ +0.017)
More lenient definition	eGFR <60 + ≥15% decline	2811	0.837	0.832–0.842	Lower for mild outcomes (Δ −0.015)
ESRD only	Dialysis/transplant only	2811	0.891	0.878–0.904	Excellent for ESRD (Δ +0.039)
eGFR decline only	Exclude ACR progression	2811	0.841	0.836–0.846	Composite optimal (Δ −0.011)
ACR progression only	Exclude eGFR decline	2811	0.823	0.818–0.828	eGFR better predicted (Δ −0.029)
Subgroup performance stability
Age <65 years	Younger patients	1935	0.847	0.841–0.853	Consistent in younger (Δ −0.005)
Age ≥65 years	Older patients	876	0.857	0.849–0.865	Better in older (Δ +0.005)
Male only	Male subset	1215	0.854	0.847–0.861	Excellent in males (Δ +0.002)
Female only	Female subset	1596	0.850	0.843–0.857	Excellent in females (Δ −0.002)
No baseline CKD	eGFR ≥90 at baseline	1350	0.832	0.825–0.839	Lower in healthy (expected)
Baseline CKD Stage 3+	eGFR <60 at baseline	616	0.871	0.862–0.880	Better in established CKD
HbA1c <7% (well‐controlled)	Good glycaemic control	623	0.838	0.829–0.847	Consistent in controlled
HbA1c ≥9% (poor control)	Poor glycaemic control	621	0.863	0.854–0.872	Better in poor control
On SGLT2i at baseline	Protected patients	845	0.848	0.839–0.857	Consistent in treated
No protective medications	Unprotected patients	432	0.857	0.846–0.868	Identifies high‐risk untreated
Clinical utility metrics
Number needed to screen	At 15% threshold	—	67	—	Efficient screening strategy
Cost‐effectiveness	Per QALY gained	—	$2340	—	Cost‐effective intervention
Workflow integration	Point‐of‐care calculation	—	<1 second	—	Seamless clinical feasibility
Feature concentration	Top 5 factors	—	63%	—	No excessive concentration
Influential outliers	>3 SD from mean	—	1.2%	—	Minimal outlier impact
Model transparency	Interpretability	—	100%	—	Fully transparent formula
Prediction stability (individual level)
Overall cohort	Median PI width	2811	3.2 pp	IQR 2.1–4.8	High precision estimates
High stability patients	PI width <5 pp	2811	78%	—	Majority highly stable
Moderate stability	PI width 5–10 pp	2811	19%	—	Acceptable stability
Low stability patients	PI width >10 pp	2811	3%	—	Minimal unstable predictions
Stability across risk levels	Low vs. moderate vs. high	—	3.1 vs. 3.4 vs. 3.0 pp	p = 0.31	Consistent across spectrum
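
The prior sensitivity scenarios in Table 6 (50% shrinkage toward the null, 150% amplification, flat priors) can be pictured as rescaling each literature effect size. The sketch below assumes the scaling is applied on the log hazard ratio scale, so that a factor of 0 gives HR = 1. The source does not specify the scale or the exact form of the flat prior, so both are assumptions, and the names are hypothetical.

```python
import math

# Assumed mapping of the Table 6 scenarios to scaling factors on log(HR)
SCENARIOS = {
    "base": 1.0,          # 100% of literature effect sizes
    "conservative": 0.5,  # 50% shrinkage toward null
    "optimistic": 1.5,    # 150% amplification
    "flat": 0.0,          # no literature information (HR = 1)
}


def scale_hazard_ratio(hr, factor):
    """Scale an effect size on the log-hazard scale; factor 0 returns the null HR of 1."""
    return math.exp(factor * math.log(hr))


# Example with the SGLT2 inhibitor hazard ratio from Table 3 (HR 0.61)
for name, factor in SCENARIOS.items():
    print(name, round(scale_hazard_ratio(0.61, factor), 3))
```

Scaling on the log scale keeps protective and harmful effects symmetric: halving log(HR) moves both HR 0.61 and HR 2.04 toward 1 without either crossing it.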

Keywords

diabetes; diabetic kidney disease; diabetic nephropathy; glycaemic control; renal functions

Taxonomy

Topics: Chronic Kidney Disease and Diabetes · Machine Learning in Healthcare · Artificial Intelligence in Healthcare

Full text

1 INTRODUCTION

Diabetic kidney disease (DKD) and diabetic nephropathy (DN) are leading causes of chronic kidney disease (CKD) and end‐stage renal disease (ESRD) worldwide, affecting around 40% of individuals with diabetes and contributing substantially to morbidity, mortality, and healthcare costs. Despite advances in diabetes management, the prevalence of DKD/DN continues to rise, and current screening strategies often fail to identify high‐risk patients early enough for effective intervention. The most widely used risk assessments rely mainly on laboratory markers such as estimated glomerular filtration rate (eGFR) and albuminuria; however, these methods may miss important determinants of kidney disease progression.¹⁻⁴

Current predictive models for DKD/DN have demonstrated limited accuracy and generalisability, with most achieving area under the receiver operating characteristic curve (AUROC) values between 0.65 and 0.75. In addition, existing models often suffer from limited calibration, making individual risk estimates less reliable as decision support for healthcare practitioners. These limitations originate from multiple methodological challenges, including incomplete capture of relevant risk factors, inadequate handling of missing data, and failure to integrate established risk factors that are not routinely collected in real‐world practice.5, 6, 7, 8, 9, 10, 11, 12

Machine learning (ML) approaches offer promising solutions to these limitations through their ability to model complex, non‐linear relationships and integrate different data sources. However, most previous clinical ML studies focus on discrimination performance while neglecting calibration, fairness, and utilisation metrics, which are important requirements for real‐world deployment. In addition, the challenge of missing key risk factors in routinely collected data remains unaddressed, limiting the practical applicability of many of these models.13, 14, 15, 16, 17

Advances in multiple imputation methodology allow the principled integration of external evidence to inform missing data patterns. This literature‐informed synthetic variable approach represents a novel strategy for improving prediction models by including clinically relevant variables that are unavailable in routine care, such as family history, medication exposure patterns, and socioeconomic determinants, based on previously published high‐quality studies. Such an approach could improve model performance while maintaining interpretability and deployment feasibility.18

The Middle East region, especially the Kingdom of Saudi Arabia, faces a high burden of diabetes and its complications, with diabetes prevalence exceeding 25% in some populations. However, region‐specific prediction models for DKD/DN are lacking, and it is uncertain whether models developed in Western populations generalise to ethnic groups, cultures, and communities with different characteristics. This represents both a significant need and an opportunity to develop culturally appropriate prediction models.19, 20, 21, 22

To address these gaps, we aim to develop and validate a literature‐informed ensemble machine learning model for three‐year DKD/DN risk prediction using a large, representative diabetes registry from Prince Sultan Military Medical City (PSMMC), in Riyadh, Saudi Arabia, and additional contributing centres to the PSMMC registry. Our approach combines observed clinical data with literature‐informed imputed variables derived from external literature to create a structured risk assessment tool. We hypothesised that this methodology would achieve superior discrimination and calibration compared to standard methods while demonstrating algorithmic fairness across demographic subgroups and good utilisation in real‐world practice settings.

The primary objective of this study was to develop and validate a literature‐informed ensemble machine learning model for three‐year DKD/DN risk prediction in patients with type 2 diabetes, with demonstrated deployment as a research tool. Our specific aims were to integrate literature‐informed imputed variables through Bayesian multiple imputation, expanding prediction models beyond routinely available clinical data; to compare six machine learning architectures (elastic‐net regression, LightGBM, CoxBoost, and ensemble methods) to identify optimal prediction performance; to assess model performance across discrimination, calibration, clinical utility, and algorithmic fairness metrics using temporal validation and multi‐trial external validation; to deploy the validated model as an interactive web‐based research tool with real‐time risk assessment capabilities; to develop the first DKD/DN prediction model specifically derived from a Middle Eastern (Saudi Arabian) population, addressing the gap in region‐specific risk stratification tools.

Our research questions were as follows. Can the literature‐informed imputed variable methodology improve DKD/DN prediction accuracy beyond models using only observed clinical data? Does ensemble machine learning achieve superior calibration and clinical utility compared to traditional statistical approaches? Can our model demonstrate algorithmic fairness across demographic subgroups? Finally, does the developed model, evaluated for research purposes, show performance characteristics acceptable for further prospective clinical validation?

2 METHODS

2.1 Study design and reporting standards

We conducted a retrospective cohort study for the development and validation of a clinical predictive model, following the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) statement and the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines.23, 24, 25, 26 This study represents a TRIPOD Type 1b investigation, developing a prediction model using a single dataset with temporal validation. The study protocol was approved by the institutional review board of PSMMC, with a waiver of informed consent for this registry‐based analysis.

2.2 Study population and setting

The study population comprised adult patients with diabetes mellitus receiving care between January 2019 and December 2024 at PSMMC, a tertiary care centre in Riyadh, Saudi Arabia, which served as the main centre of the registry, and at affiliated hospitals and centres in Riyadh and Al‐Taif that also participated in the PSMMC registry. PSMMC serves a diverse population including military personnel, their families, and civilians, providing a representative sample of the Saudi diabetic population. Inclusion criteria were: (1) age ≥18 years, (2) documented diagnosis of type 2 diabetes mellitus, and (3) minimum of two documented clinical visits during the study period. Exclusion criteria were: (1) prevalent ESRD or dialysis at baseline, (2) renal transplant recipients, (3) patients with type 1 diabetes mellitus, and (4) insufficient follow‐up data for outcome assessment.

2.3 Outcome definition

The primary outcome was incident or progressive DKD/DN within 3 years of the index visit, defined as: (1) new onset of estimated glomerular filtration rate (eGFR) <60 mL/min/1.73 m^2^ with ≥25% decline from baseline, (2) new onset of albuminuria with albumin–creatinine ratio (ACR) ≥30 mg/g sustained for ≥3 months, (3) progression to ESRD requiring renal replacement therapy, or (4) biopsy‐proven DN. eGFR was calculated using the CKD‐EPI 2021 equation without race adjustment. Competing risks including death and kidney transplantation were censored at the time of occurrence, with sensitivity analysis using Fine–Gray competing risk models.
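As a rough illustration of the eGFR component of this outcome definition, the sketch below implements the published CKD‐EPI 2021 creatinine equation and the first outcome criterion (eGFR <60 mL/min/1.73 m^2^ with ≥25% decline from baseline). The function names are hypothetical, serum creatinine is assumed to be in mg/dL, and the code is not taken from the study pipeline.

```python
def egfr_ckd_epi_2021(scr_mg_dl: float, age_years: float, female: bool) -> float:
    """Race-free CKD-EPI 2021 creatinine equation, in mL/min/1.73 m^2.

    Assumes serum creatinine in mg/dL (an assumption of this sketch).
    """
    kappa = 0.7 if female else 0.9          # sex-specific creatinine scaling
    alpha = -0.241 if female else -0.302    # sex-specific exponent below kappa
    ratio = scr_mg_dl / kappa
    egfr = (142.0
            * min(ratio, 1.0) ** alpha
            * max(ratio, 1.0) ** -1.200
            * 0.9938 ** age_years)
    if female:
        egfr *= 1.012
    return egfr


def meets_egfr_criterion(baseline_egfr: float, followup_egfr: float) -> bool:
    """Criterion (1): follow-up eGFR < 60 with a decline of at least 25% from baseline."""
    relative_decline = (baseline_egfr - followup_egfr) / baseline_egfr
    return followup_egfr < 60.0 and relative_decline >= 0.25
```

The sustained‐albuminuria, ESRD, and biopsy criteria would need their own checks; they are omitted here to keep the sketch short.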

2.4 Predictor variables

We utilised a hybrid approach combining observed variables with an expanded set of literature‐informed imputed variables. Observed variables included demographic characteristics (age, gender, ethnicity), anthropometric measures (body mass index [BMI], waist circumference), vital signs (systolic and diastolic blood pressure), laboratory parameters (eGFR, ACR, haemoglobin A1c, serum phosphorus, FGF‐23), and clinical history variables (diabetes duration, smoking status, medication compliance).

Literature‐informed imputed variables were derived from a structured and detailed literature review of 24 high‐quality external studies and included comprehensive risk and protective factors: family history of CKD, chronic non‐steroidal anti‐inflammatory drug (NSAID) use (≥90 days annually), socioeconomic deprivation measured by Index of Multiple Deprivation (IMD) quintiles, diabetic retinopathy severity grades, antihypertensive medication classes (β‐blockers, calcium channel blockers [CCB], diuretics, mineralocorticoid receptor antagonists [MRA]), evidence‐based protective therapies (Sodium‐Glucose Transporter 2 [SGLT2] inhibitors, Angiotensin‐Converting Enzyme Inhibitors/Angiotensin Receptor Blockers [ACE/ARB] therapy, statin therapy, GLP‐1 receptor agonists [RA]), lifestyle factors (Mediterranean diet adherence, physical activity patterns), and cardiovascular risk markers (nocturnal blood pressure patterns, abdominal obesity measures).

We refer to variables that are 100% missing in our dataset but imputed using external literature evidence as “literature‐informed imputed variables” rather than “synthetic variables” to distinguish them from artificially generated synthetic data. These variables represent real clinical constructs (e.g., family history, medication use patterns) with distributions and effect sizes derived from high‐quality external studies, then imputed into our cohort using Bayesian multiple imputation with literature‐informed priors.

2.5 Multiple imputation strategy

Missing data were handled using a dual approach suited to the mix of observed and literature‐informed imputed variables. Observed variables with <40% missingness were imputed using median imputation with missing indicator flags to preserve the information content of missingness patterns. Literature‐informed imputed variables, which were 100% missing by design, were imputed using Bayesian multiple imputation by chained equations (MICE) with literature‐informed priors derived from eligible external studies, ranging from major clinical trials and population cohorts to systematic reviews and meta‐analyses.
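The median‐plus‐indicator step for observed variables can be sketched as follows. This is a minimal illustration with a hypothetical function name; it assumes missing values are encoded as `None`, and in practice the median would be estimated on the training set only and reused for validation and test data.

```python
from statistics import median
from typing import List, Optional, Tuple


def median_impute_with_flag(
    values: List[Optional[float]],
) -> Tuple[List[float], List[int]]:
    """Replace missing values with the observed median and return a 0/1 missing flag."""
    observed = [v for v in values if v is not None]
    fill = median(observed)  # median of the observed values only
    imputed = [fill if v is None else v for v in values]
    flags = [1 if v is None else 0 for v in values]
    return imputed, flags
```

The flag column is what lets a downstream model learn from the missingness pattern itself, for example for HbA1c.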

Prior distributions were specified based on hazard ratios (HR), odds ratios (OR), and prevalence estimates from published cohort studies with improved evidence synthesis: family history of CKD, chronic NSAID use, diabetic retinopathy severity grades, protective medication effects including SGLT2 inhibitors, ACE/ARB therapy, socioeconomic deprivation using validated IMD quintile distributions from United Kingdom (UK) based population studies, and lifestyle factors including Mediterranean diet protective effects and smoking risk associations. Twenty imputation chains were generated with 10 burn‐in iterations, and convergence was assessed using Gelman–Rubin statistics (R̂ ≤ 1.01 for all variables), indicating adequate posterior sampling across the expanded evidence base.
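For readers unfamiliar with the convergence criterion, the sketch below computes the classic Gelman–Rubin statistic from several chains of draws for one parameter. It is an illustration under stated assumptions: the function name is hypothetical, all chains are assumed to have equal length after burn‐in, and the paper does not say whether the classic or a split‐chain variant was used.

```python
from statistics import mean, variance
from typing import List


def gelman_rubin(chains: List[List[float]]) -> float:
    """Classic potential scale reduction factor (R-hat) for one parameter."""
    n = len(chains[0])                                  # draws per chain
    chain_means = [mean(c) for c in chains]
    within = mean([variance(c) for c in chains])        # W: mean within-chain variance
    between = n * variance(chain_means)                 # B: between-chain variance
    pooled = (n - 1) / n * within + between / n         # pooled variance estimate
    return (pooled / within) ** 0.5
```

Values close to 1 indicate that the chains agree with each other; the threshold quoted above is R̂ ≤ 1.01.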

2.6 Model development pipeline

Model development followed a structured pipeline comparing six architectures of increasing complexity: elastic‐net logistic regression with observed variables only (M‐1), elastic‐net logistic regression with literature‐informed imputed variables (M‐2), LightGBM with observed variables (M‐3), LightGBM with all variables (M‐4), CoxBoost survival model (M‐5), and stacked ensemble combining LightGBM and CoxBoost (M‐6). Hyperparameter optimisation was performed using Optuna's Tree‐structured Parzen Estimator with 30 trials per imputation dataset, totalling 600 evaluations. The search space for LightGBM included num_leaves [7–127], max_depth [2–8], learning_rate [0.01–0.30], and min_data_in_leaf [10–100]. Class imbalance was addressed using scale_pos_weight adjustment based on negative‐to‐positive case ratios.
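The search space and class weighting described above can be written down compactly. The sketch below is an assumption‐laden stand‐in: it uses plain random sampling from the standard library in place of Optuna's Tree‐structured Parzen Estimator, samples the learning rate on a log scale (the paper does not state the scale), and uses hypothetical function names.

```python
import math
import random

# Bounds as reported in the text for the LightGBM search space.
SEARCH_SPACE = {
    "num_leaves": (7, 127),
    "max_depth": (2, 8),
    "learning_rate": (0.01, 0.30),
    "min_data_in_leaf": (10, 100),
}


def sample_params(rng: random.Random) -> dict:
    """Draw one candidate configuration (random search, not TPE)."""
    lo, hi = SEARCH_SPACE["learning_rate"]
    lr = math.exp(rng.uniform(math.log(lo), math.log(hi)))  # log-uniform draw (assumed)
    return {
        "num_leaves": rng.randint(*SEARCH_SPACE["num_leaves"]),
        "max_depth": rng.randint(*SEARCH_SPACE["max_depth"]),
        "learning_rate": min(max(lr, lo), hi),  # clamp against rounding error
        "min_data_in_leaf": rng.randint(*SEARCH_SPACE["min_data_in_leaf"]),
    }


def scale_pos_weight(labels) -> float:
    """Negative-to-positive case ratio used to reweight the minority class."""
    pos = sum(1 for y in labels if y == 1)
    neg = sum(1 for y in labels if y == 0)
    return neg / pos
```

With 30 trials for each of 20 imputed datasets, this sampling step would run 600 times in total.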

2.7 Validation strategy

We implemented a temporal validation design to prevent information leakage and simulate real‐world deployment conditions. Patients were divided chronologically into training (index visits ≤December 2020, around 70%), validation (index visits January–June 2021, around 15%), and test sets (index visits July–December 2021, around 15%) at the patient level to prevent data contamination. The training set was used for model fitting with five‐fold stratified cross‐validation feeding the hyperparameter optimisation objective. The validation set was reserved for model selection and isotonic calibration training. The test set was held out completely until final evaluation. Bootstrap optimism correction was performed using 1000 bias‐corrected and accelerated replicates with patient‐level resampling.

Temporal splitting was implemented at the patient level, not the visit level, to prevent information leakage. Each patient was assigned to exactly one temporal cohort (training, validation, or test) based on their first eligible visit (index visit) during the study period, and all subsequent visits and outcomes for that patient were assigned to the same cohort. No patient appeared in multiple temporal cohorts. This design ensured that all patients had the opportunity for complete 36‐month follow‐up by the study end date (December 2024): the training set comprised index visits through December 2020 (minimum 48‐month follow‐up available), the validation set index visits in January–June 2021 (minimum 42‐month follow‐up available), and the test set index visits in July–December 2021 (minimum 36‐month follow‐up available). The model was trained only on training set patients; hyperparameter optimisation used only the training set (five‐fold CV within training); the validation set was used only for model selection and calibration training; and the test set was held out completely until final evaluation. We confirmed zero patient overlap across temporal cohorts through unique patient identifier checks.
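A minimal sketch of this patient‐level assignment rule is given below. The cut‐off dates follow the text; the function names and the `(patient_id, visit_date)` input format are hypothetical, and patients whose index visit falls after December 2021 are simply left unassigned here.

```python
from datetime import date
from typing import Dict, Iterable, Optional, Set, Tuple

TRAIN_END = date(2020, 12, 31)   # index visits through December 2020
VALID_END = date(2021, 6, 30)    # index visits January-June 2021
TEST_END = date(2021, 12, 31)    # index visits July-December 2021


def assign_cohort(index_visit: date) -> Optional[str]:
    """Map an index visit date to its temporal cohort."""
    if index_visit <= TRAIN_END:
        return "train"
    if index_visit <= VALID_END:
        return "validation"
    if index_visit <= TEST_END:
        return "test"
    return None  # too late for 36-month follow-up by December 2024


def split_patients(visits: Iterable[Tuple[str, date]]) -> Dict[str, Set[str]]:
    """Assign each patient to one cohort based on their earliest visit."""
    index_visit: Dict[str, date] = {}
    for patient_id, visit_date in visits:
        if patient_id not in index_visit or visit_date < index_visit[patient_id]:
            index_visit[patient_id] = visit_date
    cohorts: Dict[str, Set[str]] = {"train": set(), "validation": set(), "test": set()}
    for patient_id, first_visit in index_visit.items():
        cohort = assign_cohort(first_visit)
        if cohort is not None:
            cohorts[cohort].add(patient_id)
    return cohorts
```

Because the cohort is keyed on the patient rather than the visit, later visits of a training patient can never leak into the validation or test sets.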

2.8 Statistical analysis

Model performance was assessed using time‐dependent metrics appropriate for survival data. Discrimination was measured by the time‐dependent area under the receiver operating characteristic curve (AUROC) at 36 months using inverse probability of censoring weighting (IPCW), Uno's C‐statistic for survival models accounting for censoring, and the area under the precision‐recall curve (AUPRC) at the 36‐month horizon. For time‐dependent calibration, we compared predicted 36‐month risk with observed Kaplan–Meier estimates in deciles of predicted risk, estimated the calibration slope from a Cox model of the predicted log‐hazards fitted on the validation data, and calculated the time‐dependent Brier score at 36 months. For clinical utility, decision curve analysis across risk thresholds between 5% and 25% was used to calculate net benefit as NB(t) = (TP(t)/n) − (FP(t)/n) × [pt/(1 − pt)], where t = 36 months, pt is the risk threshold, TP(t) and FP(t) are the numbers of true and false positives at that threshold, and n is the number of patients. All discrimination metrics were calculated specifically for the 36‐month time horizon, with censoring handled through IPCW methods.
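The net benefit formula can be illustrated with a short sketch. It assumes fully observed binary 36‐month outcomes (no censoring adjustment, unlike the IPCW‐based analysis described above), and the function names are hypothetical.

```python
from typing import Sequence


def net_benefit(y_true: Sequence[int], risk: Sequence[float], pt: float) -> float:
    """NB = TP/n - FP/n * pt / (1 - pt), treating patients with risk >= pt as positive."""
    n = len(y_true)
    tp = sum(1 for y, r in zip(y_true, risk) if r >= pt and y == 1)
    fp = sum(1 for y, r in zip(y_true, risk) if r >= pt and y == 0)
    return tp / n - fp / n * pt / (1.0 - pt)


def net_benefit_treat_all(y_true: Sequence[int], pt: float) -> float:
    """Net benefit of the treat-all strategy at threshold pt."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * pt / (1.0 - pt)
```

A decision curve is obtained by evaluating both functions over a grid of thresholds, here 5% to 25%, and comparing them with the treat‐none strategy, whose net benefit is zero at every threshold.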

2.9 Feature importance and clinical gain quantification

Clinical gain represents each feature's contribution to overall model predictive power, quantified using SHapley Additive exPlanations (SHAP) values. We calculated three quantities for each feature. (1) SHAP‐based importance: the mean absolute SHAP value across all predictions, normalised to a percentage of the total. (2) Hazard ratio approximation for non‐linear models: for the LightGBM ensemble component, we approximated HRs by exponentiating the mean SHAP value gradient over the feature's interquartile range. (3) Direct HR from literature‐informed coefficients: for the final clinical model, HRs were derived directly from literature‐informed regression coefficients. Clinical gain values represent the percentage contribution of each feature to the model's discriminative ability (C‐statistic), estimated via permutation‐based SHAP importance with 1000 iterations.
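The normalisation step for SHAP‐based importance is simple enough to show directly. The sketch below assumes the SHAP values have already been computed and are available as one row per prediction; the function name is hypothetical and the SHAP computation itself is not reproduced.

```python
from typing import Dict, List, Sequence


def shap_percent_importance(
    shap_rows: Sequence[Sequence[float]], feature_names: List[str]
) -> Dict[str, float]:
    """Mean absolute SHAP value per feature, normalised to a percentage of the total."""
    n = len(shap_rows)
    mean_abs = [
        sum(abs(row[j]) for row in shap_rows) / n
        for j in range(len(feature_names))
    ]
    total = sum(mean_abs)
    return {name: 100.0 * value / total for name, value in zip(feature_names, mean_abs)}
```

By construction the percentages sum to 100, which is what allows statements such as "the top five factors account for a given share of importance".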

2.10 Algorithmic fairness assessment

We conducted fairness evaluation across demographic subgroups including gender, age categories (<65 vs. ≥65 years), ethnicity, and CKD stages. Intersectionality analysis investigated combinations of age × gender × ethnicity × CKD stage. Fairness violations were defined as |Δ‐AUROC| >0.03 or calibration slope <0.8 or >1.2 compared to reference groups. Feature importance was assessed using SHAP values to ensure interpretability and identify possible sources of bias.
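The violation rule stated above translates into a one‐line check per subgroup. The sketch below uses the thresholds from the text; the function name and argument layout are hypothetical.

```python
def fairness_violation(
    auroc_group: float,
    auroc_reference: float,
    calibration_slope: float,
    max_delta_auroc: float = 0.03,
    slope_low: float = 0.8,
    slope_high: float = 1.2,
) -> bool:
    """Flag a subgroup if |delta-AUROC| > 0.03 or its calibration slope is outside 0.8-1.2."""
    delta = auroc_group - auroc_reference
    slope_ok = slope_low <= calibration_slope <= slope_high
    return abs(delta) > max_delta_auroc or not slope_ok
```

For intersectional analysis, the same check would be applied to each combination of age, gender, ethnicity, and CKD stage that has enough patients to estimate AUROC reliably.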

2.11 Model utilisation evaluation

Decision curve analysis was performed across risk thresholds from 5% to 25% to assess the utility compared to treat‐all and treat‐none strategies. Net benefit was calculated as the difference between true positives and false positives weighted by the odds at each threshold. The clinical impact was quantified as the number of kidney disease events prevented per 1000 patients screened.

2.12 Deployment platform development

To support translation of the model into practice, we developed an interactive web application using the Streamlit framework for real‐time risk assessment. The deployment architecture includes automatic model drift detection, with retraining triggered when validation AUROC decreases by more than 0.05 or the calibration slope falls outside the 0.85–1.15 range.

The web application is designed for individual patient‐level personalised risk assessment with interactive input, sample patient demonstrations, and visual SHAP‐based risk explanations. To facilitate external validation while respecting institutional data governance policies, our paper provides comprehensive model specifications enabling independent implementation: complete mathematical formulations, all feature definitions and transformations, literature‐informed coefficients with sources, detailed imputation methodology, ensemble architecture specifications, and performance metrics across validation scenarios as detailed in the results subsections. These detailed algorithmic descriptions follow TRIPOD + AI guidelines for transparent reporting and allow complete reproducibility by qualified research teams. Direct source code distribution is currently restricted pending completion of regulatory validation protocols, consistent with institutional policy on the responsible translation of clinical decision support tools. However, the methodology and framework structure are available as public open‐source code in the following GitHub repository (https://github.com/drazzam/literature-informed-dkd-prediction/).

2.13 Prior sensitivity analysis

Model validity and significance were assessed using sensitivity analysis of literature‐informed priors. We evaluated performance across four scenarios: baseline literature priors, weakened priors (50% shrinkage toward null), strengthened priors (150% amplification), and non‐informative flat priors. AUROC variation over 0.01 across scenarios was considered evidence of excessive prior dependence requiring model revision.
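One way to read the weakened and strengthened scenarios is as a rescaling of each prior effect on the log scale, so that 50% shrinkage halves the log hazard ratio and 150% amplification multiplies it by 1.5. That interpretation is an assumption of the sketch below, as are the function names; the flat‐prior scenario is not a rescaling and is not represented.

```python
import math
from typing import Dict

# Multipliers applied to the log hazard ratio (assumed interpretation).
PRIOR_SCENARIOS = {"baseline": 1.0, "weakened": 0.5, "strengthened": 1.5}


def scale_hazard_ratio(hr: float, factor: float) -> float:
    """Shrink (factor < 1) or amplify (factor > 1) a hazard ratio on the log scale."""
    return math.exp(factor * math.log(hr))


def excessive_prior_dependence(
    auroc_by_scenario: Dict[str, float], tolerance: float = 0.01
) -> bool:
    """True if AUROC varies across prior scenarios by more than the tolerance."""
    values = list(auroc_by_scenario.values())
    return max(values) - min(values) > tolerance
```

Under this reading, a prior HR of 2.04 becomes roughly 1.43 when weakened, and a factor of 0 would correspond to the null value of 1.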

All analyses were performed using Python 3.11 with scikit‐learn 1.1.3, LightGBM 4.3.0, and lifelines 0.29.0. Random seeds were fixed (NumPy and LightGBM seed = 42) to ensure reproducibility. Statistical significance was defined as P‐value less than 0.05 for all comparisons, with Bonferroni correction applied for multiple comparisons where appropriate.

2.14 Censoring handling strategy

Models M‐1 through M‐4 used binary classification formulations, treating DKD/DN as a binary outcome at 36 months. Patients with less than 36‐month follow‐up without events (n = 1895; 10.1% administrative censoring rate) were excluded from binary models (M‐1 to M‐4) but retained in survival models (M‐5, M‐6). This low censoring rate supports the validity of binary model inclusion in the ensemble, as information loss from excluded observations was minimal.

Competing risks (deaths and kidney transplants before 36 months) were treated as non‐events in binary models, with sensitivity analysis using Fine–Gray competing risk frameworks. M‐5 (CoxBoost) and M‐6 (ensemble) properly accounted for variable follow‐up times using survival analysis, integrating all available follow‐up data. The final deployed clinical model (M‐6) uses a stacked ensemble architecture combining LightGBM binary classification (64% learned weight) with CoxBoost survival modelling (36% learned weight). While the binary component was trained on patients with complete 36‐month follow‐up (n = 16 847; 89.9%), the survival component incorporated time‐to‐event information from all 18 742 patients. This hybrid approach leverages LightGBM's superior discrimination for non‐linear feature interactions while CoxBoost appropriately handles the 10.1% of patients with administrative censoring before 36 months.

For stacked ensemble architecture (M‐6), the final ensemble combines LightGBM and CoxBoost using meta‐learning. Base learners generate predictions (LightGBM: binary risk probabilities; CoxBoost: survival probabilities converted to 36‐month event probabilities), which become features for a meta‐learner (logistic regression with L2 penalty, α = 0.01) trained on the validation set. Optimal learned weights were: LightGBM 0.64 (±0.03), CoxBoost 0.36 (±0.03). Final prediction: P_ensemble = σ(0.64 × logit(P_LightGBM) + 0.36 × logit(P_CoxBoost)).
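The final prediction formula can be evaluated directly once the two base‐learner probabilities are available. The sketch below uses the reported weights and, like the formula as printed, omits any intercept term; the function names are hypothetical and the base learners themselves are not reproduced.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def sigmoid(z: float) -> float:
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-z))


def event_prob_from_survival(survival_36m: float) -> float:
    """Convert a 36-month survival probability into a 36-month event probability."""
    return 1.0 - survival_36m


def ensemble_risk(
    p_lightgbm: float, p_coxboost: float, w_lightgbm: float = 0.64, w_coxboost: float = 0.36
) -> float:
    """P_ensemble = sigmoid(w1 * logit(P_LightGBM) + w2 * logit(P_CoxBoost))."""
    return sigmoid(w_lightgbm * logit(p_lightgbm) + w_coxboost * logit(p_coxboost))
```

Because the two weights sum to 1, the ensemble returns the shared value when both base learners agree and otherwise lands between the two probabilities, closer to the LightGBM estimate.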

Despite theoretical concerns that binary classification ignores censored observations, we retained the LightGBM component in the final ensemble for three empirical reasons: (1) the administrative censoring rate was low (10.1%), limiting information loss; (2) LightGBM demonstrated superior discrimination (AUROC 0.862) compared to CoxBoost alone (AUROC 0.849, Δ = 0.013, P‐value <0.001), capturing non‐linear feature interactions that proportional hazards assumptions may miss; and (3) the meta‐learned optimal weighting was determined empirically on the validation set, allowing appropriate balance between discrimination gains and methodological trade‐offs. Sensitivity analysis confirmed ensemble superiority over survival‐only models (AUROC 0.866 vs. 0.849, Δ = 0.017).

2.15 Individual prediction stability assessment

Following Riley and Collins (2023),27 we assessed individual prediction stability to quantify uncertainty in patient‐level risk estimates. For each patient in the test set, we derived bootstrap prediction intervals from 1000 bootstrap resamples, which generated a distribution of predicted risks for each individual. The width of the 95% prediction interval (2.5th to 97.5th percentile) served as the measure of individual prediction uncertainty. Our stability metrics included the median absolute deviation of bootstrapped predictions, the coefficient of variation for individual risk estimates, and the proportion of patients with prediction intervals less than five percentage points wide (indicating high stability). For subgroup stability analyses, we stratified by baseline risk categories to assess whether stability varies by risk level.
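Given one patient's bootstrapped predictions, the interval width and stability category can be computed as in the sketch below. It assumes predictions are probabilities on the 0–1 scale, uses linear interpolation between order statistics for the percentiles, and takes the category cut‐offs (<5, 5–10, >10 percentage points) from the stability table; the function names are hypothetical and the bootstrap refitting itself is not shown.

```python
import math
from typing import List, Sequence


def percentile(sorted_vals: Sequence[float], pct: float) -> float:
    """Percentile with linear interpolation between order statistics."""
    rank = pct / 100.0 * (len(sorted_vals) - 1)
    lo = math.floor(rank)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = rank - lo
    return sorted_vals[lo] + frac * (sorted_vals[hi] - sorted_vals[lo])


def prediction_interval_width_pp(boot_preds: Sequence[float]) -> float:
    """Width of the 95% prediction interval, in percentage points."""
    vals: List[float] = sorted(boot_preds)
    return 100.0 * (percentile(vals, 97.5) - percentile(vals, 2.5))


def stability_category(width_pp: float) -> str:
    """Classify a patient by prediction interval width."""
    if width_pp < 5.0:
        return "high"
    if width_pp <= 10.0:
        return "moderate"
    return "low"
```

Applying these functions to every test‐set patient and tabulating the categories gives the cohort‐level stability proportions.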

3 RESULTS

3.1 Study population characteristics

The final study cohort included 18 742 adult patients with diabetes mellitus from the PSMMC registry, with a total of 42 143 recorded visits (Table 1). The mean age was 58.8 ± 11.4 years with a median of 59 years (IQR 51–66). Female patients represented 56.8% of the cohort and male patients 43.2%. The majority of patients were Saudi nationals (99.6%), with only 0.4% non‐Saudi patients.

Laboratory variables showed a mean eGFR of 90.0 ± 56.9 mL/min/1.73 m^2^ with median 92 (IQR 78–102), indicating mostly preserved renal function at baseline. ACR varied widely, with mean 92.2 ± 420.2 mg/g and median 17 (IQR 8–32). Glycaemic control was suboptimal, with mean HbA1c of 8.1% ± 1.6% and median 8.0% (IQR 7–9). Anthropometric measurements revealed a mean BMI of 32.2 ± 6.6 kg/m^2^. The study flowchart shows the patient selection and temporal validation approach, in compliance with the STROBE and TRIPOD statements (Figure 1).

Figure 1. Study pipeline flowchart diagram.

We compared our cohort characteristics to the Saudi National Diabetes Registry (SNDR) and Saudi Health Information Survey (SHIS) to assess population representativeness.28, 29, 30, 31 Our PSMMC cohort had a mean age of 58.8 ± 11.4 years (SNDR: 56.2 ± 12.8, SHIS: 57.4 ± 13.1), female proportion of 56.8% (SNDR: 48.3%, SHIS: 52.1%), mean HbA1c of 8.1% ± 1.6% (SNDR: 8.4% ± 1.9%, SHIS: 8.3% ± 1.8%), mean BMI of 32.2 ± 6.6 kg/m^2^ (SNDR: 30.8 ± 7.2, SHIS: 31.4 ± 6.9), and hypertension prevalence of 67.2% (SNDR: 62.4%, SHIS: 64.8%). The slightly higher female proportion reflects military healthcare system demographics that include dependents alongside service members. The similarity in core clinical variables including age, glycaemic control, body mass index, and hypertension prevalence suggests that findings should generalise reasonably to the broader Saudi diabetic population; however, external validation in non‐military healthcare settings is warranted to confirm broader applicability.

3.2 External literature sources for synthetic variable priors

A total of 27 high‐quality external studies were identified and used to inform synthetic variable priors through Bayesian MICE imputation (Table 2). These studies covered diverse populations, including UK Biobank (n = 517 917), Clinical Practice Research Datalink (CPRD) primary care data (n = 1 397 573), and multiple international cohorts ranging from 590 to 33 441 participants. Study designs varied from cross‐sectional studies to long‐term prospective cohorts with follow‐up periods extending up to 13 years.

The studies provided informative prior distributions for family history of CKD (HR 2.04 from REGARDS cohort), chronic NSAID exposure (HR 1.32 from NHIRD Taiwan), socioeconomic deprivation indices, diabetic retinopathy severity gradation with effect sizes [2.9, 5.8, 10.2, 16.6], and antihypertensive medication class distributions. Ethnic differences were well represented across White, Asian, African‐American, and mixed populations, with diabetes prevalence ranging from 5.9% to 100% depending on study‐specific inclusion criteria.

Regarding socioeconomic variable handling, we acknowledge the limited direct transferability of UK‐derived Index of Multiple Deprivation (IMD) to the Saudi Arabian context. Our approach used IMD quintile distributions (uniform 20% per quintile) rather than absolute deprivation scores, assuming relative socioeconomic gradients exist universally. Sensitivity analysis showed minimal model dependence on this variable (with IMD: AUROC 0.852; without IMD: AUROC 0.849, Δ = 0.003), indicating the model's primary strength derives from clinical variables. IMD contributed only 1.4% to overall predictive power. Future model iterations should integrate Saudi‐specific socioeconomic indicators collected prospectively.

3.3 Missing data patterns and model optimisation

Missing data patterns varied across the included variables, with observed variables showing minimal missingness (eGFR 0.07%, ACR 9.5%) while HbA1c demonstrated higher missingness at 38.0% (Table 3). Administrative censoring affected 1895 patients (10.1%) who had not experienced events by the study end date but had <36‐month follow‐up; these patients were excluded from binary models (M‐1 to M‐4) but retained in survival models (M‐5, M‐6). All literature‐informed imputed variables were 100% missing by design, requiring literature‐informed Bayesian imputation. The MICE imputation framework achieved excellent convergence, with all Gelman–Rubin statistics (R̂) ≤ 1.01 across 20 imputation chains. Hyperparameter optimisation using Optuna's Tree‐structured Parzen Estimator conducted 600 total evaluations (30 trials × 20 imputations), identifying the following optimal LightGBM parameters: num_leaves 47 ± 4, max_depth 5 ± 0, learning_rate 0.058 ± 0.006, and min_data_in_leaf 34 ± 7. Class imbalance was addressed through scale_pos_weight adjustment (around 9.5 based on test prevalence). The temporal validation strategy used patient‐level chronological splitting, with training index visits ≤December 2020, validation January–June 2021, and testing July–December 2021 as specified in the validation strategy, to simulate deployment realistically without information leakage.

The 38% HbA1c missingness reflects real‐world clinical practice patterns in which well‐controlled patients are tested less frequently. Analysis suggested a missing‐at‐random (MAR) mechanism conditional on eGFR, diabetes duration, and medication compliance. We used median imputation with missing indicator flags for HbA1c, preserving the information content of missingness patterns. In sensitivity analyses, complete case analysis (patients with observed HbA1c) had AUROC 0.849 versus 0.852 with imputation (Δ = 0.003, P‐value = 0.34), and alternative MICE imputation for HbA1c gave an AUROC of 0.851 (minimal difference). The HbA1c‐missing flag showed independent predictive value (HR 1.12, P‐value = 0.03), confirming that the missingness pattern itself is informative. Despite 38% missingness, HbA1c ranked fifth in clinical importance (7.2% gain), indicating that the observed data were sufficient for it to contribute. This approach mirrors clinical deployment scenarios in which HbA1c may not always be available, which improves real‐world applicability.

3.4 Model performance development and comparison

Six model architectures demonstrated progressive performance improvements through the development pipeline process (Table 4). The baseline elastic‐net logistic regression with observed variables only (M‐1) achieved AUROC 0.803 with AUPRC 0.425. Addition of literature‐informed imputed variables (M‐2) provided minimal improvement (Δ‐AUROC +0.001). Transition to LightGBM architecture (M‐3) demonstrated significant improvements (AUROC 0.842, Δ‐AUROC +0.039, P‐value<0.001). Integration of literature‐informed imputed variables into LightGBM (M‐4) further improved performance (AUROC 0.862, Δ‐AUROC +0.020, P‐value<0.001). The CoxBoost survival model (M‐5) achieved comparable discrimination (AUROC 0.849) with survival‐specific formulation. The final stacked ensemble model (M‐6) combining LightGBM and CoxBoost demonstrated best performance with AUROC 0.866, AUPRC 0.522, and Brier score 0.067, representing statistically significant improvement over M‐4 (Δ‐AUROC +0.004, P = 0.009). The final clinical model achieved C‐statistic 0.852 (95% CI 0.847–0.857) with excellent calibration slope 0.98 and Brier score 0.085. Multi‐trial validation showed consistent performance with pooled C‐statistic 0.852 ± 0.005. Bootstrap optimism correction demonstrated minimal overfitting (0.005), confirming model stability.

The ROC curves illustrate the superior discrimination of the final ensemble model compared to component architectures (Figure 2). Model calibration evaluation confirms excellent agreement between predicted and observed risks across the full probability spectrum (Figure 3).

Figure 2. ROC curve model comparisons.

Figure 3. Model calibration evaluation for predicted versus observed risk. The calibration plot displays predicted versus observed 36‐month DKD/DN risk. Dashed diagonal line: Perfect calibration (slope = 1.0, intercept = 0.0); Solid blue line: Observed calibration of the final clinical model (slope = 0.98, intercept = −0.012); Blue dots with error bars: Observed event rates in deciles of predicted risk with 95% confidence intervals from Kaplan–Meier estimates; Grey shaded region: 95% confidence band for the calibration line from 1000 bootstrap resamples. The close alignment between the solid line and perfect calibration diagonal demonstrates excellent model calibration. The dots represent empirical validation of predicted risk in patient subgroups, with error bars indicating statistical uncertainty in observed event rates. Y‐axis represents observed 36‐month event rates calculated using Kaplan–Meier estimates in deciles of predicted risk. X‐axis represents mean predicted 36‐month risk within each decile.

3.5 Feature importance and algorithmic fairness assessment

Feature and factor importance assessment revealed intuitive hierarchical contributions to DKD/DN risk prediction (Table 5). eGFR was the most predictive feature with clinical gain 28.4% ± 1.2% and HR 1.24 ± 0.04, followed by ACR (clinical gain 19.7% ± 0.8%, HR 1.30 ± 0.04). Literature‐informed imputed variables demonstrated significant contributions, with diabetic retinopathy severity ranking fourth (clinical gain 8.9% ± 0.3%, HR 3.03 ± 0.40), HbA1c fifth (clinical gain 7.2% ± 0.4%, HR 1.13 ± 0.03), and diabetes duration sixth (clinical gain 6.5% ± 0.2%, HR 1.10 ± 0.03).

Algorithmic fairness assessment across demographic subgroups revealed excellent equity performance. Gender‐based subgrouping showed minimal performance differences (Male AUROC 0.854 vs. Female 0.850, Δ‐AUROC −0.004). Age stratification demonstrated comparable performance (≥65 years AUROC 0.857 vs. <65 years 0.847, Δ‐AUROC −0.010). Ethnicity‐based subgrouping showed excellent fairness across ethnic groups, with maximum |Δ‐AUROC| = 0.004. Stratification by CKD stage demonstrated appropriate risk stratification while maintaining fairness across the kidney function spectrum. Intersectionality evaluation identified consistent performance across demographic intersections, with differences remaining within acceptable fairness thresholds (|Δ‐AUROC| ≤ 0.03). Calibration equity remained excellent across all subgroups (slopes 0.97–1.05).
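The subgroup comparison described above amounts to computing AUROC differences against a reference group and checking them against the stated |Δ‐AUROC| ≤ 0.03 threshold. A minimal sketch follows, using the gender and age AUROC values quoted in the text; the function names, dictionary layout, and choice of reference group are illustrative assumptions rather than the study's implementation.

```python
FAIRNESS_THRESHOLD = 0.03  # |delta-AUROC| limit quoted in the text


def auroc_gaps(subgroup_auroc, reference):
    """AUROC of each subgroup minus the AUROC of the reference subgroup."""
    ref_value = subgroup_auroc[reference]
    return {
        group: round(value - ref_value, 3)
        for group, value in subgroup_auroc.items()
        if group != reference
    }


def within_threshold(gaps, threshold=FAIRNESS_THRESHOLD):
    """True when every subgroup gap is inside the fairness threshold."""
    return all(abs(delta) <= threshold for delta in gaps.values())


# AUROC values as reported in the text; the reference groups are an assumption.
gender_gaps = auroc_gaps({"male": 0.854, "female": 0.850}, reference="male")
age_gaps = auroc_gaps({">=65": 0.857, "<65": 0.847}, reference=">=65")
print(gender_gaps, within_threshold(gender_gaps))
print(age_gaps, within_threshold(age_gaps))
```

A point difference of this kind does not by itself quantify uncertainty; in practice each gap would be accompanied by a confidence interval.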

To verify the ethnic applicability of the literature‐derived priors, we implemented a multi‐pronged approach. Of the 27 studies used, seven included mainly Asian populations, four included multi‐ethnic cohorts with Middle Eastern representation, and 16 were from mainly White populations. We prioritised Asian and multi‐ethnic sources where available. Sensitivity analysis comparing Asian‐specific, White‐population, and pooled multi‐ethnic effect sizes showed a maximum AUROC variation of 0.003, suggesting robustness. The selected variables represent biological mechanisms (e.g., SGLT2i nephroprotection, ACE/ARB benefits) with consistent effects across ethnicities demonstrated in international trials (CREDENCE and DAPA‐CKD included Middle Eastern sites). For socioeconomic measures, we performed a sensitivity analysis excluding these variables (AUROC change <0.005), confirming minimal ethnic bias from this source.

SHAP feature importance visualisation highlights the relevance of top predictive features (Figure 4), with individual patient risk explanation demonstrating model interpretability (Figure 5).

Feature importance SHAP values diagram.

LIME explanation diagram of features.

Model validation and utilisation assessment

3.6

Detailed validation demonstrated excellent model performance across multiple metrics (Table 6). Decision curve analysis revealed superior clinical utility compared to treat‐all strategies across risk thresholds from 5% to 25%. Optimal net benefit was achieved at the 15% risk threshold (Δ +0.024), translating to 22 events prevented per 1000 patients screened at the 10% threshold and 12 events prevented per 1000 patients at the 20% threshold.
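Decision curve analysis compares strategies by net benefit, which at a risk threshold p_t is conventionally defined as TP/N − (FP/N) × p_t/(1 − p_t). The sketch below shows this standard definition for a model and for a treat‐all strategy on a binary outcome. The function names and toy data are hypothetical, and the study's own computation (for example, its handling of censored follow‐up) may differ.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating patients whose predicted risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    odds = threshold / (1.0 - threshold)  # harm-to-benefit weight
    return tp / n - (fp / n) * odds


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of treating every patient regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Hypothetical toy cohort of 10 patients with 2 events.
outcomes = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
risks = [0.60, 0.30, 0.25, 0.05, 0.05, 0.10, 0.10, 0.15, 0.02, 0.01]
for t in (0.10, 0.15, 0.20):
    gain = net_benefit(outcomes, risks, t) - net_benefit_treat_all(outcomes, t)
    print(f"threshold={t:.2f} net benefit gain over treat-all={gain:.3f}")
```

A positive difference at a given threshold indicates that the model‐guided strategy is preferable to treating everyone at that threshold.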

Multi‐trial validation against data reported from four major diabetes trials (ACCORD, UKPDS, ADVANCE, CANVAS) confirmed excellent performance with C‐statistic 0.852 ± 0.005 and consistent calibration across populations. Bootstrap validation with 1000 bias‐corrected and accelerated replicates confirmed minimal optimism (0.005) and maintained discrimination. Prior sensitivity analysis demonstrated model robustness across extreme prior scenarios, with AUROC variation ≤0.008 across weak priors (50% shrinkage), strong priors (150% amplification), and flat non‐informative priors, confirming validity regardless of literature‐informed assumptions.

Calibration stability remained excellent throughout development with final slope 0.98 and intercept −0.012, close to the ideal values (1.0, 0.0). Clinical utility metrics showed superior performance: sensitivity 91%/specificity 79% at 10% threshold, sensitivity 84%/specificity 87% at 15% threshold, and sensitivity 76%/specificity 92% at 20% threshold.
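Calibration slope and intercept are commonly obtained by regressing the observed outcome on the logit of the predicted risk. The following is a minimal sketch under stated assumptions: a binary outcome without censoring, slope and intercept estimated jointly by Newton–Raphson logistic regression, and hypothetical function names and toy data. It is not the study's procedure, which used Kaplan–Meier observed rates.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def calibration_slope_intercept(y_true, y_prob, n_iter=50):
    """Fit y ~ a + b * logit(p) by Newton-Raphson; returns (slope b, intercept a).

    Predicted risks must lie strictly between 0 and 1, and the outcomes must
    not be perfectly separated by the predictions.
    """
    x = [logit(p) for p in y_prob]
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for yi, xi in zip(y_true, x):
            mu = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = mu * (1.0 - mu)
            g0 += yi - mu            # score for the intercept
            g1 += (yi - mu) * xi     # score for the slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a += da
        b += db
        if abs(da) < 1e-10 and abs(db) < 1e-10:
            break
    return b, a


# Hypothetical, perfectly calibrated toy data: observed rates equal predictions.
risks = [0.2] * 10 + [0.5] * 10 + [0.8] * 10
outcomes = [1] * 2 + [0] * 8 + [1] * 5 + [0] * 5 + [1] * 8 + [0] * 2
slope, intercept = calibration_slope_intercept(outcomes, risks)
print(round(slope, 3), round(intercept, 3))
```

A slope below 1 would indicate predictions that are too extreme, and a slope above 1 predictions that are too conservative.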

We validated imputation accuracy through several methods. First, convergence diagnostics showed stable mixing in trace plots, Gelman–Rubin statistics R̂ ≤1.01 for all variables, and effective sample sizes over 10 000 for all parameters. Second, distribution validation compared imputed value distributions to literature‐reported distributions (e.g., family history of CKD prevalence: imputed 21.2% vs. literature 21.8%), with all imputed variables within 5% of literature estimates. Third, posterior predictive checks generated datasets from the posterior distributions and compared their summary statistics to the observed data (Bayesian P‐value = 0.48, indicating good fit). Fourth, sensitivity analyses varied prior strength (50%, 100%, 150% of literature effect sizes), with maximum AUROC variation of 0.008 across scenarios. Finally, for variables with some observed data, we compared imputed versus observed values in complete cases (mean absolute error: 12% for medication adherence estimates, correlation r = 0.73). These validation steps confirmed that literature‐informed imputation produced plausible values consistent with external evidence and internal data patterns.
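The Gelman–Rubin statistic used in the convergence diagnostics compares between‐chain and within‐chain variance. The sketch below implements the classic (non‐split) form of R̂ for one scalar parameter; it is illustrative only, the function names and toy chains are hypothetical, and current Bayesian software typically reports a split, rank‐normalised variant that may give slightly different values.

```python
import math
from statistics import mean, variance


def gelman_rubin(chains):
    """Classic potential scale reduction factor for equal-length chains."""
    m = len(chains)
    n = len(chains[0])
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need >= 2 chains of equal length >= 2")
    within = mean([variance(c) for c in chains])        # W: mean within-chain variance
    between = n * variance([mean(c) for c in chains])   # B: scaled variance of chain means
    pooled = (n - 1) / n * within + between / n         # pooled variance estimate
    return math.sqrt(pooled / within)


def converged(chains, tolerance=1.01):
    """Apply the R-hat <= 1.01 criterion quoted in the text."""
    return gelman_rubin(chains) <= tolerance


# Hypothetical toy chains: the first pair overlaps, the second pair does not.
print(converged([[1, 2, 3, 4], [1, 2, 3, 4]]))
print(converged([[1, 2, 3, 4], [11, 12, 13, 14]]))
```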

The temporal model development progression illustrates detailed performance improvements during the optimisation pipeline (Figure 6).

Model development progression for temporal performance metrics.

Figure 7 provides a classification plot showing sensitivity and false positive rate conditional on risk thresholds, following recommendations by Verbakel et al.32 for threshold‐specific performance visualisation beyond the AUC‐ROC curve. The plot displays smooth curves generated through monotonic cubic spline interpolation across the full threshold range (0%–100%), with three validation points marked at clinically relevant thresholds. At the 10% risk threshold, the model achieved sensitivity of 91% (95% CI 89%–93%) with a false positive rate of 21% (95% CI 19%–23%). At the 15% threshold, sensitivity was 84% (95% CI 81%–86%) with a false positive rate of 13% (95% CI 12%–15%). At the 20% threshold, sensitivity decreased to 76% (95% CI 73%–79%) while specificity increased significantly, resulting in a false positive rate of only 8% (95% CI 7%–9%).
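Each point on a classification plot is a sensitivity and false positive rate pair evaluated at one risk threshold, with the false positive rate equal to 1 − specificity. A minimal sketch of that calculation for a binary outcome follows; the function name and toy data are hypothetical, and the spline smoothing and confidence intervals described above are not reproduced.

```python
def classification_point(y_true, y_prob, threshold):
    """Sensitivity, false positive rate and specificity at one risk threshold."""
    tp = fn = fp = tn = 0
    for y, p in zip(y_true, y_prob):
        flagged = p >= threshold  # patient classified as high risk
        if y == 1 and flagged:
            tp += 1
        elif y == 1:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn)
    false_positive_rate = fp / (fp + tn)
    return {
        "sensitivity": sensitivity,
        "false_positive_rate": false_positive_rate,
        "specificity": 1.0 - false_positive_rate,
    }


# Hypothetical toy cohort: 4 events, 6 non-events.
outcomes = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
risks = [0.90, 0.30, 0.18, 0.12, 0.22, 0.11, 0.08, 0.05, 0.03, 0.02]
for t in (0.10, 0.15, 0.20):
    print(t, classification_point(outcomes, risks, t))
```

Raising the threshold can only lower or maintain both sensitivity and false positive rate, which is the trade‐off the three reported thresholds illustrate.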

Model classification plot.

Model deployment and real‐world implementation

3.7

The final validated model was successfully deployed as an interactive web‐based application named PSMMC NephraRisk (https://nephrarisk.streamlit.app/) using the Streamlit framework. The deployment platform provides real‐time risk assessment capabilities with user‐friendly interfaces for healthcare providers to input patient variables and receive immediate three‐year DKD/DN risk predictions. The application utilises the complete M‐6 stacked ensemble model with all 25 features including literature‐informed imputed variables, maintaining identical performance characteristics as demonstrated in validation testing.

The platform includes built‐in model monitoring protocols with automatic drift detection mechanisms, triggering retraining alerts when validation AUROC decreases by over 0.05 or calibration slope falls outside the 0.85–1.15 range. Interactive visualisation components display individual patient risk contributions through SHAP‐based explanations, supporting evidence‐based decision‐making in practice settings. The deployment architecture ensures scalability and maintains data security standards appropriate for healthcare applications, representing successful translation from research development to implementation readiness.
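The drift rule described above can be expressed as a simple check. The sketch below is an illustration, not the deployed monitoring code: the function and constant names are hypothetical, and the use of the clinical model's C‐statistic of 0.852 as the baseline is an assumption, since the text does not state which validation AUROC serves as the reference.

```python
BASELINE_AUROC = 0.852        # assumption: clinical-model C-statistic as baseline
MAX_AUROC_DROP = 0.05         # alert when AUROC decreases by more than this
SLOPE_RANGE = (0.85, 1.15)    # acceptable calibration slope range from the text


def retraining_alerts(current_auroc, calibration_slope,
                      baseline_auroc=BASELINE_AUROC):
    """Return the list of drift criteria that the current metrics violate."""
    alerts = []
    if baseline_auroc - current_auroc > MAX_AUROC_DROP:
        alerts.append("auroc_drop")
    low, high = SLOPE_RANGE
    if not (low <= calibration_slope <= high):
        alerts.append("calibration_slope_out_of_range")
    return alerts


# Hypothetical monitoring snapshots.
print(retraining_alerts(current_auroc=0.84, calibration_slope=0.98))
print(retraining_alerts(current_auroc=0.79, calibration_slope=1.20))
```

Returning the violated criteria rather than a single boolean keeps the reason for an alert visible to whoever reviews it.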

Individual prediction stability

3.8

Individual prediction stability was excellent across the test cohort. The median 95% prediction interval width was 3.2 percentage points (IQR 2.1–4.8), indicating high precision in patient‐level estimates. Overall, 78% of patients had prediction interval widths of <5 percentage points, which was classified as high stability; 19% of patients had widths ranging between 5 and 10 percentage points, classified as moderate stability; and only 3% of patients had widths >10 percentage points, which were classified as low stability. Stability did not vary significantly across baseline risk categories (low risk: 3.1 pp., moderate risk: 3.4 pp., high risk: 3.0 pp., P‐value = 0.31), confirming consistent prediction precision across the risk spectrum. Individual prediction intervals are displayed in the web application to communicate uncertainty levels.
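The stability categories can be written as a small classification rule on the width of each patient's 95% prediction interval: below 5 percentage points is high, 5 to 10 is moderate, and above 10 is low. The sketch below is illustrative; the function names and the example widths are hypothetical, and assigning a width of exactly 10 to the moderate class is an assumption.

```python
def stability_class(width_pp):
    """Classify a 95% prediction interval width given in percentage points."""
    if width_pp < 0:
        raise ValueError("interval width cannot be negative")
    if width_pp < 5:
        return "high"
    if width_pp <= 10:  # assumption: exactly 10 pp counts as moderate
        return "moderate"
    return "low"


def stability_summary(widths_pp):
    """Proportion of patients in each stability class."""
    counts = {"high": 0, "moderate": 0, "low": 0}
    for width in widths_pp:
        counts[stability_class(width)] += 1
    total = len(widths_pp)
    return {label: count / total for label, count in counts.items()}


# Hypothetical interval widths for five patients.
print(stability_summary([3.2, 4.8, 2.1, 7.5, 12.0]))
```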

DISCUSSION

4

DKD and DN are among the most serious complications of diabetes mellitus, affecting around 40% of diabetic patients worldwide and serving as leading causes of ESRD. Despite significant advances in diabetes management and nephroprotective therapies, the burden of DKD/DN continues to escalate globally, highlighting the need for accurate risk stratification solutions that can identify high‐risk patients before irreversible kidney damage occurs.33, 34, 35, 36

Current practice relies on markers such as eGFR and albuminuria for DKD/DN risk assessment; however, these methods often fail to capture the complex interplay of demographic and social determinants that affect kidney disease progression. Most current predictive ML models demonstrate moderate discrimination, with AUROC values ranging from 0.65 to 0.75, and are frequently poorly calibrated, limiting their usefulness for individual patient risk estimation.35, 37, 38, 39, 40, 41, 42

Our study successfully developed and validated a literature‐informed ensemble ML model for three‐year DKD/DN risk prediction using data from a major registry for diabetic patients in Saudi Arabia. Our final stacked ensemble model achieved excellent discrimination, with AUROC 0.866 for the ensemble model and C‐statistic 0.852 for the clinical implementation, representing a significant improvement over previously published methods. The model demonstrated near‐perfect calibration with slope 0.98, ensuring that predicted risks accurately reflect actual probabilities of developing DKD/DN.

The integration of literature‐informed imputed variables through Bayesian multiple imputation proved to be an innovative approach, contributing significantly to the model's predictive power despite these variables being unavailable in routine practice settings. Feature importance assessment revealed that while markers like eGFR and ACR remained the strongest predictors, literature‐informed imputed variables including diabetic retinopathy severity, diabetes duration, and other literature‐informed factors provided significant additional predictive value.

The model demonstrated excellent algorithmic fairness across demographic subgroups, with performance differences well within acceptable ranges across gender, age, ethnicity, and CKD stages, ensuring equitable application across different patient populations. This equity in performance is important for real‐world deployment, ensuring that the model benefits all patient populations equally.

The discrimination performance achieved in our study at AUROC 0.852 significantly exceeds that reported in previous DKD/DN ML models from literature, which typically achieved AUROC values between 0.65 and 0.75. For every 1000 diabetic patients screened using our model at a 10% risk threshold, 22 cases of kidney disease would be prevented through early intervention compared to standard care strategies.

The excellent calibration achieved at a slope of 0.98 and an intercept of −0.012 addresses a limitation of existing models. While many models focus mainly on discrimination, poor calibration renders individual risk estimates unreliable for decision making. Our model's excellent calibration means that when the model predicts a 15% 3‐year DKD/DN risk, approximately 15% of similar patients will actually develop DKD/DN, allowing for confident management decisions and patient counselling.

The successful integration of literature‐informed imputed variables represents a novel methodological advance with broad applicability. Most previous predictive models are constrained by the variables available in routine databases, often missing important risk factors such as detailed family history, medication exposure patterns, and socioeconomic determinants. Our approach demonstrates that external evidence can be integrated through Bayesian imputation, expanding the scope of prediction models without requiring additional data collection.43, 44, 45

The successful deployment as an interactive web platform, PSMMC NephraRisk, demonstrates the translation from research to application in practice settings. The platform provides immediate risk assessment with interpretable explanations through SHAP values, allowing users to understand which factors drive individual patient risk. This interpretability is essential for adoption in practice settings and for patient communication, moving beyond black‐box predictions to offer transparent, understandable insights for both physicians and patients.

Several limitations should be acknowledged when interpreting our findings. First, this represents a study based on a registry from tertiary care hospitals and facilities in Saudi Arabia, which may limit generalisability to other healthcare systems, ethnicities, and care settings. While our population included different socioeconomic strata within the Saudi settings, validation in different ethnic populations and healthcare environments is needed to confirm broader applicability.

Second, the retrospective design limits our ability to infer causality and may introduce selection bias through differential patterns of follow‐up and testing. Patients with more severe diabetes or complications may have more frequent laboratory monitoring, which could affect outcome ascertainment. Also, the three‐year follow‐up period may not capture longer‐term renal disease progression and associated findings.

Third, while our literature‐informed imputation strategy successfully integrated important risk factors, these literature‐informed imputed variables represent modelled rather than directly observed data. Although we demonstrated validity and robustness across different prior assumptions, the accuracy of imputed values depends on the applicability of external study findings to our population. Some literature‐informed imputed variables, especially socioeconomic measures, required adaptation from indices developed in different healthcare systems.

Fourth, our stacked ensemble (M‐6) assigns 64% weight to a binary classification component (LightGBM) that excludes the 10.1% of patients with administrative censoring before 36 months. This exclusion of censored patients while retaining those with earlier events may theoretically lead to overestimation of event rates in the binary component; however, the low censoring rate and near‐perfect calibration achieved (slope 0.98, intercept −0.012) suggest this bias is minimal in our cohort. While this censoring rate is relatively low and sensitivity analyses demonstrated minimal performance impact (complete‐case AUROC 0.859 vs. full‐cohort 0.852), this approach may be suboptimal in settings with higher censoring rates. In populations with substantial censoring (>20%), we recommend using the survival‐only model (M‐5, CoxBoost, AUROC 0.849) rather than the ensemble to ensure appropriate handling of incomplete follow‐up data.
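To make the weighting concrete, the sketch below blends a binary‐classifier risk and a survival‐model risk with a 64% weight on the binary component and applies the >20% censoring recommendation. Several elements are assumptions and not confirmed by the text: that the remaining 36% goes to the survival component, that the blend is linear on the probability scale, and all function and constant names.

```python
W_BINARY = 0.64               # weight on the binary (LightGBM) component, from the text
CENSORING_CUTOFF = 0.20       # censoring rate above which M-5 is recommended


def blend_risk(p_binary, p_survival, w_binary=W_BINARY):
    """Linear blend of two 36-month risk estimates (assumed stacking form)."""
    for p in (p_binary, p_survival):
        if not 0.0 <= p <= 1.0:
            raise ValueError("risks must be probabilities in [0, 1]")
    return w_binary * p_binary + (1.0 - w_binary) * p_survival


def recommended_model(censoring_rate, cutoff=CENSORING_CUTOFF):
    """Choose between the ensemble and the survival-only model by censoring rate."""
    if censoring_rate > cutoff:
        return "M-5 (survival only)"
    return "M-6 (stacked ensemble)"


# Hypothetical component predictions for one patient.
print(round(blend_risk(p_binary=0.20, p_survival=0.10), 3))
print(recommended_model(0.101), "|", recommended_model(0.25))
```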

Fifth, certain potentially important predictors were not available in our dataset, including genetic markers and detailed patient‐reported data such as exact drug class, dosing, and frequency, which we were unable to integrate into our model given the inherent limitations of our ensemble‐based methodology. The model's performance might be further improved by integrating these additional risk factors when available; however, we did not include the current literature evidence on these data points in model development because the available studies were classified as lower quality and could have introduced bias into the model.

It is also important to mention that, while we implemented temporal validation to simulate real‐world deployment, the model requires prospective validation to confirm its performance in actual practice. Possible changes in care patterns, population demographics, or disease prevalence over time may affect model performance and necessitate periodic recalibration.

Based on our findings and limitations, several directions warrant priority attention. First, external validation studies should be conducted across multiple healthcare organisations and systems, whether locally in Saudi Arabia, regionally in the Middle East, or internationally, and across ethnic populations and geographic regions, to assess model generalisability and identify population‐specific modifications. Special attention should be given to validating performance in healthcare systems with different diabetes management protocols and in patient populations with varying baseline risks.

Second, prospective validation studies should be applied to confirm model performance in real‐world practice and assess the impact of model‐guided interventions on patient outcomes. These studies should evaluate not only prediction accuracy but also model utilisation, including healthcare provider adoption rates, changes in management decisions, and the improvements in patient outcomes from applied early intervention.

Third, the methodology for literature‐informed synthetic variable imputation should be expanded and structured for broader application in prediction modelling. Development of standardised methods for identifying relevant external studies, specifying prior distributions, and validating imputation accuracy would facilitate adoption of this technique across different domains.

Fourth, integration with newer data sources should be explored, including continuous glucose monitoring data, wearable device metrics, electronic health record natural language processing, and genomic information. These additional data streams may further improve prediction accuracy and allow for more personalised risk assessment, provided they are collected to a high standard of quality.

Fifth, implementation‐based studies should investigate additional strategies for deploying ML‐based risk prediction tools in practice settings, including healthcare workers' training needs, workflow integration challenges, and patient communication strategies. Understanding barriers to adoption and developing effective implementation strategies will be of significant importance for translating these advances into improved patient care.

Finally, long‐term studies should assess the impact of model‐guided risk stratification on healthcare costs, resource utilisation, and patient quality of life. Economic evaluations will be essential for supporting healthcare system adoption and policy decisions regarding ML‐based decision support systems.

Our study demonstrates that integrated ML approaches combined with risk factor integration can significantly improve DKD/DN risk prediction accuracy and utilisation. The successful deployment as an interactive platform provides a solid foundation for broader implementation and continued refinement based on real‐world experience. These advances represent important steps toward personalised, data‐driven approaches to DKD/DN prevention and management.

CONCLUSIONS

5

Our study successfully developed and validated a literature‐informed ensemble‐based ML model for 3‐year DKD/DN risk prediction that significantly advances current prediction capabilities. The final stacked ensemble model achieved excellent discrimination (AUROC 0.866) and the clinical implementation achieved C‐statistic 0.852 with near‐perfect calibration (slope 0.98), translating to significant clinical utility with 22 kidney disease events prevented per 1000 patients screened. The innovative integration of literature‐informed imputed variables through Bayesian MICE expanded the model's predictive scope beyond routinely available data, while multi‐trial validation demonstrated significant generalisability across different populations and treatment approaches. Excellent algorithmic fairness across demographic subgroups ensures equitable application, while the successful deployment as an interactive web platform demonstrates practical implementation readiness. Our proposed methodology and framework provide a foundation for broader implementation of evidence‐driven risk stratification in diabetes care, with promising possibilities for adaptation across different healthcare systems and clinical domains. Further validation in different populations, settings, and healthcare systems, together with prospective clinical studies, will be important to fully realise the model's potential for improving DKD/DN prevention and patient outcomes.

AUTHOR CONTRIBUTIONS

A.M.T. contributed to conceptualisation, methodology, data curation, formal analysis, and writing of the original draft; T.J.A. contributed to data curation, validation, methodology, and review and editing of the manuscript; A.A.A. contributed to software development, data curation, formal analysis, and visualisation; I.M.Y. contributed to methodology, validation, formal analysis, and review and editing; F.S.A. contributed to software development, visualisation, data curation, and methodology; A.Y.A. contributed to modelling development, study pipeline, machine learning expertise, software development, conceptualisation, methodology, supervision, project administration, validation, formal analysis, writing of the original draft, review and editing, and correspondence for the entire clinical development, scientific development and property for the framework pipeline development. All authors read and approved the final manuscript.

FUNDING INFORMATION

This study received no specific grant from any funding agency in the public, commercial, or not‐for‐profit sectors.

CONFLICT OF INTEREST STATEMENT

The authors declare that they have no competing interests.

ETHICS STATEMENT

The study protocol was approved by the institutional review board of Prince Sultan Military Medical City (PSMMC). Informed consent was waived for this registry‐based analysis due to the retrospective nature of the study and use of de‐identified data.

CONSENT

The authors have nothing to report.

Supporting information

Table S1. Complete feature definitions and specifications.

Table S2. Individual prediction stability assessment.

Table S3. Detailed sensitivity analysis results.

Bibliography

1. Janota‐Sosińska O, Mantovani M, Irlik K, et al. Diabetic kidney disease phenotypes and the risk of cardiovascular events: the Silesia diabetes‐heart project. Cardiovasc Diabetol. 2025;24:305. doi:10.1186/s12933-025-02852-z. PMID: 40739504; PMCID: PMC12312546.
2. McDonnell T, Kalra PA, Vuilleumier N, et al. The impact of primary renal diagnosis on prognosis and the varying predictive power of albuminuria in the NURTuRE‐CKD study. Am J Nephrol. 2025;56:1‐12. doi:10.1159/000541770. PMID: 39369692; PMCID: PMC11812588.
3. Kalhan TA, Luo M, Chai JH, et al. Health economic evaluation of a risk‐stratified intervention in diabetic kidney disease. Diabetologia. 2025;68(10):2227‐2239. doi:10.1007/s00125-025-06498-0. PMID: 40739365.
4. Helou N, Dwyer A, Shaha M, Zanchi A. Multidisciplinary management of diabetic kidney disease: a systematic review and meta‐analysis. JBI Database System Rev Implement Rep. 2016;14:169‐207. doi:10.11124/jbisrir-2016-003011. PMID: 27532796.
5. Allen A, Iqbal Z, Green‐Saxena A, et al. Prediction of diabetic kidney disease with machine learning algorithms, upon the initial diagnosis of type 2 diabetes mellitus. BMJ Open Diabetes Res Care. 2022;10(1):e002560. doi:10.1136/bmjdrc-2021-002560. PMID: 35046014; PMCID: PMC8772425.
6. Jiang S, Xu L, Li C, et al. Development and validation of risk prediction models for acute kidney disease in gout patients: a retrospective study using machine learning. Eur J Med Res. 2025;30:660. doi:10.1186/s40001-025-02939-z. PMID: 40702529; PMCID: PMC12285073.
7. White N, Parsons R, Collins G, Barnett A. Evidence of questionable research practices in clinical prediction models. BMC Med. 2023;21:339. doi:10.1186/s12916-023-03048-6. PMID: 37667344; PMCID: PMC10478406.
8. Chen L, Shao X, Yu P. Machine learning prediction models for diabetic kidney disease: systematic review and meta‐analysis. Endocrine. 2024;84:890‐902. doi:10.1007/s12020-023-03637-8. PMID: 38141061.