# Calibrated, explainable machine learning on routine laboratory data to characterize diagnostic assignment patterns in rheumatic diseases: a retrospective study of 12,085 patients

**Authors:** Amal Mohamed Elmesiry, Amira Shahin Ibrahim, Hemmat A. Elabd, Basma Mohamed El Naggar, Eman E. Abd Elsalam, Mai Abd El Halim Moussa, Eman A. Rageh, Mona Mokhtar, Muhammad M. Harb, Aya H. Elshazly, Mohamed A. Khalafallah, Atef A. Hassan

PMC · DOI: 10.1186/s41927-025-00607-7 · BMC Rheumatology · 2025-12-29

## TL;DR

This retrospective study trains machine learning models on routine laboratory data from 12,085 patients to characterize how rheumatic diseases are diagnosed. The models yield calibrated, explainable probabilities but perform poorly on ankylosing spondylitis (recall 57.6%), which is most often misclassified as RA.

## Contribution

The study introduces a calibrated, explainable machine learning approach using routine lab data to characterize diagnostic patterns in rheumatic diseases.

## Key findings

- XGBoost achieved the highest test accuracy (85.48%) among models trained on routine lab data.
- Random Forest was selected for interpretation due to superior calibration, with SLE recall at 97.9% and AS recall at 57.6%.
- In seronegative patients (n = 390), HLA-B27 prevalence was higher (+6.5%; p = 0.024) and anti-La prevalence was lower (–11.6%; p = 0.001) compared with the general group.

## Abstract

Overlap in routine laboratory profiles complicates differential diagnosis of rheumatic diseases, particularly seronegative spondyloarthritis. We examined whether models trained on routine labs reproduce diagnostic assignment patterns and yield calibrated, explainable probabilities.

We analyzed a publicly available, fully de-identified dataset (n = 12,085). Adults (≥ 18 years) with confirmed diagnoses and ≤ 30% biomarker missingness were included. Nineteen routine variables (demographics, ESR/CRP, serology) plus four engineered features were used. Missing values (~14.5%) were imputed using MICE, variables were standardized, and the data were split 80/20 with stratification. Random Forest, LightGBM, XGBoost, CatBoost, and TabNet were trained with fixed, literature-informed hyperparameters. We assessed 5-fold CV, independent test performance, calibration (Brier/ECE), and SHAP; a predefined seronegative subgroup (rheumatoid factor (RF)/anti-CCP negative) was analyzed.
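The two calibration metrics named above can be illustrated with a minimal, standard-library sketch. It assumes one common convention for each: a multiclass Brier score summed over classes and averaged over samples, and a top-label ECE over equal-width confidence bins. The abstract does not state which variants or bin counts the authors used, so the function names, the default of 10 bins, and the toy probabilities are all hypothetical.

```python
from typing import Sequence


def multiclass_brier(probs: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean over samples of the squared distance between the predicted
    probability vector and the one-hot encoding of the true label."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((pk - (1.0 if k == y else 0.0)) ** 2 for k, pk in enumerate(p))
    return total / len(labels)


def expected_calibration_error(
    probs: Sequence[Sequence[float]], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Top-label ECE: sample-weighted mean of |accuracy - confidence|
    across equal-width confidence bins (n_bins is an assumed default)."""
    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hits = [0] * n_bins
    for p, y in zip(probs, labels):
        pred = max(range(len(p)), key=lambda k: p[k])  # predicted class
        conf = p[pred]                                  # its probability
        b = min(int(conf * n_bins), n_bins - 1)         # conf == 1.0 goes in the last bin
        counts[b] += 1
        conf_sums[b] += conf
        hits[b] += int(pred == y)
    n = len(labels)
    ece = 0.0
    for c, s, h in zip(counts, conf_sums, hits):
        if c:  # skip empty bins
            ece += (c / n) * abs(h / c - s / c)
    return ece


# Toy, made-up probabilities for a two-class problem (not data from the study).
toy_probs = [[0.95, 0.05], [0.65, 0.35], [0.15, 0.85], [0.75, 0.25]]
toy_labels = [0, 1, 1, 0]

brier = multiclass_brier(toy_probs, toy_labels)
ece = expected_calibration_error(toy_probs, toy_labels)
print(f"Brier = {brier:.3f}, ECE = {ece:.3f}")
```

Lower values of both metrics indicate better calibration. A model can have higher accuracy yet worse calibration than another, which is consistent with the choice of Random Forest over XGBoost for interpretation.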

XGBoost achieved the highest test accuracy (85.48%); Random Forest (83.78%) was selected for detailed interpretation due to superior calibration. Performance varied by disease: recall was 97.9% for SLE but only 57.6% for ankylosing spondylitis (AS). Among 2,417 test cases, 381 (15.76%) were misclassified; the most frequent error was AS misclassified as RA (109; 28.6% of all errors). SHAP ranked ESR/CRP, rheumatoid factor/anti-CCP, HLA-B27, and C3/C4 as dominant contributors. In seronegative patients (n = 390), the prevalence of HLA-B27 was higher (+6.5%; p = 0.024), and the prevalence of anti-La was lower (–11.6%; p = 0.001).
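The percentages in this paragraph can be reproduced from the reported counts. The short sketch below uses only numbers stated in the abstract; the helper name `share` is hypothetical. It also makes the denominators explicit: 15.76% is relative to the test set, while 28.6% is relative to the misclassified cases.

```python
def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


# Counts as reported in the abstract.
n_total = 12085          # patients in the dataset
n_test = 2417            # held-out test cases
n_misclassified = 381    # test cases with a wrong predicted diagnosis
n_as_to_ra = 109         # AS cases predicted as RA

test_fraction = share(n_test, n_total)            # size of the test split
error_rate = share(n_misclassified, n_test)       # share of test cases misclassified
as_ra_share = share(n_as_to_ra, n_misclassified)  # share of errors that are AS -> RA

print(f"test split:  {test_fraction:.1f}% of the dataset")
print(f"error rate:  {error_rate:.2f}% of test cases")
print(f"AS -> RA:    {as_ra_share:.1f}% of all misclassifications")
```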

Routine laboratory data can be converted into calibrated, explainable probabilities that characterize diagnostic assignment patterns; these probabilities should not be read as independent diagnostic predictions. Given poor AS performance, the approach is not reliable for differentiating spondyloarthropathies from RA without additional clinical or imaging data. External/temporal validation, integration of clinical and imaging features, and prospective evaluation are needed.


The online version contains supplementary material available at 10.1186/s41927-025-00607-7.

## Linked entities

- **Diseases:** ankylosing spondylitis (MONDO:0005306), SLE (MONDO:0007915), RA (MONDO:0005272)

## Full-text entities

- **Diseases:** rheumatic diseases (MESH:D012216)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12849087/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12849087/full.md

## References

1 reference — full list in the complete paper: https://tomesphere.com/paper/PMC12849087/full.md

---
Source: https://tomesphere.com/paper/PMC12849087