# Explainable machine learning for early detection of Escherichia coli urinary tract infections: integrating SHAP interpretation and bacterial epidemiology

**Authors:** Jie Zhang, Ying-Ying Jiang, Ying Zhu, Chu-Ying Pan, Ling-Hui Yao, Ying-Ying Zheng, Shi-Yan Zhang, Jinbao Shi

PMC · DOI: 10.3389/fcimb.2026.1740707 · Frontiers in Cellular and Infection Microbiology · 2026-02-13

## TL;DR

This study developed an explainable machine learning model that uses routine clinical data to distinguish E. coli from non–E. coli urinary tract infections while culture results are still pending.

## Contribution

The main contribution is the integration of SHAP interpretation with bacterial epidemiology for rapid, culture-independent detection of E. coli UTIs.

## Key findings

- E. coli was the most common uropathogen, found in 51.3% of UTI cases.
- The Random Forest model achieved moderate discrimination (ROC-AUC = 0.66) using routine lab variables.
- SHAP analysis identified sex, lymphocyte count, and ALT as key predictors for E. coli UTI.
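
To make the idea behind the SHAP ranking concrete, the sketch below computes exact Shapley values for a toy three-feature score by enumerating feature coalitions. It is only an illustration of the underlying definition: it is not the SHAP library and not the authors' Random Forest. The function `toy_score`, its coefficients, and the `baseline` and `patient` values are all hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x; 'absent' features take their baseline value."""
    n = len(x)

    def value(subset):
        # Evaluate f with features in `subset` taken from x, the rest from baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = set(coalition)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


# Hypothetical toy score over (sex, LYM, ALT); the coefficients are made up.
def toy_score(z):
    sex, lym, alt = z
    return 0.3 * sex + 0.2 * lym + 0.01 * alt + 0.1 * sex * lym


baseline = [0.0, 1.5, 20.0]  # hypothetical reference patient
patient = [1.0, 2.5, 40.0]   # hypothetical patient to explain
phi = shapley_values(toy_score, patient, baseline)
print(dict(zip(["sex", "LYM", "ALT"], phi)))
```

The attributions always sum to `toy_score(patient) - toy_score(baseline)`. Because of the interaction term, the contribution assigned to one feature depends on the patient's other values, which is the kind of patient-level variation a SHAP analysis can surface.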

## Abstract

Escherichia coli is the predominant uropathogen in urinary tract infections (UTIs), but culture-based identification is time-consuming. This study aimed to develop an explainable, culture-independent model to distinguish E. coli from other uropathogens using routinely collected clinical data.

We retrospectively analyzed 308 hospitalized patients with culture-confirmed UTIs at Fuding Hospital, Fujian University of Traditional Chinese Medicine (January–December 2023), classified as E. coli (n = 158) or non–E. coli (n = 150). Species identification was performed using an automated microbiology system. Nineteen predictors (sex, urinary leukocyte grade, and 17 routine laboratory variables) were used. Associations with E. coli UTI were examined using univariate and multivariable logistic regression. A Random Forest (RF) classifier was developed with SHapley Additive exPlanations (SHAP) for interpretability. Data were split using a stratified 70/30 train–test split; 5-fold stratified cross-validation within the training set was used for hyperparameter tuning, and final performance (discrimination and calibration) was reported on the held-out test set. RF was additionally benchmarked against regularized logistic regression, calibrated linear SVM, and gradient boosting using the same protocol.
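
The evaluation protocol described above (stratified 70/30 split, 5-fold stratified cross-validation for tuning inside the training set, one final evaluation on the held-out test set) can be sketched with scikit-learn. This is a minimal illustration, not the authors' code: the data are synthetic stand-ins from `make_classification`, and the hyperparameter grid, the random seeds, and the use of ROC-AUC as the tuning metric are assumptions that the abstract does not specify.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in with the same shape as the cohort: 308 rows, 19 predictors.
# These are NOT the study data.
X, y = make_classification(
    n_samples=308, n_features=19, n_informative=5, random_state=0
)

# Stratified 70/30 split keeps the class ratio similar in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# 5-fold stratified CV on the training set only, used for hyperparameter tuning.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
param_grid = {"n_estimators": [100, 200], "max_depth": [3, None]}  # assumed grid
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # assumed tuning metric
    cv=cv,
)
search.fit(X_train, y_train)

# The test set is touched once, after tuning, with the refitted best model.
test_proba = search.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, test_proba)
print(search.best_params_, round(test_auc, 3))
```

Keeping the tuning loop inside the training partition means the reported test-set score is not influenced by hyperparameter selection.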

E. coli accounted for 51.3% of isolates, followed by Enterococcus spp. (18.5%) and Klebsiella spp. (7.8%). Compared with non–E. coli cases, E. coli infections were more common in females and showed higher lymphocyte counts (LYM), alanine aminotransferase (ALT), and albumin (ALB) (all P < 0.05). Multivariable logistic regression identified sex, LYM, and urinary leukocyte grade as independent predictors. On the held-out test set, RF achieved moderate discrimination (ROC-AUC = 0.66; average precision = 0.66), with calibration assessed by Brier score and calibration slope. SHAP highlighted sex, LYM, and ALT as the most influential predictors and revealed patient-level heterogeneity in feature effects.
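
For readers less familiar with the metrics named above, the following sketch shows how ROC-AUC, the Brier score, and the calibration slope can be computed from predicted probabilities. It assumes the common definitions (AUC as the probability that a positive case outranks a negative one, Brier score as mean squared error of the probabilities, calibration slope as the coefficient from a logistic regression of the outcome on the logit of the prediction). The abstract does not state how the authors computed them, and all function names and demo values here are illustrative.

```python
import numpy as np


def roc_auc(y_true, scores):
    """Probability that a random positive scores above a random negative (ties = 0.5)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg)))


def brier_score(y_true, p):
    """Mean squared difference between predicted probability and observed outcome."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.mean((p - y) ** 2))


def calibration_slope(y_true, p, n_iter=50, eps=1e-6):
    """Slope b in logit(P(y=1)) = a + b * logit(p), fitted by Newton's method."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    X = np.column_stack([np.ones_like(p), np.log(p / (1 - p))])
    beta = np.zeros(2)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - mu)
        hess = X.T @ (X * (mu * (1 - mu))[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return float(beta[1])


# Tiny hand-checkable example (illustrative values only).
y_demo = [0, 0, 1, 1]
p_demo = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_demo, p_demo))  # 0.75: three of four positive/negative pairs are ordered correctly
print(brier_score(y_demo, p_demo))

# Simulated, well-calibrated probabilities: the slope should be close to 1.
rng = np.random.default_rng(0)
p_sim = rng.uniform(0.05, 0.95, size=20000)
y_sim = (rng.uniform(size=20000) < p_sim).astype(int)
slope = calibration_slope(y_sim, p_sim)
print(slope)
```

A slope near 1 indicates predictions that are neither too extreme nor too conservative; a slope below 1 suggests probabilities that are too extreme, which is often a sign of overfitting.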

E. coli remains the predominant pathogen among hospitalized UTIs. An explainable RF model using routine laboratory variables provided moderate, reproducible discrimination of E. coli vs non–E. coli UTIs and may support earlier decision-making while awaiting culture results.

## Linked entities

- **Species:** Escherichia coli (taxon 562)

## Full-text entities

- **Genes:** CRP [NCBI Gene 20468888]
- **Diseases:** febrile infections (MESH:D007239), bacteremia (MESH:D016470), E. coli infection (MESH:D004927), UTIs (MESH:D014552), thrombotic (MESH:D013927), HGB (MESH:D006445), human immunodeficiency virus (HIV) infection (MESH:D015658), lymphocytosis (MESH:D008218), sepsis (MESH:D018805), bacterial infections (MESH:D001424), hepatic injury (MESH:D056486), ALB (OMIM:194470), hepatic or renal failure (MESH:D017093), Klebsiella pneumoniae (MESH:D007710), malignancy (MESH:D009369), Inflammatory (MESH:D007249), pyelonephritis (MESH:D011704)
- **Chemicals:** FP (-), PCT (MESH:D011080), GLU (MESH:D005947), citrate (MESH:D019343), bilirubin (MESH:D001663), EDTA (MESH:D004492), Uric Acid (MESH:D014527), CHO (MESH:D002784)
- **Species:** Homo sapiens (human, species) [taxon 9606], Enterobacteriaceae (enterobacteria, family) [taxon 543], Acinetobacter baumannii (species) [taxon 470], Escherichia coli (E. coli, species) [taxon 562]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12946121/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12946121/full.md

## References

25 references — full list in the complete paper: https://tomesphere.com/paper/PMC12946121/full.md

---
Source: https://tomesphere.com/paper/PMC12946121