# Artificial intelligence versus classical scoring systems: a comparative analysis of stone-free prediction after percutaneous nephrolithotomy

**Authors:** Burak Elmaağaç, Ali Yasin Özercan, Abdullah Gölbaşı, Hüseyin Biçer, Ercan Arslan, Mert Ali Karadağ

PMC · DOI: 10.1007/s00240-026-01946-x · 2026-01-30

## TL;DR

This study compares traditional stone scoring systems with a ChatGPT-based model in predicting stone-free outcomes after percutaneous nephrolithotomy.

## Contribution

The study evaluates ChatGPT's predictive performance against established scoring systems for the first time in endourology.

## Key findings

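- Guy’s Stone Score, S.T.O.N.E., and S-ReSC values were significantly lower in stone-free patients than in those with residual stones (all p < 0.001), whereas the ChatGPT-based predicted probability did not differ between the groups (p = 0.549).
- The ChatGPT-based model showed limited predictive accuracy and failed to provide reliable stone-free estimates.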
- Guy’s Stone Score and S.T.O.N.E. were identified as independent predictors of surgical success.

## Abstract

This study aimed to compare the predictive performance of traditional stone scoring systems with that of a ChatGPT-based large language model in estimating stone-free rates following percutaneous nephrolithotomy. A total of 340 patients who underwent the procedure between 2019 and 2025 were retrospectively analyzed. Preoperative stone complexity was evaluated using four established scoring systems: Guy’s Stone Score, the CROES nomogram, the S.T.O.N.E. nephrolithometry score, and the Seoul National University Renal Stone Complexity (S-ReSC) score. Each case was additionally processed through a ChatGPT-based prediction model. The predictions of each method were compared with actual postoperative results using correlation analysis and multivariate regression. The overall stone-free rate was 60.9%. Patients who achieved stone-free status had significantly lower Guy’s Stone Score, S.T.O.N.E., and S-ReSC values than those with residual stones (all p < 0.001). In contrast, neither the CROES nomogram (p = 0.19) nor the ChatGPT-based predicted stone-free probability (p = 0.549) differed significantly between the two groups. Univariate analysis showed that higher Guy’s Stone Score, S.T.O.N.E., and S-ReSC values were associated with stone-free failure. Multivariate analysis identified Guy’s Stone Score and the S.T.O.N.E. score as independent predictors of surgical success. The ChatGPT-based model showed limited predictive performance and failed to provide reliable estimates of stone-free rates in this cohort. These findings support the continued clinical utility of conventional scoring systems while emphasizing the need for further development and validation of artificial intelligence models. Large language models must be trained on structured clinical datasets and externally validated before their integration into surgical decision-making processes in endourology.

- Guy’s Stone Score (GSS), S.T.O.N.E., and S-ReSC scores predict the stone-free rate better than the CROES nomogram and the ChatGPT-based model.
- The ChatGPT-based model showed weak accuracy in predicting stone-free status.
- GSS and S.T.O.N.E. were independent predictors in multivariate analysis.
- The CROES nomogram and ChatGPT were not statistically significant predictors.
- AI tools need refinement and validation for endourologic outcome prediction.
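
The abstract reports group comparisons and regression but does not show how a single score's discrimination might be quantified. As a minimal sketch of one common way to do this, the Python below computes a rank-based AUC: the probability that a randomly chosen stone-free patient has a lower complexity score than a randomly chosen patient with residual stones. This measure is an illustrative substitute, not the analysis described in the paper, and the function name `discrimination_auc` and all values in the toy data are hypothetical, not drawn from the study.

```python
def discrimination_auc(scores, stone_free):
    """Rank-based AUC for a complexity score where LOWER means better outcome.

    Returns the fraction of (stone-free, residual-stone) patient pairs in which
    the stone-free patient has the lower score; ties count as half a pair.
    0.5 means no discrimination, 1.0 means perfect separation.
    """
    free = [s for s, ok in zip(scores, stone_free) if ok]
    residual = [s for s, ok in zip(scores, stone_free) if not ok]
    if not free or not residual:
        raise ValueError("need at least one patient in each outcome group")

    wins = 0.0
    for f in free:
        for r in residual:
            if f < r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(free) * len(residual))


# Hypothetical toy data for illustration only (NOT from the study):
# an ordinal complexity grade per patient and whether they ended stone-free.
toy_grades = [1, 1, 2, 2, 3, 3, 4, 4]
toy_stone_free = [True, True, True, False, True, False, False, False]

toy_auc = discrimination_auc(toy_grades, toy_stone_free)
print(f"toy AUC = {toy_auc:.3f}")
```

For a predictor where higher values mean a better expected outcome, such as a predicted stone-free probability, the same function can be applied to the negated values. A result near 0.5 would correspond to a predictor that does not separate the two outcome groups.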

## Full-text entities

- **Diseases:** infection (MESH:D007239), hydronephrosis (MESH:D006869), Urinary stone disease (MESH:D014545), GSS (MESH:D007669), hypertension (MESH:D006973), stone-free failure (MESH:D051437), sepsis (MESH:D018805), hemorrhage (MESH:D006470), urological condition (MESH:D014570), diabetes mellitus (MESH:D003920), urolithiasis (MESH:D052878), postoperative complication (MESH:D011183), pelvicalyceal abnormalities (MESH:D000014), staghorn stones (MESH:D000069856), calculi (MESH:D002137)
- **Chemicals:** creatinine (MESH:D003404)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC12858549/full.md

---
Source: https://tomesphere.com/paper/PMC12858549