# Multimodal Large Language Models for Cystoscopic Image Interpretation and Bladder Lesion Classification: Comparative Study

**Authors:** Yung-Chi Shih, Cheng-Yang Wu, Shi-Wei Huang, Chung-You Tsai

PMC · DOI: 10.2196/87193 · 2026-01-28

## TL;DR

This study compares multimodal large language models for interpreting cystoscopy images and classifying bladder lesions, finding that OpenAI-o3 performs best but still needs improvement for complex diagnoses.

## Contribution

The study evaluates four MM-LLMs on clinician-defined stress-test datasets enriched with rare, diverse, and challenging lesions, assessing both free-text image interpretation (n=401) and seven-class tumor-like lesion classification (n=113).

## Key findings

- OpenAI-o3 achieved the highest lesion detection accuracy (88.3%) and the most balanced sensitivity-specificity trade-off in tumor-like lesion classification.
- In-context learning improved OpenAI-o3's microaverage accuracy from 40.7% to 46.0%.
- MM-LLMs show potential for assisting in cystoscopy interpretation but require optimization for complex differential diagnoses.

## Abstract

Cystoscopy remains the gold standard for diagnosing bladder lesions; however, its diagnostic accuracy is operator dependent and prone to missing subtle abnormalities such as carcinoma in situ or misinterpreting mimic lesions (tumor, inflammation, or normal variants). Artificial intelligence–based image-analysis systems are emerging, yet conventional models remain limited to single tasks and cannot produce explanatory reports or articulate diagnostic reasoning. Multimodal large language models (MM-LLMs) integrate visual recognition, contextual reasoning, and language generation, offering interpretive capabilities beyond conventional artificial intelligence.

This study aims to rigorously evaluate state-of-the-art MM-LLMs for cystoscopic image interpretation and lesion classification using clinician-defined stress-test datasets enriched with rare, diverse, and challenging lesions, focusing on diagnostic accuracy, reasoning quality, and clinical relevance.

Four MM-LLMs (OpenAI-o3 and ChatGPT-4o [OpenAI]; Gemini 2.5 Pro and MedGemma-27B [Google]) were evaluated under blinded, randomized procedures across two tasks: (1) free-text image interpretation for anatomic site, findings, lesion reasoning, and final diagnosis (n=401) and (2) seven-class tumor-like lesion classification (n=113) within a multiple-choice framework (cystitis, polyps, papilloma, papillary urothelial carcinoma, carcinoma in situ, non-urothelial carcinoma, and none of the above). Three raters independently scored outputs using a 5-point Likert scale, and classification metrics (accuracy, sensitivity, specificity, Youden J index [Youden J], and Matthews correlation coefficient [MCC]) were calculated for lesion detection, biopsy indication, and malignancy endpoints. For optimization, model performance was compared between zero-shot and text-based in-context learning prompts that were prefixed with brief descriptions of tumor features.
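The difference between the zero-shot and text-based in-context learning conditions can be illustrated with a minimal sketch. The seven class labels come from the abstract; the prompt wording, the `build_prompt` helper, and the `FEATURE_NOTES` placeholder are assumptions made for illustration and are not the study's actual prompts or feature descriptions.

```python
# Seven answer options named in the abstract's multiple-choice framework.
CLASSES = [
    "cystitis",
    "polyps",
    "papilloma",
    "papillary urothelial carcinoma",
    "carcinoma in situ",
    "non-urothelial carcinoma",
    "none of the above",
]

# Hypothetical placeholder: the study's real feature descriptions are not
# reproduced in this summary, so no clinical content is invented here.
FEATURE_NOTES = {
    "cystitis": "<brief description of typical cystoscopic features>",
}


def build_prompt(feature_notes=None):
    """Return a zero-shot prompt, or an in-context prompt when notes are given."""
    # Label the options A through G.
    options = "\n".join(
        f"{chr(ord('A') + i)}. {name}" for i, name in enumerate(CLASSES)
    )
    question = (
        "Classify the lesion in the attached cystoscopy image. "
        "Answer with a single letter.\n" + options
    )
    if not feature_notes:
        return question  # zero-shot: the question and options only
    # In-context variant: text-only feature descriptions are prefixed.
    prefix = "\n".join(f"- {name}: {note}" for name, note in feature_notes.items())
    return "Reference descriptions of lesion features:\n" + prefix + "\n\n" + question


zero_shot_prompt = build_prompt()
in_context_prompt = build_prompt(FEATURE_NOTES)
```

In this sketch the in-context prompt adds text only; no example images are paired with the descriptions.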

The 401-image test set spanned 40 subcategories, with 322 (80.3%) containing abnormal findings in the image interpretation task. OpenAI-o3 demonstrated strong reasoning, with high satisfaction for anatomy (339/401, 84.5%) and findings (305/401, 76%), but lower satisfaction for lesion reasoning (211/401, 52.5%) and final diagnosis (193/401, 48.2%), indicating increasing difficulty with higher-order synthesis. Mean Likert score differences (OpenAI-o3 minus Gemini 2.5 Pro) were +0.27 for findings (adjusted P value: q=0.002), +0.24 for lesion reasoning (q=0.047), and +0.19 for final diagnosis. For clinically relevant endpoints in the full set, OpenAI-o3 achieved the most balanced performance, with lesion detection accuracy of 88.3%, sensitivity of 92%, specificity of 73.1%, Youden J of 0.650, and MCC of 0.635. In 7-class tumor-like lesion classification, OpenAI-o3 achieved accuracies of 73.5% for biopsy indication and 62.8% for malignancy, with a balanced sensitivity-specificity trade-off, outperforming other models. Notably, OpenAI-o3 performed best on prevalent malignant lesions. ChatGPT-4o and Gemini 2.5 Pro showed high sensitivity but low specificity, whereas MedGemma-27B underperformed. In-context learning improved OpenAI-o3 microaverage accuracy (40.7%→46.0%; MCC 0.311→0.370) but yielded only slight specificity gains and minimal accuracy change in other models, likely constrained by the absence of paired image-text context.

MM-LLMs demonstrate meaningful assistive potential in generating interpretable cystoscopy free-text rationales and supporting biopsy triage and training. However, performance in difficult differential diagnoses remains modest and requires further optimization before safe clinical integration.

## Linked entities

- **Diseases:** carcinoma in situ (MONDO:0004647), cystitis (MONDO:0006032)

## Full-text entities

- **Diseases:** cystitis (MESH:D003556), inflammation (MESH:D007249), polyps (MESH:D011127), lesion (MESH:D009059), carcinoma in situ (MESH:D002278), malignancy (MESH:D009369), non-urothelial carcinoma (MESH:D014523), Bladder Lesion (MESH:D001745), papillary urothelial carcinoma (MESH:D002291), papilloma (MESH:D010212)

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12895159/full.md

---
Source: https://tomesphere.com/paper/PMC12895159