SurgXBench: Explainable Vision-Language Model Benchmark for Surgery

Jiajun Cheng; Xianwu Zhao; Sainan Liu; Xiaofan Yu; Ravi Prakash; Patrick J. Codd; Jonathan Elliott Katz; Shan Lin

arXiv:2505.10764·cs.CV·July 24, 2025

TL;DR

This paper benchmarks the zero-shot performance of vision-language models in robotic surgery, introduces explainability tools to interpret model decisions, and highlights the need for improved visual reasoning in surgical AI systems.

Contribution

It provides a comprehensive benchmark of advanced VLMs on surgical datasets and integrates explainability analysis to assess model reliability and limitations.

Findings

1. VLMs often rely on weak contextual cues rather than visual evidence.

2. Benchmark reveals limited zero-shot generalization in surgical VLMs.

3. Explainability metrics uncover causal explanations behind model predictions.

Abstract

Innovations in digital intelligence are transforming robotic surgery with more informed decision-making. Real-time awareness of surgical instrument presence and actions (e.g., cutting tissue) is essential for such systems. Yet, despite decades of research, most machine learning models for this task are trained on small datasets and still struggle to generalize. Recently, Vision-Language Models (VLMs) have brought transformative advances in reasoning across visual and textual modalities. Their unprecedented generalization capabilities suggest great potential for advancing intelligent robotic surgery. However, surgical VLMs remain under-explored, and existing models show limited performance, highlighting the need for benchmark studies to assess their capabilities and limitations and to inform future development. To this end, we benchmark the zero-shot performance of several advanced VLMs…

Taxonomy

Topics: Radiomics and Machine Learning in Medical Imaging · Colorectal Cancer Screening and Detection