Challenging Vision-Language Models with Surgical Data: A New Dataset and Broad Benchmarking Study

Leon Mayer; Tim Rädsch; Dominik Michael; Lucas Luttner; Amine Yamlahi; Evangelia Christodoulou; Patrick Godau; Marcel Knopp; Annika Reinke; Fiona Kolbinger; Lena Maier-Hein

arXiv:2506.06232 · cs.CV · July 9, 2025

Open Access

TL;DR

This study evaluates the capabilities of vision-language models in endoscopic surgery, revealing strengths in basic perception tasks but limitations in medical knowledge and highlighting the need for specialized model development.

Contribution

First large-scale benchmarking of VLMs on surgical data, comparing generalist and medical models across basic and advanced endoscopic tasks.

Findings

01. VLMs perform well on basic perception tasks like object counting.

02. Performance drops on tasks requiring medical knowledge.

03. Specialized medical VLMs underperform compared to generalist models.

Abstract

While traditional computer vision models have historically struggled to generalize to endoscopic domains, the emergence of foundation models has shown promising cross-domain performance. In this work, we present the first large-scale study assessing the capabilities of Vision Language Models (VLMs) for endoscopic tasks with a specific focus on laparoscopic surgery. Using a diverse set of state-of-the-art models, multiple surgical datasets, and extensive human reference annotations, we address three key research questions: (1) Can current VLMs solve basic perception tasks on surgical images? (2) Can they handle advanced frame-based endoscopic scene understanding tasks? and (3) How do specialized medical VLMs compare to generalist models in this context? Our results reveal that VLMs can effectively perform basic surgical perception tasks, such as object counting and localization, with…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Surgical Simulation and Training