GroundedSurg: A Multi-Procedure Benchmark for Language-Conditioned Surgical Tool Segmentation

Tajamul Ashraf, Abrar Ul Riyaz, Wasif Tak, Tavaheed Tariq, Sonia Yadav, Moloud Abdar, and Janibul Bashir

arXiv:2603.01108 · cs.CV · March 3, 2026

Open Access

TL;DR

GroundedSurg introduces a novel benchmark for evaluating language-conditioned, instance-level surgical tool segmentation across diverse procedures, emphasizing realistic clinical scenarios and the integration of vision-language reasoning.

Contribution

This work presents the first comprehensive dataset and benchmark for language-conditioned surgical grounding, enabling evaluation of models on instance-level localization with natural language descriptions.

Findings

01

Significant performance gaps in current models highlight the need for improved vision-language reasoning.

02

The benchmark covers diverse surgical procedures and instrument types, reflecting real-world complexity.

03

Evaluation reveals challenges in integrating linguistic and visual understanding in surgical AI.

Abstract

Clinically reliable perception of surgical scenes is essential for advancing intelligent, context-aware intraoperative assistance such as instrument handoff guidance, collision avoidance, and workflow-aware robotic support. Existing surgical tool benchmarks primarily evaluate category-level segmentation, requiring models to detect all instances of predefined instrument classes. However, real-world clinical decisions often require resolving references to a specific instrument instance based on its functional role, spatial relation, or anatomical interaction, capabilities not captured by current evaluation paradigms. We introduce GroundedSurg, the first language-conditioned, instance-level surgical grounding benchmark. Each instance pairs a surgical image with a natural-language description targeting a single instrument, accompanied by structured spatial grounding annotations including…
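The difference between category-level and instance-level evaluation described in the abstract can be illustrated with a toy example. The sketch below is not the benchmark's data format or metric: the record fields, the pixel-set mask representation, and the use of intersection over union (IoU) as the score are all assumptions made for illustration. It shows that a prediction covering every instrument of a class scores poorly when the description refers to only one of them.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Pixel = Tuple[int, int]  # (row, col)


@dataclass(frozen=True)
class GroundingInstance:
    """Hypothetical record: one image, one description, one target instrument."""
    image_id: str
    description: str
    target_mask: FrozenSet[Pixel]  # assumed representation, not the paper's


def mask_iou(pred: FrozenSet[Pixel], target: FrozenSet[Pixel]) -> float:
    """Intersection over union of two pixel sets; 1.0 when both are empty."""
    union = pred | target
    if not union:
        return 1.0
    return len(pred & target) / len(union)


def rect(top: int, left: int, height: int, width: int) -> FrozenSet[Pixel]:
    """Axis-aligned rectangle of pixels, used here as a stand-in for a tool mask."""
    return frozenset(
        (r, c)
        for r in range(top, top + height)
        for c in range(left, left + width)
    )


# Two instruments of the same class in one toy frame (6 pixels each, disjoint).
left_grasper = rect(0, 0, 2, 3)
right_grasper = rect(0, 5, 2, 3)

sample = GroundingInstance(
    image_id="frame_0001",  # made-up identifier
    description="the grasper on the left side of the frame",
    target_mask=left_grasper,
)

# A category-level model returns every grasper; an instance-level model
# returns only the one the description refers to.
category_level_pred = left_grasper | right_grasper
instance_level_pred = left_grasper

iou_category = mask_iou(category_level_pred, sample.target_mask)  # 6 / 12
iou_instance = mask_iou(instance_level_pred, sample.target_mask)  # 6 / 6

print(f"category-level IoU: {iou_category:.2f}")
print(f"instance-level IoU: {iou_instance:.2f}")
```

In this toy frame the category-level prediction reaches an IoU of only 0.5 against the referred instrument, while the instance-level prediction reaches 1.0.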

Taxonomy

Topics: Surgical Simulation and Training · Multimodal Machine Learning Applications · Soft Robotics and Applications