Evaluating Large Vision-language Models for Surgical Tool Detection

Nakul Poudel, Richard Simon, and Cristian A. Linte

arXiv:2601.16895 · cs.CV · January 26, 2026

Open Access

TL;DR

This paper systematically evaluates large vision-language models (VLMs) for surgical tool detection and finds that Qwen2.5 outperforms the other evaluated VLMs in both zero-shot and fine-tuned settings.

Contribution

It provides a systematic assessment of VLMs for surgical tool detection, highlighting Qwen2.5's stronger detection performance and zero-shot generalization relative to the other evaluated models.

Findings

01. Qwen2.5 achieves the best detection accuracy among the evaluated VLMs.

02. Qwen2.5 shows stronger zero-shot generalization than Grounding DINO.

03. Grounding DINO excels in localization tasks.

Abstract

Surgery is a highly complex process, and artificial intelligence has emerged as a transformative force in supporting surgical guidance and decision-making. However, the unimodal nature of most current AI systems limits their ability to achieve a holistic understanding of surgical workflows. This highlights the need for general-purpose surgical AI systems capable of comprehensively modeling the interrelated components of surgical scenes. Recent advances in large vision-language models (VLMs) that integrate multimodal data processing offer strong potential for modeling surgical tasks and providing human-like scene reasoning and understanding. Despite their promise, systematic investigations of VLMs in surgical applications remain limited. In this study, we evaluate the effectiveness of large VLMs for the fundamental surgical vision task of detecting surgical tools. Specifically, we investigate…

Taxonomy

Topics: Multimodal Machine Learning Applications · Surgical Simulation and Training · Advanced Neural Network Applications