Evaluating Large Vision-language Models for Surgical Tool Detection
Nakul Poudel, Richard Simon, and Cristian A. Linte

TL;DR
This paper systematically evaluates large vision-language models (VLMs) for surgical tool detection and finds that Qwen2.5 outperforms the other evaluated VLMs in both zero-shot and fine-tuned settings.
Contribution
The paper provides a comprehensive assessment of VLMs for surgical tool detection, highlighting Qwen2.5's stronger detection performance and generalization relative to the other evaluated models.
Findings
Qwen2.5 achieves the best detection accuracy among evaluated VLMs.
Qwen2.5 shows stronger zero-shot generalization than Grounding DINO.
Grounding DINO excels in localization tasks.
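The findings above separate two things a detector must get right: recognizing which tool is present and localizing where it is. As a minimal sketch of how the two can be scored separately, the Python below computes intersection over union (IoU), the standard overlap measure for bounding boxes, and counts a detection as correct only when the label matches and the IoU clears a threshold. This summary does not state which metric or threshold the authors used, so the function names, the 0.5 threshold, and the "grasper" label are illustrative assumptions, not the paper's protocol.

```python
from typing import Tuple

# Axis-aligned box as (x_min, y_min, x_max, y_max); this layout is an assumption.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlap rectangle (may be empty).
    ix_min = max(a[0], b[0])
    iy_min = max(a[1], b[1])
    ix_max = min(a[2], b[2])
    iy_max = min(a[3], b[3])

    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct_detection(
    pred_label: str,
    pred_box: Box,
    gt_label: str,
    gt_box: Box,
    iou_threshold: float = 0.5,  # hypothetical threshold, not taken from the paper
) -> bool:
    """A prediction counts only if the class is right AND the box overlaps enough."""
    return pred_label == gt_label and iou(pred_box, gt_box) >= iou_threshold


if __name__ == "__main__":
    # Hypothetical example: right tool class, slightly oversized box.
    print(iou((0, 0, 10, 10), (0, 0, 10, 8)))
    print(is_correct_detection("grasper", (0, 0, 10, 10), "grasper", (0, 0, 10, 8)))
```

Under a scheme like this, a model can fail by naming the wrong tool even with a well-placed box, or by naming the right tool with a poorly placed box, which is one way to read the contrast between recognition and localization drawn in the findings.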
Abstract
Surgery is a highly complex process, and artificial intelligence has emerged as a transformative force in supporting surgical guidance and decision-making. However, the unimodal nature of most current AI systems limits their ability to achieve a holistic understanding of surgical workflows. This highlights the need for general-purpose surgical AI systems capable of comprehensively modeling the interrelated components of surgical scenes. Recent advances in large vision-language models (VLMs) that integrate multimodal data processing offer strong potential for modeling surgical tasks and providing human-like scene reasoning and understanding. Despite their promise, systematic investigations of VLMs in surgical applications remain limited. In this study, we evaluate the effectiveness of large VLMs for the fundamental surgical vision task of detecting surgical tools. Specifically, we investigate…
Taxonomy
Topics: Multimodal Machine Learning Applications · Surgical Simulation and Training · Advanced Neural Network Applications
