SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model

Guankun Wang, Junyi Wang, Wenjin Mo, Long Bai, Kun Yuan, Ming Hu, Jinlin Wu, Junjun He, Yiming Huang, Nicolas Padoy, Zhen Lei, Hongbin Liu, Nassir Navab, and Hongliang Ren

arXiv:2506.17873·cs.CV·February 4, 2026

TL;DR

SurgVidLM is a video language model for both full-video and fine-grained understanding of surgical videos. It is trained on a new dataset, SVU-31K, and uses multi-frequency fusion attention to improve scene perception and task analysis.

Contribution

This work presents SurgVidLM, which the authors describe as the first video language model for multi-grained surgical video understanding. It is supported by SVU-31K, a dataset of over 31K video-instruction pairs, and by multi-frequency fusion attention.

Findings

01 Outperforms existing models in full and fine-grained surgical video tasks

02 Demonstrates superior ability to capture complex surgical contexts

03 Provides a new dataset for surgical video analysis

Abstract

Surgical scene understanding is critical for surgical training and robotic decision-making in robot-assisted surgery. Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated great potential for advancing scene perception in the medical domain, helping surgeons understand surgical scenes and procedures. However, these methods are primarily oriented towards image-based analysis or global video understanding, overlooking the fine-grained video reasoning that is crucial for analyzing specific processes and capturing detailed task execution within a surgical procedure. To bridge this gap, we propose SurgVidLM, the first video language model designed to address both full and fine-grained surgical video comprehension. To train SurgVidLM, we construct SVU-31K, a large-scale dataset with over 31K video-instruction pairs, enabling both holistic…
