Towards Unified Surgical Scene Understanding: Bridging Reasoning and Grounding via MLLMs
Jincai Huang, Shihao Zou, Yuchen Guo, Jingjing Li, Wei Ji, Kai Wang, Shanshan Wang, Weixin Si

TL;DR
This paper introduces SurgMLLM, a unified framework that combines high-level reasoning and pixel-level grounding for comprehensive surgical scene understanding, advancing the accuracy and consistency of surgical scene interpretation.
Contribution
The paper presents SurgMLLM, the first model to jointly perform reasoning and grounding in surgical scenes, trained end-to-end with a new dataset extension for pixel-level annotations.
Findings
Improves triplet recognition AP_IVT from 40.7% to 46.0%.
Outperforms prior methods in phase recognition.
Achieves accurate pixel-wise grounding of instruments and targets.
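For readers unfamiliar with the AP_IVT figure above, the following is a minimal sketch of how a mean average precision over triplet classes could be computed. It assumes AP_IVT is the mean of per-class average precision over instrument-verb-target triplet classes, which this summary does not confirm; the paper's exact evaluation protocol may differ. The function names `average_precision` and `mean_ap` are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AP for one class: mean of the precision at each rank holding a positive."""
    # Rank frames by descending confidence (ties keep their input order).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            precision_sum += hits / rank
    # A class with no positive frames has undefined AP.
    return precision_sum / hits if hits else float("nan")


def mean_ap(scores_per_class, labels_per_class) -> float:
    """Mean AP over classes, skipping classes whose AP is undefined (assumed convention)."""
    aps = [average_precision(s, l) for s, l in zip(scores_per_class, labels_per_class)]
    valid = [ap for ap in aps if ap == ap]  # NaN is the only value not equal to itself
    return sum(valid) / len(valid) if valid else float("nan")
```

Each inner list holds one triplet class's per-frame confidence scores or binary ground-truth labels. Skipping classes with no positive frames is one common convention, assumed here rather than taken from the paper.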
Abstract
Surgical scene understanding is a cornerstone of computer-assisted intervention. While recent advances, particularly in surgical image segmentation, have driven progress, real-world clinical applications require a more holistic understanding that jointly captures procedural context, semantic reasoning, and precise visual grounding. However, existing approaches typically address these components in isolation, leading to fragmented representations and limited semantic consistency. To address this limitation, we propose SurgMLLM, a unified surgical scene understanding framework that bridges high-level reasoning and low-level visual grounding within a single model. Given surgical videos, SurgMLLM fine-tunes a multimodal large language model (MLLM) to support structured interpretability reasoning, which is used to jointly model phases, instrument-verb-target (IVT) triplets, and…
