Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery

Guankun Wang, Long Bai, Wan Jun Nah, Jie Wang, Zhaoxi Zhang, Zhen Chen, Jinlin Wu, Mobarakol Islam, Hongbin Liu, and Hongliang Ren

arXiv:2405.10948 · cs.CV · March 18, 2025 · 5 cites

Open Access

TL;DR

Surgical-LVLM is a novel large vision-language model designed for complex surgical visual question answering and grounding, improving understanding of intricate visual-language tasks in robotic surgery.

Contribution

It introduces a personalized LVLM with VP-LoRA blocks and the TIT module, enhancing complex scenario understanding and visual grounding in surgical contexts.

Findings

1. Sets new performance benchmarks on EndoVis datasets.
2. Effectively models long-range dependencies in surgical VQA.
3. Improves multimodal alignment in complex surgical scenes.

Abstract

Recent advancements in Surgical Visual Question Answering (Surgical-VQA) and related region grounding have shown great promise for robotic and medical applications, addressing the critical need for automated methods in personalized surgical mentorship. However, existing models primarily provide simple structured answers and struggle with complex scenarios due to their limited capability in recognizing long-range dependencies and aligning multimodal information. In this paper, we introduce Surgical-LVLM, a novel personalized large vision-language model tailored for complex surgical scenarios. Leveraging the pre-trained large vision-language model and specialized Visual Perception LoRA (VP-LoRA) blocks, our model excels in understanding complex visual-language tasks within surgical contexts. In addressing the visual grounding task, we propose the Token-Interaction (TIT) module, which…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques