Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery
Guankun Wang, Long Bai, Wan Jun Nah, Jie Wang, Zhaoxi Zhang, Zhen Chen, Jinlin Wu, Mobarakol Islam, Hongbin Liu, and Hongliang Ren

TL;DR
Surgical-LVLM is a novel large vision-language model designed for complex surgical visual question answering and grounding, improving understanding of intricate visual-language tasks in robotic surgery.
Contribution
The paper introduces a personalized large vision-language model (LVLM) that combines Visual Perception LoRA (VP-LoRA) blocks with a Token-Interaction (TIT) module, improving complex-scenario understanding and visual grounding in surgical contexts.
Findings
Sets new performance benchmarks on EndoVis datasets.
Effectively models long-range dependencies in surgical VQA.
Improves multimodal alignment in complex surgical scenes.
Abstract
Recent advancements in Surgical Visual Question Answering (Surgical-VQA) and related region grounding have shown great promise for robotic and medical applications, addressing the critical need for automated methods in personalized surgical mentorship. However, existing models primarily provide simple structured answers and struggle with complex scenarios due to their limited capability in recognizing long-range dependencies and aligning multimodal information. In this paper, we introduce Surgical-LVLM, a novel personalized large vision-language model tailored for complex surgical scenarios. Leveraging the pre-trained large vision-language model and specialized Visual Perception LoRA (VP-LoRA) blocks, our model excels in understanding complex visual-language tasks within surgical contexts. In addressing the visual grounding task, we propose the Token-Interaction (TIT) module, which…
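For readers unfamiliar with the LoRA idea that the VP-LoRA blocks build on, the following is a minimal sketch of a generic low-rank adapter around a frozen linear layer. It is not the paper's VP-LoRA design, which the abstract does not detail; the class name `LoRALinear`, the helper `matvec`, and the parameters `rank`, `alpha`, and `seed` are hypothetical choices made for illustration only.

```python
import random


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class LoRALinear:
    """Generic LoRA sketch: frozen weight W plus a low-rank update.

    Computes W x + (alpha / rank) * B (A x). This illustrates standard
    LoRA only; it is NOT the paper's VP-LoRA block.
    """

    def __init__(self, W, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # frozen pre-trained weight, shape (d_out, d_in)
        # A: small random init, shape (rank, d_in)
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)]
                  for _ in range(rank)]
        # B: zero init, shape (d_out, rank), so the adapter starts as a no-op
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)                   # frozen path
        delta = matvec(self.B, matvec(self.A, x))  # low-rank path
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        """Count adapter parameters (A and B only; W stays frozen)."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


if __name__ == "__main__":
    # Hypothetical 4x4 identity weight standing in for a pre-trained layer.
    W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    layer = LoRALinear(W, rank=2, alpha=4.0)
    print(layer.forward([1.0, 2.0, 3.0, 4.0]))
    print(layer.trainable_params())
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only `A` and `B` would be updated during fine-tuning. For realistic layer sizes, where `rank` is much smaller than the layer's dimensions, this is far fewer parameters than the full weight matrix.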
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
