Surgery-R1: Advancing Surgical-VQLA with Reasoning Multimodal Large Language Model via Reinforcement Learning

Pengfei Hao; Shuaibo Li; Hongqiu Wang; Zhizhuo Kou; Junhang Zhang; Guang Yang; Lei Zhu

arXiv:2506.19469·cs.CV·June 25, 2025

Open Access

TL;DR

This paper introduces Surgery-R1, a reasoning multimodal large language model for surgical scene understanding. It improves interpretability and reasoning on Surgical-VQLA tasks through the new Surgery-R1-54k dataset and a two-stage fine-tuning process that combines supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT).

Contribution

The paper presents the first reasoning multimodal large language model for Surgical-VQLA, along with the Surgery-R1-54k dataset (paired Visual-QA, Grounding-QA, and Chain-of-Thought data) and a two-stage fine-tuning approach that improves reasoning and interpretability.

Findings

01. Surgery-R1 outperforms existing models on Surgical-VQLA tasks.

02. The two-stage fine-tuning enhances reasoning capabilities.

03. The Multimodal Coherence reward improves training efficiency.

Abstract

In recent years, significant progress has been made in surgical scene understanding, particularly in the task of Visual Question Localized-Answering in robotic surgery (Surgical-VQLA). However, existing Surgical-VQLA models lack deep reasoning capabilities and interpretability in surgical scenes, which limits their reliability and their potential for development in clinical applications. To address this issue, and inspired by the development of Reasoning Multimodal Large Language Models (MLLMs), we first build the Surgery-R1-54k dataset, which includes paired data for Visual-QA, Grounding-QA, and Chain-of-Thought (CoT). We then propose Surgery-R1, the first Reasoning MLLM for Surgical-VQLA. In Surgery-R1, we design a two-stage fine-tuning mechanism that equips the base MLLM with complex reasoning abilities by combining supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT).…
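Because Surgical-VQLA pairs each textual answer with a localized image region, a prediction is naturally judged on both parts. The sketch below illustrates one common way to do that, using exact-match on the answer and intersection-over-union (IoU) on the box. It is an illustration only: the `(x_min, y_min, x_max, y_max)` box format, the `VQLAPrediction` type, and the `score_prediction` helper are hypothetical names chosen here, and nothing in it is taken from the paper's metrics or reward design.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union

# Assumed box format (not from the paper): (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


@dataclass
class VQLAPrediction:
    """Hypothetical container for one localized answer."""
    answer: str
    box: Box


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if any.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    # Degenerate boxes have zero union; report no overlap.
    return inter / union if union > 0 else 0.0


def score_prediction(
    pred: VQLAPrediction, gt_answer: str, gt_box: Box
) -> Dict[str, Union[bool, float]]:
    """Score the answer (case-insensitive exact match) and the box (IoU)."""
    answer_ok = pred.answer.strip().lower() == gt_answer.strip().lower()
    return {"answer_correct": answer_ok, "iou": box_iou(pred.box, gt_box)}


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    pred = VQLAPrediction(answer="Grasping", box=(0.0, 0.0, 2.0, 2.0))
    print(score_prediction(pred, "grasping", (1.0, 1.0, 3.0, 3.0)))
```

Keeping the answer check and the localization check as separate fields makes it easy to report them independently or to combine them later under whatever weighting a given evaluation calls for.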

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education