MedGround-R1: Advancing Medical Image Grounding via Spatial-Semantic Rewarded Group Relative Policy Optimization

Huihui Xu; Yuanpeng Nie; Hualiang Wang; Ying Chen; Wei Li; Junzhi Ning; Lihao Liu; Hongqiu Wang; Lei Zhu; Jiyao Liu; Xiaomeng Li; Junjun He

arXiv:2507.02994 · cs.LG · July 8, 2025

TL;DR

This paper introduces Spatial-Semantic Rewarded Group Relative Policy Optimization, a reinforcement learning approach for medical image grounding that trains models to localize image regions from textual descriptions without costly Chain-of-Thought reasoning annotations.

Contribution

The paper adapts the GRPO framework to Vision-Language Models, using spatial-semantic rewards and a Chain-of-Box template to improve medical image grounding without Chain-of-Thought annotations.

Findings

01. Achieves state-of-the-art results on three medical imaging datasets.

02. Effectively reasons about spatial regions during intermediate steps.

03. Validates each component's contribution through ablation studies.

Abstract

Medical Image Grounding (MIG), which involves localizing specific regions in medical images based on textual descriptions, requires models to not only perceive regions but also deduce spatial relationships of these regions. Existing Vision-Language Models (VLMs) for MIG often rely on Supervised Fine-Tuning (SFT) with large amounts of Chain-of-Thought (CoT) reasoning annotations, which are expensive and time-consuming to acquire. Recently, DeepSeek-R1 demonstrated that Large Language Models (LLMs) can acquire reasoning abilities through Group Relative Policy Optimization (GRPO) without requiring CoT annotations. In this paper, we adapt the GRPO reinforcement learning framework to VLMs for Medical Image Grounding. We propose the Spatial-Semantic Rewarded Group Relative Policy Optimization to train the model without CoT reasoning annotations. Specifically, we introduce Spatial-Semantic…
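
To make the group-relative idea concrete, the following is a minimal sketch, not the paper's implementation. It assumes that each sampled response yields one predicted bounding box, and it uses box IoU against the ground-truth box as a stand-in spatial reward; the abstract is truncated before the actual Spatial-Semantic rewards are defined, so this reward is purely illustrative. The advantage step follows the commonly described GRPO normalization (reward minus group mean, divided by group standard deviation). The names `box_iou`, `group_relative_advantages`, and the `eps` parameter are hypothetical.

```python
from statistics import mean, pstdev


def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    Assumption: population standard deviation plus a small eps for
    numerical stability; real implementations may differ in detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative data only: one ground-truth box and a group of four
# boxes sampled from the policy for the same image and text query.
target = (10.0, 10.0, 50.0, 50.0)
group = [
    (10.0, 10.0, 50.0, 50.0),    # exact match
    (20.0, 20.0, 60.0, 60.0),    # partial overlap
    (0.0, 0.0, 30.0, 30.0),      # partial overlap
    (70.0, 70.0, 90.0, 90.0),    # no overlap
]

rewards = [box_iou(box, target) for box in group]
advantages = group_relative_advantages(rewards)

for box, r, adv in zip(group, rewards, advantages):
    print(f"box={box} reward={r:.3f} advantage={adv:+.3f}")
```

Because advantages are computed relative to the other samples in the same group, no separate value model or step-by-step reasoning label is needed: responses that localize better than their siblings receive a positive learning signal, and worse ones a negative signal.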
