Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning

Meng Cao; Haoze Zhao; Can Zhang; Xiaojun Chang; Ian Reid; Xiaodan Liang

arXiv:2505.20272 · cs.CV · February 4, 2026

TL;DR

Ground-R1 is a reinforcement learning framework for large vision-language models that uses a scale-aware objective, Scale Relative Policy Optimization (SRPO), to improve visual grounding. It targets the bias of training rewards toward large image regions, with the aim of more reliable and interpretable predictions.

Contribution

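The paper proposes Ground-R1, a thinking-with-images framework trained with SRPO in place of standard GRPO. SRPO recalibrates reward learning across visual evidence regions of different sizes, so that small but semantically critical regions are not suppressed by large ones, which improves grounding accuracy.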

Findings

01 Enhanced response accuracy on benchmarks
02 Improved evidence grounding consistency
03 Effective bias mitigation in visual reasoning

Abstract

Large Vision-Language Models (LVLMs) have become powerful general-purpose assistants, yet their predictions often lack reliability and interpretability due to insufficient grounding in visual evidence. The emerging thinking-with-images paradigm seeks to address this issue by explicitly anchoring reasoning to image regions. However, we empirically find that most existing methods suffer from a systematic scale-driven bias in optimization, where training rewards are dominated by large visual regions, suppressing learning from small but semantically critical evidence and leading to spurious grounding at inference time. To address this limitation, we propose Ground-R1, a de-biased thinking-with-images framework trained via a novel Scale Relative Policy Optimization (SRPO) objective that replaces standard GRPO. Specifically, our SRPO recalibrates reward learning across evidence regions of…
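
For context, the GRPO baseline that the abstract says SRPO replaces is commonly described as standardizing each sampled response's reward within its group. The sketch below shows only that baseline step, assuming one scalar reward per response; the function name `grpo_advantages`, the `eps` constant, and the use of the population standard deviation are illustrative choices, not taken from the paper. The SRPO recalibration itself is not specified in the text above and is not reproduced here.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one group.

    `rewards` holds one scalar reward per sampled response to the same prompt.
    `eps` (an assumed small constant) avoids division by zero when every
    response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four responses with binary correctness rewards.
group_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = grpo_advantages(group_rewards)
```

Because the advantages are computed relative to the group mean, responses with above-average reward get positive values and the rest get negative values; a group in which all rewards are equal yields zero advantage for every response.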

Code & Models

Models

prithivMLmods/Lumian2-VLR-7B-Thinking (Hugging Face model · 12 downloads · 5 likes)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications