High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning

Xinyu Huang; Yuhao Dong; Weiwei Tian; Bo Li; Rui Feng; Ziwei Liu

arXiv:2507.05920·cs.CV·April 21, 2026

TL;DR

This paper introduces Multi-turn Grounding-based Policy Optimization (MGPO), a reinforcement learning framework that trains large multi-modal models (LMMs) to focus on the relevant regions of a high-resolution image through iterative grounding, improving performance on visual question answering tasks.

Contribution

The paper presents a novel RL-based method enabling LMMs to learn visual grounding without explicit annotations, outperforming supervised fine-tuning and existing methods.

Findings

01

MGPO improves grounding capabilities by 5.4% on in-distribution data.

02

MGPO achieves 5.2% higher accuracy on out-of-distribution (OOD) benchmarks.

03

After MGPO post-training, the model surpasses GPT-4o on V* Bench.

Abstract

State-of-the-art large multi-modal models (LMMs) face challenges when processing high-resolution images, as these inputs are converted into an enormous number of visual tokens, many of which are irrelevant to the downstream task. In this paper, we propose Multi-turn Grounding-based Policy Optimization (MGPO), an end-to-end reinforcement learning (RL) framework that enables LMMs to iteratively focus on key visual regions by automatically cropping sub-images based on model-predicted grounding coordinates within a multi-turn conversation framework. Compared to supervised fine-tuning (SFT), which requires costly additional grounding annotations, our approach shows that robust grounding abilities can emerge in LMMs during the RL training process, using only a binary reward function derived from the correctness of the final answer. Additionally, we observe that LMMs struggle to autonomously…
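The two mechanisms named in the abstract, cropping a sub-image from model-predicted grounding coordinates and scoring a rollout with a binary reward on the final answer, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names `scale_box` and `binary_reward`, the assumption that the model predicts its box on a downscaled copy of the image, and the use of case-insensitive exact match as the correctness check are all hypothetical choices made for illustration.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def scale_box(box: Box, model_size: Tuple[int, int], full_size: Tuple[int, int]) -> Box:
    """Map a box predicted on a downscaled input onto the full-resolution image.

    Assumption (not from the source): the model sees a resized copy of the
    image, so its coordinates must be rescaled before cropping the original.
    """
    model_w, model_h = model_size
    full_w, full_h = full_size
    sx, sy = full_w / model_w, full_h / model_h

    # Tolerate boxes whose corners were predicted in the wrong order.
    x1, x2 = sorted((box[0], box[2]))
    y1, y2 = sorted((box[1], box[3]))

    # Rescale, then clamp so the crop never leaves the image bounds.
    def clamp(value: float, upper: int) -> int:
        return min(max(round(value), 0), upper)

    return (clamp(x1 * sx, full_w), clamp(y1 * sy, full_h),
            clamp(x2 * sx, full_w), clamp(y2 * sy, full_h))


def binary_reward(predicted: str, reference: str) -> float:
    """Return 1.0 if the final answer matches the reference, else 0.0.

    Assumption: correctness is approximated by case-insensitive exact match.
    """
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0
```

In a multi-turn rollout of this kind, the scaled box would be used to crop the original image, the crop would be passed back to the model as the next turn, and only the final answer would be scored. No reward is attached to the box itself, which is what allows training without grounding annotations.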
