TL;DR
This paper introduces MGPO, a reinforcement learning framework that enhances large multi-modal models' ability to focus on relevant high-resolution image regions through iterative grounding, improving performance on visual question answering tasks.
Contribution
The paper presents a novel RL-based method enabling LMMs to learn visual grounding without explicit annotations, outperforming supervised fine-tuning and existing methods.
Findings
MGPO improves grounding capabilities by 5.4% on in-distribution data.
MGPO achieves 5.2% higher accuracy on out-of-distribution (OOD) benchmarks.
After MGPO post-training, the model surpasses GPT-4o on V* Bench.
Abstract
State-of-the-art large multi-modal models (LMMs) face challenges when processing high-resolution images, as these inputs are converted into an enormous number of visual tokens, many of which are irrelevant to the downstream task. In this paper, we propose Multi-turn Grounding-based Policy Optimization (MGPO), an end-to-end reinforcement learning (RL) framework that enables LMMs to iteratively focus on key visual regions by automatically cropping sub-images based on model-predicted grounding coordinates within a multi-turn conversation framework. Compared to supervised fine-tuning (SFT), which requires costly additional grounding annotations, our approach shows that robust grounding abilities can emerge in LMMs during RL training, using only a binary reward function derived from the correctness of the final answer. Additionally, we observe that LMMs struggle to autonomously…
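The two mechanisms named in the abstract, cropping a sub-image from predicted grounding coordinates and scoring a rollout with a binary reward, can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation. The `(x1, y1, x2, y2)` box format, the nested-list image, the helper names `clamp_box`, `crop`, and `binary_reward`, and the normalized exact-match check are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[int, int, int, int]


def clamp_box(box: Box, width: int, height: int) -> Box:
    """Clip a predicted box to the image bounds and order its corners."""
    x1, y1, x2, y2 = box
    x1, x2 = sorted((max(0, min(x1, width)), max(0, min(x2, width))))
    y1, y2 = sorted((max(0, min(y1, height)), max(0, min(y2, height))))
    return x1, y1, x2, y2


def crop(image: List[List[int]], box: Box) -> List[List[int]]:
    """Return the sub-image inside the (clamped) box.

    The image is a plain list of pixel rows to keep the sketch
    dependency-free; a real pipeline would use an image library.
    """
    height = len(image)
    width = len(image[0]) if image else 0
    x1, y1, x2, y2 = clamp_box(box, width, height)
    return [row[x1:x2] for row in image[y1:y2]]


def binary_reward(predicted: str, reference: str) -> float:
    """1.0 if the final answer matches the reference, else 0.0.

    Assumption: correctness is checked by case-insensitive exact match
    after trimming whitespace; the source does not specify the check.
    """
    same = predicted.strip().lower() == reference.strip().lower()
    return 1.0 if same else 0.0


# Toy 4x4 "image" whose pixel value encodes its position.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
sub_image = crop(image, (1, 1, 3, 3))
reward = binary_reward(" Cat ", "cat")
```

In a multi-turn setup, the cropped sub-image would be passed back to the model as the next conversation turn, and only the correctness of the final answer would determine the reward.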
