Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning

Bob Zhang; Haoran Li; Tao Zhang; Jianan Li; Cilin Yan; Xikai Liu; Jiayin Cai; Yanbin Hao

arXiv:2507.00748 · cs.CV · April 14, 2026

TL;DR

This paper improves multi-image grounding in multimodal large language models by combining a chain-of-thought cold start with rule-based reinforcement learning, gaining +9.04% on MIG-Bench and +4.41% on average across seven out-of-domain benchmarks.

Contribution

It introduces a two-stage post-training method for multi-image grounding in MLLMs: supervised fine-tuning with LoRA on synthesized chain-of-thought data, followed by rule-based RL on data curated through rejection sampling.

Findings

01

Achieved +9.04% on MIG-Bench

02

Achieved +4.41% average improvement across seven out-of-domain benchmarks

03

Demonstrated effectiveness of RL-based training for multi-image reasoning

Abstract

Multimodal Large Language Models (MLLMs) perform well in single-image visual grounding but struggle with real-world tasks that demand cross-image reasoning and multi-modal instructions. To address this, we adopt a reinforcement learning (RL) based post-training strategy for MLLMs in multi-image grounding tasks. We first synthesize high-quality chain-of-thought (CoT) data for cold-start initialization, followed by supervised fine-tuning (SFT) using low-rank adaptation (LoRA). Subsequently, we apply rejection sampling with the merged SFT model to curate reliable RL data and use rule-based RL to guide the model toward optimal reasoning paths. Extensive experiments demonstrate the effectiveness of our approach, achieving +9.04% on MIG-Bench and +4.41% on average across seven out-of-domain benchmarks.
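The abstract does not spell out the reward rules or the rejection-sampling criterion, so the following is only a minimal sketch of how such a pipeline stage might look, not the authors' implementation. It assumes a `<think>...</think><answer>[x1, y1, x2, y2]</answer>` response format, a reward made of a format term plus an IoU-thresholded accuracy term, and a filter that keeps a prompt for RL only when the SFT model solves it in some sampled responses but not all. The names `rule_based_reward` and `keep_for_rl`, the tag format, and the 0.5 threshold are all hypothetical.

```python
import re
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

# Assumed response format: reasoning in <think>, one box in <answer>.
RESPONSE_RE = re.compile(
    r"^<think>.+?</think>\s*<answer>(.+?)</answer>$", re.DOTALL
)
NUM_RE = re.compile(r"-?\d+(?:\.\d+)?")


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def parse_box(response: str) -> Optional[Box]:
    """Return the predicted box, or None if the format rule is violated."""
    match = RESPONSE_RE.match(response.strip())
    if match is None:
        return None
    nums = NUM_RE.findall(match.group(1))
    if len(nums) != 4:
        return None
    x1, y1, x2, y2 = (float(n) for n in nums)
    return (x1, y1, x2, y2)


def rule_based_reward(response: str, gt: Box, iou_threshold: float = 0.5) -> float:
    """Assumed reward: 1.0 for valid format, plus 1.0 if IoU passes the threshold."""
    box = parse_box(response)
    if box is None:
        return 0.0
    accuracy = 1.0 if iou(box, gt) >= iou_threshold else 0.0
    return 1.0 + accuracy


def keep_for_rl(responses: List[str], gt: Box, iou_threshold: float = 0.5) -> bool:
    """Assumed rejection-sampling rule: keep prompts of intermediate difficulty,
    i.e. solved by at least one sampled response but not by all of them."""
    solved = sum(
        rule_based_reward(r, gt, iou_threshold) == 2.0 for r in responses
    )
    return 0 < solved < len(responses)
```

In this sketch, a prompt that every sample solves or that no sample solves is dropped, on the assumption that such prompts give little learning signal to a group-relative RL objective. Whether the paper uses this exact criterion is not stated in the abstract.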
