Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning
Bob Zhang, Haoran Li, Tao Zhang, Jianan Li, Cilin Yan, Xikai Liu, Jiayin Cai, Yanbin Hao

TL;DR
This paper enhances multi-image grounding in multimodal large language models by using reinforcement learning and chain-of-thought data to improve reasoning capabilities, achieving significant performance gains.
Contribution
It introduces a reinforcement learning-based post-training method with chain-of-thought data synthesis and rule-based RL to improve multi-image reasoning in MLLMs.
Findings
Achieved +9.04% on MIG-Bench
Achieved +4.41% average improvement across seven benchmarks
Demonstrated effectiveness of RL-based training for multi-image reasoning
Abstract
Multimodal Large Language Models (MLLMs) perform well in single-image visual grounding but struggle with real-world tasks that demand cross-image reasoning and multi-modal instructions. To address this, we adopt a reinforcement learning (RL) based post-training strategy for MLLMs in multi-image grounding tasks. We first synthesize high-quality chain-of-thought (CoT) data for cold-start initialization, followed by supervised fine-tuning (SFT) using low-rank adaptation (LoRA). Subsequently, we apply rejection sampling with the merged SFT model to curate reliable RL data and use rule-based RL to guide the model toward optimal reasoning paths. Extensive experiments demonstrate the effectiveness of our approach, achieving +9.04% on MIG-Bench and +4.41% on average across seven out-of-domain benchmarks.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
