Training Multi-Image Vision Agents via End2End Reinforcement Learning
Chengqi Dong, Chuhuai Yue, Hang He, Rongge Mao, Fenghe Tang, S Kevin Zhou, Zekun Xu, Xiaohan Wang, Jiajun Chai, Guojun Yin

TL;DR
This paper introduces IMAgent, a reinforcement learning-based visual agent capable of multi-image reasoning, with novel tools for visual reflection and verification, achieving state-of-the-art results without supervised fine-tuning.
Contribution
The paper presents a multi-image visual agent trained end-to-end with reinforcement learning, together with tools for attention management and a new multi-image QA dataset.
Findings
IMAgent achieves state-of-the-art performance on multiple benchmarks.
Tool usage improves attention focus and reasoning accuracy.
The approach eliminates the need for supervised fine-tuning data.
Abstract
Recent VLM-based agents aim to replicate OpenAI o3's "thinking with images" via tool use, yet most open-source methods restrict inputs to a single image, limiting their applicability to real-world multi-image QA tasks. To address this gap, we propose IMAgent, an open-source visual agent trained with end-to-end reinforcement learning for fine-grained single/multi-image reasoning. During inference, VLMs tend to gradually neglect visual inputs; to mitigate this issue, we design two dedicated tools for visual reflection and verification, enabling the model to actively refocus attention on image content. Beyond that, we show for the first time how tool usage enhances agent performance from an attention perspective. Equipped with a carefully designed two-layer motion trajectory masking strategy and tool-use reward gain, IMAgent acquires an effective tool-use paradigm through pure…
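The abstract names a trajectory masking strategy and a tool-use reward gain but does not spell out either. The sketch below shows one common way such pieces are built in agentic RL, and it is only an illustration under stated assumptions, not the paper's method: tokens returned by a tool are excluded from the policy loss, and a correct answer that involved a tool call earns a small bonus. The function names, the `gain` value of 0.5, and the rule that the bonus applies only to correct answers are all hypothetical.

```python
from typing import Sequence


def masked_policy_loss(
    log_probs: Sequence[float],
    advantages: Sequence[float],
    is_model_token: Sequence[bool],
) -> float:
    """Mean REINFORCE-style loss over model-generated tokens only.

    ASSUMPTION: tokens inserted by a tool (is_model_token == False) are
    masked out so the policy is not trained to reproduce tool output.
    """
    if not (len(log_probs) == len(advantages) == len(is_model_token)):
        raise ValueError("all inputs must have the same length")
    kept = [
        -lp * adv
        for lp, adv, keep in zip(log_probs, advantages, is_model_token)
        if keep
    ]
    # A trajectory with no trainable tokens contributes no loss.
    return sum(kept) / len(kept) if kept else 0.0


def reward_with_tool_gain(correct: bool, used_tool: bool, gain: float = 0.5) -> float:
    """Outcome reward plus a bonus for correct answers that used a tool.

    ASSUMPTION: the bonus is granted only when the answer is correct, so
    calling a tool cannot compensate for a wrong answer.
    """
    if not correct:
        return 0.0
    return 1.0 + (gain if used_tool else 0.0)


# Three-token toy trajectory: the middle token came from a tool.
loss = masked_policy_loss([-1.0, -2.0, -3.0], [1.0, 1.0, 1.0], [True, False, True])
```

In this toy trajectory only the first and third tokens enter the average, so the tool-returned token has no influence on the gradient.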
