ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards

Wentao Yan; Shengqin Wang; Huichi Zhou; Yihang Chen; Kun Shao; Yuan Xie; Zhizhong Zhang

arXiv:2604.20486·cs.CV·April 23, 2026

TL;DR

ProMMSearchAgent trains a multimodal search agent in a deterministic local sandbox using process-oriented rewards; the resulting policy transfers zero-shot to the Google Search API and reaches state-of-the-art performance.

Contribution

The paper proposes a Sim-to-Real training approach with an introspective, process-oriented reward, enabling effective policy learning in a constrained environment for knowledge-intensive visual reasoning.

Findings

01

The locally trained policy transfers zero-shot to the Google Search API.

02

Achieved new state-of-the-art performance on FVQA-test, InfoSeek, and MMSearch datasets.

03

Outperformed previous methods by significant margins (+5.1%, +6.3%, +11.3%).

Abstract

Training multimodal agents via reinforcement learning for knowledge-intensive visual reasoning is fundamentally hindered by the extreme sparsity of outcome-based supervision and the unpredictability of live web environments. To resolve these algorithmic and environmental bottlenecks, we introduce ProMMSearchAgent, establishing a novel Sim-to-Real training paradigm for multimodal search. We decouple policy learning into a deterministic, local static sandbox. Crucially, to learn effectively within this constrained environment, we propose an introspective process-oriented reward. By probing the agent's own parametric knowledge boundaries, we generate dense behavioral metadata that explicitly rewards the correct cognitive decision, initiating a multimodal or text search only when visually or factually uncertain. Extensive experiments demonstrate that our locally-trained policy transfers…
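The reward idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Probe` type, the exact-match comparison, the 0.5 threshold, and the reward weights are all assumptions made for illustration, and the sketch collapses the multimodal/text search distinction into a single search decision. It shows one plausible reading: sample answers with tools disabled to estimate whether a question lies inside the model's parametric knowledge, then add a dense bonus when the agent's decision to search (or not) matches that estimate.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Probe:
    """Hypothetical record from probing the model with search tools disabled."""
    no_search_answers: Sequence[str]  # answers sampled without any search
    gold_answer: str


def needs_search(probe: Probe, threshold: float = 0.5) -> bool:
    """Assumed knowledge-boundary label: search is needed when fewer than
    `threshold` of the tool-free answers match the gold answer."""
    if not probe.no_search_answers:
        return True  # no evidence of parametric knowledge
    gold = probe.gold_answer.strip().lower()
    hits = sum(a.strip().lower() == gold for a in probe.no_search_answers)
    return hits / len(probe.no_search_answers) < threshold


def process_reward(did_search: bool, answer_correct: bool, probe: Probe,
                   outcome_weight: float = 1.0,
                   process_weight: float = 0.5) -> float:
    """Sparse outcome reward plus a dense bonus for the right search decision.
    The weights are illustrative, not values from the paper."""
    outcome = 1.0 if answer_correct else 0.0
    process = 1.0 if did_search == needs_search(probe) else 0.0
    return outcome_weight * outcome + process_weight * process


# Usage: the model already answers this question correctly without search,
# so skipping the search earns the process bonus.
known = Probe(no_search_answers=["Paris", "paris", "Lyon"], gold_answer="Paris")
unknown = Probe(no_search_answers=["Lyon", "Nice"], gold_answer="Paris")
reward_skip = process_reward(did_search=False, answer_correct=True, probe=known)
reward_search = process_reward(did_search=True, answer_correct=True, probe=known)
```

Under these assumptions, an unnecessary search on a known question still receives the outcome reward but forfeits the process bonus, which is what makes the signal denser than outcome-only supervision.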
