ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards
Wentao Yan, Shengqin Wang, Huichi Zhou, Yihang Chen, Kun Shao, Yuan Xie, Zhizhong Zhang

TL;DR
ProMMSearchAgent introduces a novel training paradigm for multimodal search agents that leverages process-oriented rewards and a decoupled environment to achieve zero-shot transfer and state-of-the-art performance.
Contribution
It proposes a Sim-to-Real training approach with an introspective reward mechanism, enabling effective policy learning in constrained environments for knowledge-intensive visual reasoning.
Findings
The policy trained in the local sandbox transfers zero-shot to the Google Search API.
It sets a new state of the art on the FVQA-test, InfoSeek, and MMSearch datasets.
It outperforms previous methods by +5.1%, +6.3%, and +11.3%.
Abstract
Training multimodal agents via reinforcement learning for knowledge-intensive visual reasoning is fundamentally hindered by the extreme sparsity of outcome-based supervision and the unpredictability of live web environments. To resolve these algorithmic and environmental bottlenecks, we introduce ProMMSearchAgent, establishing a novel Sim-to-Real training paradigm for multimodal search. We decouple policy learning from the live web by moving it into a deterministic, local static sandbox. Crucially, to learn effectively within this constrained environment, we propose an introspective process-oriented reward. By probing the agent's own parametric knowledge boundaries, we generate dense behavioral metadata that explicitly rewards the correct cognitive decision: initiating a multimodal or text search only when visually or factually uncertain. Extensive experiments demonstrate that our locally trained policy transfers…
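The introspective reward described in the abstract can be pictured with a small sketch. This is not the authors' implementation: the `Probe` record, the action names (`image_search`, `text_search`, `answer`), the bonus and penalty values, and the way step rewards are combined with the outcome reward are all assumptions made for illustration. The only idea taken from the abstract is that a search is rewarded only when the agent is visually or factually uncertain.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Probe:
    """Hypothetical result of probing the policy with no search tools available."""
    recognizes_image: bool  # did the model identify the visual entity unaided?
    knows_fact: bool        # could it answer once the entity is known?


def expected_action(probe: Probe) -> str:
    """Map the probed knowledge boundary to the decision we assume is 'correct'."""
    if not probe.recognizes_image:
        return "image_search"  # visually uncertain: multimodal search
    if not probe.knows_fact:
        return "text_search"   # factually uncertain: text search
    return "answer"            # nothing missing: answer directly


def process_reward(probe: Probe, action: str,
                   bonus: float = 1.0, penalty: float = -0.5) -> float:
    """Dense per-step reward; bonus and penalty are illustrative values."""
    return bonus if action == expected_action(probe) else penalty


def trajectory_reward(steps: List[Tuple[Probe, str]], outcome_correct: bool) -> float:
    """Assumed combination: sparse outcome reward plus the mean step reward."""
    outcome = 1.0 if outcome_correct else 0.0
    if not steps:
        return outcome
    dense = sum(process_reward(p, a) for p, a in steps) / len(steps)
    return outcome + dense
```

Under these assumptions, an agent that searches even though the probe shows it already knows the answer receives the penalty, which is one way to turn a sparse final-answer signal into step-level feedback.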
