DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents
Shengqin Wang, Wentao Yan, Huichi Zhou, Yihang Chen, Kun Shao, Zhizhong Zhang, Yuan Xie

TL;DR
This paper introduces the Deepening Reasoning MMSearchAgent, a framework that enhances multimodal search agents by improving trajectory advantage signals and reducing redundancy, leading to state-of-the-art performance.
Contribution
It proposes a novel deep reasoning framework that leverages structural proximity and dynamic reward calibration for better multimodal agent performance.
Findings
Achieved an 8.4% improvement over MMSearch-R1 on FVQA-test.
Constructed a multi-step reasoning dataset with 3602 high-quality QA pairs.
Demonstrated state-of-the-art results through extensive experiments.
Abstract
Agentic multimodal models have garnered significant attention for their ability to leverage external tools to tackle complex tasks. However, such agents often suffer from premature interaction collapse, which has two primary causes: 1) the terminal reward, typically attached to the last token, prevents the advantage from distinguishing trajectories with exploratory behavior; 2) excessively redundant context hinders the agent from absorbing useful feedback. To address these issues, we propose the Deepening Reasoning MMSearchAgent. The framework leverages structural proximity to derive advantage signals from all rollout trajectories in an entire batch, so that trajectories of different lengths are further encouraged, even when they contain the same correct answer. Additionally, differentiated Gaussian rewards are employed to dynamically calibrate…
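The first cause named in the abstract can be illustrated with a small sketch. It assumes a GRPO-style group-normalized advantage, which the abstract does not name, and uses made-up rollouts; the function name and the numbers are hypothetical and are not taken from the paper. With a single terminal reward per trajectory, a short rollout and a longer, more exploratory rollout that reach the same answer receive the same advantage.

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-6):
    """Assumed GRPO-style advantage: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rollouts for one question: (number of tool-call turns, terminal reward).
rollouts = [(1, 1.0), (4, 1.0), (2, 0.0), (5, 0.0)]

advantages = group_normalized_advantages([reward for _, reward in rollouts])

for (turns, reward), adv in zip(rollouts, advantages):
    print(f"turns={turns} reward={reward} advantage={adv:.3f}")
```

The 1-turn and 4-turn correct rollouts end up with identical advantages, so nothing in this signal favors the rollout that interacted more. This is the kind of gap the abstract says the proposed batch-level, structure-aware advantage is meant to close; the sketch does not implement that method.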
