VSearcher: Long-Horizon Multimodal Search Agent via Reinforcement Learning
Ruiyang Zhang, Qianguo Sun, Chao Song, Yiyan Qi, Zhedong Zheng

TL;DR
VSearcher is a reinforcement learning-based multimodal search agent capable of long-horizon, multi-turn interactions with web environments, significantly improving multimodal web search performance over existing models.
Contribution
The paper introduces VSearcher, a multimodal search agent trained with reinforcement learning, together with a new data synthesis pipeline (Iterative Injection Data Synthesis) and a specialized benchmark for evaluating multimodal search capabilities.
Findings
VSearcher outperforms recent multimodal search agents.
VSearcher surpasses some proprietary models on web search tasks.
The proposed methods improve multi-turn, multimodal web interaction.
Abstract
Large models are increasingly becoming autonomous agents that interact with real-world environments and use external tools to augment their static capabilities. However, most recent progress has focused on text-only large language models, which are limited to a single modality and therefore have narrower application scenarios. On the other hand, multimodal large models, while offering stronger perceptual capabilities, remain limited to static knowledge and lack the ability to access and leverage up-to-date web information. In this paper, we propose VSearcher, which uses reinforcement learning to turn a static multimodal model into a multimodal search agent capable of long-horizon, multi-turn tool use in real-world web environments, including text search, image search, and web browsing. Specifically, we introduce the Iterative Injection Data Synthesis pipeline to generate large-scale, complex…
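The multi-turn tool use described in the abstract can be pictured as a loop in which the agent repeatedly picks one of three tools (text search, image search, web browsing), reads the result, and eventually returns an answer. The Python sketch below is purely illustrative and is not the authors' implementation: the tool functions are stubs, and every name in it (`run_episode`, `scripted_policy`, `Action`, `max_turns`) is hypothetical. In a trained agent, the policy would be the multimodal model itself rather than a fixed script.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


# Hypothetical stub tools. A real agent would call a search engine,
# a reverse-image search service, and a web browser here.
def text_search(query: str) -> str:
    return f"[text results for: {query}]"


def image_search(image_ref: str) -> str:
    return f"[pages matching image: {image_ref}]"


def browse(url: str) -> str:
    return f"[page content of: {url}]"


TOOLS: Dict[str, Callable[[str], str]] = {
    "text_search": text_search,
    "image_search": image_search,
    "browse": browse,
}


@dataclass
class Action:
    name: str      # a key of TOOLS, or "answer" to end the episode
    argument: str  # tool input, or the final answer text


History = List[Tuple[Action, str]]
Policy = Callable[[History], Action]


def run_episode(policy: Policy, max_turns: int = 8) -> Tuple[Optional[str], History]:
    """Run one multi-turn episode; return (answer or None, interaction history)."""
    history: History = []
    for _ in range(max_turns):
        action = policy(history)
        if action.name == "answer":
            return action.argument, history
        tool = TOOLS.get(action.name)
        observation = (
            tool(action.argument) if tool else f"[unknown tool: {action.name}]"
        )
        history.append((action, observation))
    # Turn budget exhausted without a final answer.
    return None, history


def scripted_policy(history: History) -> Action:
    """Stand-in for a learned policy: follows a fixed script of actions."""
    script = [
        Action("image_search", "query.jpg"),
        Action("text_search", "landmark name"),
        Action("browse", "https://example.com"),
        Action("answer", "example answer"),
    ]
    return script[min(len(history), len(script) - 1)]


if __name__ == "__main__":
    answer, history = run_episode(scripted_policy)
    for action, observation in history:
        print(action.name, "->", observation)
    print("answer:", answer)
```

The `max_turns` cap is one simple way to bound a long-horizon episode. How VSearcher itself terminates episodes or assigns rewards is not stated in the truncated abstract, so the sketch makes no claim about either.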
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
