WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent

Xinyu Geng; Peng Xia; Zhen Zhang; Xinyu Wang; Qiuchen Wang; Ruixue Ding; Chenxi Wang; Jialong Wu; Yida Zhao; Kuan Li; Yong Jiang; Pengjun Xie; Fei Huang; Jingren Zhou

arXiv:2508.05748 · cs.IR · September 3, 2025

TL;DR

WebWatcher is a multimodal vision-language agent designed for deep research tasks, leveraging synthetic data, tool integration, and reinforcement learning to enhance reasoning and outperform existing models in complex visual-textual benchmarks.

Contribution

Introduces WebWatcher, a multimodal agent with enhanced visual-language reasoning, cold-started on synthetic multimodal trajectories and further trained with reinforcement learning to generalize better on complex tasks.

Findings

01 WebWatcher outperforms baseline models in four VQA benchmarks.

02 Synthetic multimodal trajectories improve training efficiency.

03 Reinforcement learning enhances WebWatcher's generalization capabilities.

Abstract

Web agents such as Deep Research have demonstrated superhuman cognitive abilities, capable of solving highly challenging information-seeking problems. However, most research remains primarily text-centric, overlooking visual information in the real world. This makes multimodal Deep Research highly challenging, as such agents require much stronger reasoning abilities in perception, logic, and knowledge, as well as the use of more sophisticated tools, compared to text-based agents. To address this limitation, we introduce WebWatcher, a multimodal agent for Deep Research equipped with enhanced visual-language reasoning capabilities. It leverages high-quality synthetic multimodal trajectories for efficient cold-start training, utilizes various tools for deep reasoning, and further enhances generalization through reinforcement learning. To better evaluate the capabilities of multimodal agents, we…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning