Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge

Boyu Gou; Zanming Huang; Yuting Ning; Yu Gu; Michael Lin; Weijian Qi; Andrei Kopanev; Botao Yu; Bernal Jiménez Gutiérrez; Yiheng Shu; Chan Hee Song; Jiaman Wu; Shijie Chen; Hanane Nour Moussa; Tianshu Zhang; Jian Xie; Yifei Li; Tianci Xue; Zeyi Liao; Kai Zhang; Boyuan Zheng; Zhaowei Cai; Viktor Rozgic; Morteza Ziyadi; Huan Sun; Yu Su

arXiv:2506.21506 · cs.AI · July 4, 2025

Open Access · 1 dataset

TL;DR

Mind2Web 2 introduces a benchmark of 130 long-horizon tasks and an Agent-as-a-Judge evaluation framework for agentic web search systems, enabling assessment of complex, time-varying answers that existing benchmarks, built around short search horizons and static answers, do not cover.

Contribution

The paper presents Mind2Web 2, a benchmark of 130 realistic, long-horizon tasks, and an Agent-as-a-Judge evaluation framework for assessing complex agentic search systems.

Findings

01. OpenAI Deep Research achieves 50-70% of human performance.

02. The benchmark comprises 130 tasks, constructed with over 1000 hours of human labor.

03. The evaluation framework assesses both answer correctness and source attribution.

Abstract

Agentic search such as Deep Research systems, where agents autonomously browse the web, synthesize information, and return comprehensive citation-backed answers, represents a major shift in how users interact with web-scale information. While promising greater efficiency and cognitive offloading, the growing complexity and open-endedness of agentic search have outpaced existing evaluation benchmarks and methodologies, which largely assume short search horizons and static answers. In this paper, we introduce Mind2Web 2, a benchmark of 130 realistic, high-quality, and long-horizon tasks that require real-time web browsing and extensive information synthesis, constructed with over 1000 hours of human labor. To address the challenge of evaluating time-varying and complex answers, we propose a novel Agent-as-a-Judge framework. Our method constructs task-specific judge agents based on a…

Code & Models

Datasets

osunlp/Mind2Web-2
dataset · 187 downloads

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Artificial Intelligence in Law · Auction Theory and Applications