WideSearch: Benchmarking Agentic Broad Info-Seeking

Ryan Wong; Jiawei Wang; Junjie Zhao; Li Chen; Yan Gao; Long Zhang; Xuan Zhou; Zuo Wang; Kai Xiang; Ge Zhang; Wenhao Huang; Yang Wang; Ke Wang

arXiv:2508.07999·cs.CL·August 29, 2025

TL;DR

WideSearch introduces a comprehensive benchmark to evaluate the reliability of large language model-based search agents in large-scale information collection tasks, revealing significant current deficiencies and guiding future improvements.

Contribution

The paper presents a new benchmark dataset, evaluation pipeline, and comprehensive analysis of agentic search systems for large-scale information seeking tasks.

Findings

01

Most systems achieve success rates near 0%.

02

The best system reaches only 5% success.

03

Human cross-validation can achieve near 100% success.

Abstract

From professional research to everyday planning, many tasks are bottlenecked by wide-scale information seeking, which is more repetitive than cognitively complex. With the rapid development of Large Language Models (LLMs), automated search agents powered by LLMs offer a promising solution to liberate humans from this tedious work. However, the capability of these agents to perform such "wide-context" collection reliably and completely remains largely unevaluated due to a lack of suitable benchmarks. To bridge this gap, we introduce WideSearch, a new benchmark engineered to evaluate agent reliability on these large-scale collection tasks. The benchmark features 200 manually curated questions (100 in English, 100 in Chinese) from over 15 diverse domains, grounded in real user queries. Each task requires agents to collect large-scale atomic information, which could be verified one by one…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

(1) The paper tackles an underexplored yet practically crucial dimension of agent evaluation—broad, high-fidelity information gathering—which complements existing reasoning- and synthesis-oriented benchmarks. The conceptual framing of WideSearch as the “breadth” counterpart to DeepSearch and DeepResearch is clear, coherent, and well-motivated. (2) The five-stage human-in-the-loop curation pipeline is rigorous and systematic, ensuring that all tasks are complex, verifiable, and genuinely depende

Weaknesses

The benchmark construction and evaluation are solid in general. It would be great if the authors could further conduct quantitative analysis regarding the major challenges and failure patterns mentioned in sections 4.1 and 4.2.

Reviewer 02 · Rating 4 · Confidence 5

Strengths

The benchmark differs from previous ones: it extends the commonly used BrowseComp-style simple-answer questions to broad information gathering.

Weaknesses

- Most importantly, even though the paper's motivation is very clear and reasonable, it remains a straightforward extension of the existing BrowseComp-style benchmarks (from a simple fact to a list of facts). Despite its potential usefulness (given the success of BrowseComp), it's hard to admire from a research perspective. Also, it's still constrained to this very specific type of task, for the convenience of evaluation.
- Regarding the experiments:
  - 1) the tasks appear to be too challenging

Reviewer 03 · Rating 6 · Confidence 4

Strengths

Overall, it's a very good work that extends the frontiers of evaluating newly emerged deep research systems and tools. 1. This paper proposes a novel dataset aimed at breadth and completeness across many entities. Rigorous five-stage human-centered curation ensures realistic and verifiable tasks. 2. Results expose a fundamental limitation of current agents—completeness at scale—and show multi-agent setups help but don’t solve it, offering a concrete target for future research.

Weaknesses

1. The evaluation protocol might not be robust enough. This paper uses markdown format as a protocol, applies several fuzzy or exact matches to numbers, dates, and URLs, and finally applies LLM-as-a-judge to evaluate complex answers. This paradigm relies on pre-crafted ground truths and would be sensitive to queries whose answers might change over time (for example, the top-selling electric toothbrush in the past week). 2. The current success criterion may be too brittle, which requires perfect
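The protocol this review describes (answers returned as markdown tables, cells compared against a pre-crafted ground truth, and an all-or-nothing success criterion) can be illustrated with a minimal Python sketch. This is not the WideSearch evaluation pipeline: the function names `parse_markdown_table`, `normalize`, and `score_table` are hypothetical, only exact matching with light numeric normalization is shown, and the fuzzy date/URL matching and LLM-as-a-judge steps are omitted.

```python
from typing import Dict, List


def parse_markdown_table(text: str) -> List[Dict[str, str]]:
    """Parse a simple pipe-delimited markdown table into a list of row dicts."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]

    def split(ln: str) -> List[str]:
        return [cell.strip() for cell in ln.strip("|").split("|")]

    header = split(lines[0])
    # lines[1] is the |---|---| separator row, so data starts at index 2.
    return [dict(zip(header, split(ln))) for ln in lines[2:]]


def normalize(value: str) -> str:
    """Case-fold text; render numbers canonically so '7,000' equals '7000'."""
    v = value.strip().casefold()
    try:
        return repr(float(v.replace(",", "")))
    except ValueError:
        return v


def score_table(pred: List[Dict[str, str]],
                gold: List[Dict[str, str]],
                key: str) -> Dict[str, object]:
    """Compare predicted rows to gold rows, aligned on the `key` column.

    Returns the fraction of gold cells reproduced (cell_recall) and a strict
    success flag that is True only if every gold cell matches and the
    prediction contains no extra rows.
    """
    gold_by_key = {normalize(r[key]): r for r in gold}
    pred_by_key = {normalize(r[key]): r for r in pred if key in r}

    total = sum(len(r) for r in gold_by_key.values())
    correct = 0
    for k, gold_row in gold_by_key.items():
        pred_row = pred_by_key.get(k, {})
        for col, gold_val in gold_row.items():
            if col in pred_row and normalize(pred_row[col]) == normalize(gold_val):
                correct += 1

    cell_recall = correct / total if total else 0.0
    success = correct == total and len(pred_by_key) == len(gold_by_key)
    return {"cell_recall": cell_recall, "success": success}
```

Under these assumptions, a prediction that gets five of six gold cells right has a cell recall of about 0.83 but still counts as a failure under the strict criterion, which is the brittleness the reviewer points to.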

Code & Models

Datasets

ByteDance-Seed/WideSearch
dataset · 9.4k downloads

Taxonomy

Topics: Advanced Memory and Neural Computing · Neural Networks and Reservoir Computing · Ferroelectric and Negative Capacitance Devices