VeriWeb: Verifiable Long-Chain Web Benchmark for Agentic Information-Seeking
Shunyu Liu, Minghao Liu, Huichi Zhou, Zhenyu Cui, Yang Zhou, Yuhao Zhou, Jialiang Gao, Heng Zhou, Yunhao Yang, Wendong Fan, Puzhen Zhang, Ge Zhang, Jiajun Shi, Weihao Xuan, Jiaxing Huang, Shuang Luo, Fang Wu, Heli Qi, Qingcheng Zeng, Junjie Wang, Aosong Feng, Jindi Lv

TL;DR
VeriWeb is a new benchmark designed to evaluate and improve large language model agents in complex, multi-hop web information-seeking tasks that require long-term reasoning and verifiable subtask solutions.
Contribution
The paper introduces VeriWeb, a comprehensive benchmark for long-chain web tasks emphasizing verifiability and multi-step reasoning, addressing limitations of previous single-fact retrieval benchmarks.
Findings
Significant performance gaps in current agents on long-horizon tasks
VeriWeb's structure enables detailed evaluation of multi-hop reasoning
Benchmark covers 302 tasks across five real-world domains
Abstract
Recent advances have showcased the extraordinary capabilities of Large Language Model (LLM) agents in tackling web-based information-seeking tasks. However, existing efforts mainly focus on single-fact retrieval and rely on outcome-only verification, thereby limiting their scalability in realistic knowledge-intensive scenarios that involve long-horizon web tasks requiring large-scale retrieval and synthesis of information from diverse sources. In this work, we introduce VeriWeb, a novel verifiable long-chain web benchmark designed to facilitate the evaluation and development of web agents within realistic web environments. Our benchmark emphasizes two critical dimensions: (1) long-chain complexity, encompassing both breadth- and depth-oriented search tasks to assess how effectively web agents ensure comprehensive information coverage and consistent context tracking in multi-hop…
Peer Reviews
Decision·Submitted to ICLR 2026
The primary contribution is the development of a benchmark that rigorously enforces two previously neglected dimensions: long-chain complexity (integrating breadth- and depth-oriented search) and subtask-level verifiability. This fine-grained decomposition is essential, providing an informative supervision signal and allowing for error localization, which outcome-only evaluation protocols fail to capture. The dataset, curated through a costly human-annotation process across diverse real-world do
This is a paper proposing a new web agent benchmark. However, a new benchmark must clearly state the problem it solves and rigorously demonstrate why this problem and the evaluation method are important, with the analysis of failure cases being able to guide the direction of field development. The problem this paper addresses is relatively clear, and the proposal of the dataset and its construction method also have value. However, there is no particularly detailed justification for why evaluatio
- Proposes a novel benchmark emphasizing both long-chain reasoning and verifiable subtasks.
- The dataset is diverse and human-annotated, covering five realistic domains.
- The experimental evaluation is comprehensive, testing multiple agent paradigms and models.
- The paper provides insightful analyses of action efficiency and task difficulty, helping identify weaknesses in current web agents.
- Unclear which LLM generated task instructions and subtasks.
- The reasonableness of subtask decomposition is not independently validated.
- Details of the human demonstration process (e.g., annotator number, quality checks, or fairness) are limited.
- Many tasks involve hundreds of steps, but efficiency guarantees or annotation consistency are not analyzed.
- The LLM-as-a-Judge metric may not align with human evaluation; human verification would strengthen credibility.
- Single-run experi
- Tasks are long-chain and information-dense, combining multi-hop retrieval and synthesis with subtask-level verifiability.
- The benchmark introduces several evaluation metrics, including task success rate, completion rate, and action efficiency.
- Human-annotated trajectories provide empirically grounded task structures.
- The benchmark’s subtask-level verifiability requires each sub-answer to be fixed and unambiguous, but real-world web tasks often involve context-dependent or time-sensitive information. This design choice may therefore underrepresent the uncertainty present in realistic settings.
- The absence of a human performance baseline makes it hard to interpret how well current agents perform relative to human proficiency on the same tasks.
- Evaluation only uses gpt-4o as the judge, with no analysis on
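The reviews above name three evaluation metrics (task success rate, completion rate, and action efficiency) without defining them in the excerpts shown. The following is a minimal sketch of how such metrics could be computed from per-subtask verdicts. The `TaskResult` record, its field names, and all three formulas are assumptions made for illustration and may differ from the paper's actual definitions; in particular, treating action efficiency as correct subtasks per action is a guess.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskResult:
    """Outcome of one agent run on one task (hypothetical record layout)."""
    subtask_correct: List[bool]  # per-subtask verdicts, e.g. from a judge
    final_correct: bool          # verdict on the final answer
    num_actions: int             # number of actions the agent took


def success_rate(results: List[TaskResult]) -> float:
    """Fraction of tasks whose final answer was judged correct (assumed)."""
    if not results:
        return 0.0
    return sum(r.final_correct for r in results) / len(results)


def completion_rate(results: List[TaskResult]) -> float:
    """Mean per-task fraction of subtasks judged correct (assumed)."""
    rates = [
        sum(r.subtask_correct) / len(r.subtask_correct)
        for r in results
        if r.subtask_correct  # skip tasks with no recorded subtasks
    ]
    return sum(rates) / len(rates) if rates else 0.0


def action_efficiency(results: List[TaskResult]) -> float:
    """Correct subtasks per action, pooled over all tasks (assumed)."""
    total_actions = sum(r.num_actions for r in results)
    if total_actions == 0:
        return 0.0
    return sum(sum(r.subtask_correct) for r in results) / total_actions


# Toy data, not taken from the benchmark.
results = [
    TaskResult([True, True, False, False], final_correct=False, num_actions=20),
    TaskResult([True, True], final_correct=True, num_actions=5),
]

print(success_rate(results))
print(completion_rate(results))
print(action_efficiency(results))
```

In this toy example the first run fails the task while still completing half of its subtasks, which is the kind of partial progress and error localization that subtask-level scoring exposes and outcome-only scoring hides.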
