TL;DR
This paper introduces InfoDeepSeek, a new benchmark and evaluation framework for assessing agentic information seeking in dynamic web environments, addressing limitations of existing static benchmarks.
Contribution
It presents a systematic methodology for creating challenging queries and a comprehensive evaluation framework tailored to real-world, dynamic information seeking scenarios.
Findings
Reveals nuanced agent behaviors in dynamic environments
Provides actionable insights for improving agentic RAG systems
Demonstrates the effectiveness of the benchmark across various models
Abstract
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by grounding responses with retrieved information. As an emerging paradigm, Agentic RAG further enhances this process by introducing autonomous LLM agents into the information seeking process. However, existing benchmarks fall short in evaluating such systems, as they are confined to a static retrieval environment with a fixed, limited corpus and simple queries that fail to elicit agentic behavior. Moreover, their evaluation protocols assess information seeking effectiveness by pre-defined gold sets of documents, making them unsuitable for the open-ended and dynamic nature of real-world web environments. To bridge this gap, we present InfoDeepSeek, a new benchmark with challenging questions designed for assessing agentic information seeking in real-world, dynamic web environments. We propose a systematic…
Peer Reviews
Decision·Submitted to ICLR 2026
1. Evaluating agentic RAG systems on the live web is novel and addresses a critical gap in prior static‑corpus evaluation pipelines; the curated benchmark is valuable to the community. 2. The proposed metrics (IA@k, EEU, and IC) are specifically designed to assess information‑seeking and evidence‑utilization capabilities of agentic RAG systems. 3. Thorough experiments on different aspects, including various LLMs, search engines, the number of search steps during inference, and predominant languages.T
1. **Reproducibility challenges in a dynamic web environment**: As the authors note, the real‑world web involves "massive document volume, content drift, URL decay", which motivates InfoDeepSeek. However, this also means that search results and webpage contents can change over time, and thus two researchers running the same agent at different times may obtain different outcomes if sources change or vanish, making results hard to reproduce and compare across time. 2. **LLM evaluation**: The c
1. The core idea for the paper is easy to understand. Notations are pretty clear in general. 2. The dataset may be useful for evaluating agentic search systems. 3. Some of the newly proposed metrics could be useful for evaluating agentic information retrieval.
1. The dataset does not introduce characteristics that current benchmarks lack. For example, FRAMES (Krishna et al., 2025) and BrowseComp (Wei et al., 2025) are designed to be difficult and require multi-turn retrieval. I think the paper should focus more on what past datasets lack and what the new contributions are. 2. The definitions of the metrics may not be very useful. For example, information accuracy (IA) does not actually entail “information accuracy”, because inaccurate informati
1. The motivation is solid, which clearly points out the drawbacks of existing agentic search benchmarks that rely on static corpus evaluation. 2. The data construction is relatively comprehensive, covering key aspects relevant to agentic information-seeking evaluation, such as domain diversity, question difficulty, and question types. 3. Experiments include multiple LLMs, search engines, and ablations (test-time scaling, retrieval interference), giving the work empirical depth.
1. The involvement of LLMs/agents in data filtering and construction may introduce bias. It is necessary to first examine the consistency between LLMs/agents and human annotators to ensure reliability, as well as to conduct multiple evaluation runs to verify reproducibility. 2. The paper did not report confidence intervals or variance analyses for the metric EEU, making its stability questionable, since a decrease in ACC could paradoxically lead to an increase in EEU. 3. The IC metric relies o
Taxonomy
Methods: Attention Is All You Need · Linear Warmup With Linear Decay · Attention Dropout · Softmax · WordPiece · Weight Decay · Dropout · Adam · Linear Layer
