Characterizing Deep Research: A Benchmark and Formal Definition
Abhinav Java, Ashmit Khandelwal, Sukruta Midigeshi, Aaron Halfaker, Amit Deshpande, Navin Goyal, Ankur Gupta, Nagarajan Natarajan, Amit Sharma

TL;DR
This paper formally defines the deep research task, introduces a benchmark to evaluate it, and analyzes current systems' performance and reasoning strategies, highlighting challenges and future directions in complex search and reasoning tasks.
Contribution
It provides a formal characterization of deep research, a novel benchmark dataset, and an analysis of current system capabilities and limitations.
Findings
OpenAI's model achieves the highest overall F1 score of 0.55.
Current systems show wide variation in performance, with F1 scores ranging from 0.02 to 0.72 across sub-categories.
Analysis reveals common reasoning patterns and challenges in search mechanisms.
Abstract
Information tasks such as writing surveys or analytical reports require complex search and reasoning, and have recently been grouped under the umbrella of \textit{deep research} -- a term also adopted by recent models targeting these capabilities. Despite growing interest, the scope of the deep research task remains underdefined and its distinction from other reasoning-intensive problems is poorly understood. In this paper, we propose a formal characterization of the deep research (DR) task and introduce a benchmark to evaluate the performance of DR systems. We argue that the core defining feature of deep research is not the production of lengthy report-style outputs, but rather the high fan-out over concepts required during the search process, i.e., broad and reasoning-intensive exploration. To enable objective evaluation, we define DR using an intermediate output representation that…
Peer Reviews
Decision·ICLR 2026 Poster
The authors investigate the important task of Deep Research, which has broad applications in both academia and industry. They propose a new benchmark, LiveDRBench, which has 100 tasks and is used to evaluate how well DR agents, both open source and commercial, can identify the right claims and subclaims to produce a comprehensive, well-grounded report. The benchmark includes DR tasks across multiple categories, including Materials, SciFacts, and Geo.
The authors use LLM-as-a-judge, but there is no human evaluation to ensure alignment with human scores. For example, DeepResearchGym has a nice human evaluation process that validates its LLM-as-a-judge scores against human judgments [2]. The claim/subclaim idea is very similar to the one in Mind2Web 2 [1], where the evaluation agent checks whether specific claims are present in the report itself in a hierarchical manner. In fact, Mind2Web 2 seems to be more precise at identifying facts as it br
- This is a timely and relevant problem: the framing also separates information synthesis (claims) from long-form report generation, improving construct clarity for DR evaluation.
- Claim-based evaluation design directly targets correctness/completeness of substantive content rather than stylistic quality; strict metrics are well-motivated for enumeration tasks.
- LiveDRBench spans multiple realistic DR use cases, including science and public-interest scenarios; explicit effort to make benchmark
- Proprietary DR systems were evaluated via chat UIs with unspecified/variable budgets; non-DR baselines used APIs with fixed “≈30s” reasoning but DR systems seemingly ran with much larger, uncontrolled search/time budgets. This undermines comparability and may inflate proprietary DR performance relative to baselines and open-source agents.
- Single-run reporting without variance: no per-task variance, confidence intervals, or statistical tests. Given high run-to-run variability in browsing agen
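The missing confidence intervals noted in this review could be reported with a percentile bootstrap over per-task scores. The following is a minimal sketch under stated assumptions, not the paper's evaluation code: the function name `bootstrap_ci`, its parameters, and the example scores are all hypothetical.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean per-task score.

    `scores` is a list of per-task F1 values (hypothetical inputs here).
    Returns (mean, lower bound, upper bound).
    """
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    # Resample tasks with replacement and record the mean of each resample.
    resampled_means = sorted(
        mean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lower_index = int((alpha / 2) * n_resamples)
    upper_index = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return mean(scores), resampled_means[lower_index], resampled_means[upper_index]


# Illustrative per-task F1 scores; these are made-up numbers, not benchmark results.
example_scores = [0.0, 0.2, 0.5, 0.5, 0.8, 1.0, 0.3, 0.6]
print(bootstrap_ci(example_scores))
```

This resamples tasks only. Capturing the run-to-run variability of browsing agents would also require repeated runs per task, which the sketch does not model.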
- Clear problem decomposition. Separating “claims synthesis” from “report generation” is crisp and useful; the DAG framing of queries→evidence→claims aligns with how DR agents operate.
- Objective evaluation for multi-claim outputs. The nested claim/subclaim metric operationalizes “grounded completeness” better than stylistic LLM-judge scores used in other DR evaluations.
- Useful positioning vs. prior benchmarks. The paper contrasts LIVEDRBENCH with report-quality benchmarks (e.g., DeepResearch Ben
- Metric dependence on LLM judging and design choices. Claim agreement and “necessary query” identification rely on LLMs (GPT-4o) and bespoke prompts; the paper needs stronger validation (e.g., human adjudication on a stratified sample, inter-rater agreement, and sensitivity analyses for the “zero credit if any subclaim is wrong” rule).
- Reproducibility of proprietary model comparisons. Evaluations for commercial DR systems are done via chat UIs without API parity; differences in browsing tools, g
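To make the “zero credit if any subclaim is wrong” rule concrete, here is a minimal sketch of a strict nested claim/subclaim F1. It assumes claims are dictionaries with a `claim` string and a `subclaims` mapping, and it uses exact string equality as a stand-in for the LLM judge described in the reviews. The function names and data layout are hypothetical, not the paper's implementation.

```python
def strict_match(predicted, gold):
    """A predicted claim earns credit only if the main claim agrees AND
    every gold subclaim is reproduced exactly (assumed stand-in for an LLM judge)."""
    if predicted["claim"] != gold["claim"]:
        return False
    predicted_subclaims = predicted.get("subclaims", {})
    return all(
        predicted_subclaims.get(key) == value
        for key, value in gold.get("subclaims", {}).items()
    )


def strict_f1(predicted_claims, gold_claims):
    """Precision, recall, and F1 where each gold claim can be matched at most once."""
    matched_gold = set()
    true_positives = 0
    for predicted in predicted_claims:
        for index, gold in enumerate(gold_claims):
            if index not in matched_gold and strict_match(predicted, gold):
                matched_gold.add(index)
                true_positives += 1
                break
    precision = true_positives / len(predicted_claims) if predicted_claims else 0.0
    recall = true_positives / len(gold_claims) if gold_claims else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up example: the second prediction has the right claim but a wrong subclaim.
gold = [
    {"claim": "material A", "subclaims": {"band_gap": "1.1 eV"}},
    {"claim": "material B", "subclaims": {"band_gap": "3.4 eV"}},
]
predicted = [
    {"claim": "material A", "subclaims": {"band_gap": "1.1 eV"}},
    {"claim": "material B", "subclaims": {"band_gap": "3.0 eV"}},
]
print(strict_f1(predicted, gold))  # (0.5, 0.5, 0.5)
```

A sensitivity analysis of the kind the reviewer requests could swap `strict_match` for a partial-credit variant and compare the resulting rankings.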
