How Far Are We from Genuinely Useful Deep Research Agents?

Dingling Zhang; He Zhu; Jincheng Ren; Kangqi Song; Xinran Zhou; Boyu Feng; Shudong Liu; Jiabin Luo; Weihao Xie; Zhaohui Wang; Tianrui Qin; King Zhu; Yuqing Wang; Qianben Chen; Yuchen Eleanor Jiang; Wei Wang; Jiaheng Liu; Wangchunshu Zhou

arXiv:2512.01948·cs.CL·December 16, 2025

TL;DR

This paper introduces a new benchmark and failure taxonomy for Deep Research Agents, revealing current limitations in evidence integration and reasoning, and aiming to guide future improvements in automated research report generation.

Contribution

It presents FINDER, a standardized benchmark with structured research tasks, and DEFT, a detailed failure taxonomy for evaluating deep research agents.

Findings

01. Current DRAs excel at task understanding but struggle with evidence verification.

02. Most failures occur in reasoning, retrieval, and generation processes.

03. The benchmark enables more objective evaluation of DRA performance.

Abstract

Deep Research Agents (DRAs) aim to automatically produce analyst-level reports through iterative information retrieval and synthesis. However, most existing DRAs were validated on question-answering benchmarks, while research on generating comprehensive reports remains overlooked. Worse, current benchmarks for report synthesis suffer from task complexity and subjective metrics, which fail to reflect user demands and limit the practical utility of generated reports. To address these gaps, we present Fine-grained DEepResearch bench (FINDER), an enhanced benchmark consisting of 100 human-curated research tasks with 419 structured checklist items that standardize report structure, analytical depth, and factual grounding. Based on approximately 1,000 reports produced by mainstream DRAs, we further propose Deep rEsearch Failure Taxonomy (DEFT), the first failure taxonomy for deep research…

Taxonomy

Topics: Topic Modeling · Expert finding and Q&A systems · Text Readability and Simplification