Why Your Deep Research Agent Fails? On Hallucination Evaluation in Full Research Trajectory
Yuhao Zhan, Tianyu Fan, Linxuan Huang, Zirui Guo, Chao Huang

TL;DR
This paper introduces a process-aware evaluation framework for Deep Research Agents (DRAs) that diagnoses hallucinations across the full research trajectory, exposing systemic weaknesses and pointing to where future improvements are needed.
Contribution
It proposes the PIES Taxonomy and a fine-grained evaluation framework to analyze hallucinations in DRAs along functional and error dimensions.
Findings
No current DRA achieves robust reliability.
Hallucination propagation and biases are key failure sources.
DeepHalluBench effectively isolates hallucination-prone tasks.
Abstract
Diagnosing the failure mechanisms of Deep Research Agents (DRAs) remains a critical challenge. Existing benchmarks predominantly rely on end-to-end evaluation, obscuring critical intermediate hallucinations, such as flawed planning, that accumulate throughout the research trajectory. To bridge this gap, we propose a shift from outcome-based to process-aware evaluation by auditing the full research trajectory. We introduce the PIES Taxonomy to categorize hallucinations along functional components (Planning vs. Summarization) and error properties (Explicit vs. Implicit). We instantiate this taxonomy into a fine-grained evaluation framework that decomposes the trajectory to rigorously quantify these hallucinations. Leveraging this framework to isolate 100 distinctively hallucination-prone tasks including adversarial scenarios, we curate DeepHalluBench. Experiments on six state-of-the-art…
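As a rough illustration of the two axes the abstract names, the taxonomy can be read as a 2×2 grid: functional component (Planning vs. Summarization) crossed with error property (Explicit vs. Implicit). The Python sketch below is not the authors' implementation. The class names, the `step_index` field, and the sample labels are all hypothetical, chosen only to show how per-step labels along a trajectory might be tallied per cell.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Component(Enum):
    """Functional component in which a hallucination arises."""
    PLANNING = "planning"
    SUMMARIZATION = "summarization"


class ErrorProperty(Enum):
    """Error property of the hallucination."""
    EXPLICIT = "explicit"
    IMPLICIT = "implicit"


@dataclass(frozen=True)
class HallucinationLabel:
    step_index: int  # hypothetical: position of the step in the trajectory
    component: Component
    error: ErrorProperty


def tally(labels):
    """Count labels per (component, error property) cell of the 2x2 grid."""
    return Counter((label.component, label.error) for label in labels)


# Made-up labels for a single trajectory, for illustration only.
labels = [
    HallucinationLabel(0, Component.PLANNING, ErrorProperty.EXPLICIT),
    HallucinationLabel(3, Component.SUMMARIZATION, ErrorProperty.IMPLICIT),
    HallucinationLabel(5, Component.SUMMARIZATION, ErrorProperty.IMPLICIT),
]
counts = tally(labels)
```

Keeping the step index on each label is one possible way to preserve where in the trajectory an error occurred, which a process-aware audit would presumably need; how the paper actually records this is not stated in the abstract.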
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI
