DR$^{3}$-Eval: Towards Realistic and Reproducible Deep Research Evaluation

Qianqian Xie; Qingheng Xiong; He Zhu; Tiantian Xia; Xueming Han; Fanyu Meng; Jiakai Wang; Zhiqi Bai; Chengkang Jiang; Zhaohui Wang; Yubin Guo; Yuqing Wen; Jiayang Mao; Zijie Zhang; Shihao Li; Yanghai Wang; Yuxiang Ren; Junlan Feng; Jiaheng Liu

arXiv:2604.14683 · cs.AI · April 17, 2026

TL;DR

DR$^{3}$-Eval is a new benchmark designed to realistically and reproducibly evaluate deep research agents on complex, multimodal report generation tasks using authentic data and a comprehensive evaluation framework.

Contribution

The paper introduces a realistic, verifiable benchmark and a multi-dimensional evaluation framework for assessing deep research agents in complex, web-like environments.

Findings

01. DR$^{3}$-Eval is highly challenging for current agents.
02. The benchmark reveals critical issues in retrieval robustness.
03. The evaluation correlates well with human judgments.

Abstract

Deep Research Agents (DRAs) aim to solve complex, long-horizon research tasks involving planning, retrieval, multimodal understanding, and report generation, yet their evaluation remains challenging due to dynamic web environments and ambiguous task definitions. We propose DR$^{3}$-Eval, a realistic and reproducible benchmark for evaluating deep research agents on multimodal, multi-file report generation. DR$^{3}$-Eval is constructed from authentic user-provided materials and paired with a per-task static research sandbox corpus that simulates open-web complexity while remaining fully verifiable, containing supportive documents, distractors, and noise. Moreover, we introduce a multi-dimensional evaluation framework measuring Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, and Depth Quality, and validate its alignment with human judgments. Experiments with…
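
To make the sandbox idea concrete, the following is a minimal sketch of how a per-task static corpus with supportive documents, distractors, and noise could be represented, and how a report's citations might be checked against it. Everything here is an assumption for illustration: the names `Role`, `SandboxDoc`, and `retrieval_stats` are hypothetical, and the two statistics are generic set-overlap measures, not the benchmark's actual definitions of Information Recall or Citation Coverage.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    """Hypothetical labels for the three document types named in the abstract."""
    SUPPORTIVE = "supportive"
    DISTRACTOR = "distractor"
    NOISE = "noise"


@dataclass(frozen=True)
class SandboxDoc:
    """One document in a static, per-task sandbox corpus (illustrative schema)."""
    doc_id: str
    role: Role


def retrieval_stats(corpus, cited_ids):
    """Compare the documents a report cites against the supportive set.

    Assumed definitions (not taken from the paper):
      supportive_recall  = cited supportive docs / all supportive docs
      citation_precision = cited supportive docs / all cited docs in the corpus
    """
    known = {d.doc_id for d in corpus}
    supportive = {d.doc_id for d in corpus if d.role is Role.SUPPORTIVE}
    cited = set(cited_ids) & known  # ignore citations to unknown documents
    hits = cited & supportive
    recall = len(hits) / len(supportive) if supportive else 0.0
    precision = len(hits) / len(cited) if cited else 0.0
    return {"supportive_recall": recall, "citation_precision": precision}


corpus = [
    SandboxDoc("d1", Role.SUPPORTIVE),
    SandboxDoc("d2", Role.SUPPORTIVE),
    SandboxDoc("d3", Role.DISTRACTOR),
    SandboxDoc("d4", Role.NOISE),
]
# A report that cites one supportive document and one distractor.
stats = retrieval_stats(corpus, ["d1", "d3"])
print(stats)
```

Because the corpus is fixed per task, the same report yields the same statistics on every run, which is the reproducibility property that a live web search cannot guarantee.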

Code & Models

Datasets

NJU-LINK/DR3-Eval (dataset, 2.5k downloads)