ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks

Minghao Li; Ying Zeng; Zhihao Cheng; Cong Ma; Kai Jia

arXiv:2508.15804·cs.CL·August 25, 2025

TL;DR

ReportBench is a comprehensive benchmark for evaluating the quality, relevance, and factual accuracy of research reports generated by large language models, using domain-specific prompts and automated analysis based on high-quality survey papers.

Contribution

This paper introduces ReportBench, a novel systematic benchmark with an automated framework for assessing research report quality from LLMs, focusing on citation relevance and factual faithfulness.

Findings

01

Commercial research agents outperform standalone LLMs in report quality.

02

There is significant room for improvement in factual consistency and coverage.

03

Automated evaluation framework effectively analyzes citation and statement accuracy.

Abstract

The advent of Deep Research agents has substantially reduced the time required for conducting extensive research tasks. However, these tasks inherently demand rigorous standards of factual accuracy and comprehensiveness, necessitating thorough evaluation before widespread adoption. In this paper, we propose ReportBench, a systematic benchmark designed to evaluate the content quality of research reports generated by large language models (LLMs). Our evaluation focuses on two critical dimensions: (1) the quality and relevance of cited literature, and (2) the faithfulness and veracity of the statements within the generated reports. ReportBench leverages high-quality published survey papers available on arXiv as gold-standard references, from which we apply reverse prompt engineering to derive domain-specific prompts and establish a comprehensive evaluation corpus. Furthermore, we develop…
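The first evaluation dimension, the quality and relevance of cited literature, can be illustrated by comparing a generated report's citations against the reference list of the gold-standard survey. The following is a minimal sketch under stated assumptions, not the paper's actual implementation: the function names `normalize_title` and `citation_precision_recall`, the title-based matching rule, and the placeholder titles are all hypothetical.

```python
import re


def normalize_title(title: str) -> str:
    """Lowercase a title and collapse punctuation/whitespace.

    Assumption: matching by normalized title; the benchmark's real
    matching procedure is not described in this text.
    """
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def citation_precision_recall(generated, gold):
    """Return (precision, recall) of generated citations against a gold reference list."""
    gen = {normalize_title(t) for t in generated}
    ref = {normalize_title(t) for t in gold}
    overlap = gen & ref
    # Guard against empty inputs to avoid division by zero.
    precision = len(overlap) / len(gen) if gen else 0.0
    recall = len(overlap) / len(ref) if ref else 0.0
    return precision, recall


# Placeholder titles, purely illustrative.
generated_refs = ["Placeholder Paper A", "Placeholder Paper Z"]
gold_refs = [
    "placeholder paper a.",
    "Placeholder Paper B",
    "Placeholder Paper C",
    "Placeholder Paper D",
]
precision, recall = citation_precision_recall(generated_refs, gold_refs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case one of the two generated citations appears in the gold list, and one of the four gold references is recovered, so precision is higher than recall.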

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 2

Strengths

1. The paper focuses on a practical use case for deep research agents, writing survey papers, which is easier to evaluate and, given the current limitations of these agents, could be one of their major uses in scientific research. Without such tools, completing a survey paper takes significant manual effort (days or weeks). 2. Converting papers back into research questions with LLMs makes data collection efficient, more so than experts summarizing the

Weaknesses

1. While the paper includes temporal cutoff dates in prompts (Section 2.1.2) to prevent post-publication leakage, the authors acknowledge that 'the model disregards the imposed temporal constraints' during evaluation. Even with stronger wording, this is still essentially a suggestion and is not enforced. 2. Regarding the analysis of low recall: low recall is not necessarily bad, since a large amount of similar and redundant research exists nowadays. Even if the deep research a

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The problem addressed is real and timely: current deep research agents can produce long reports, but we lack an automatic and reasonably objective way to check whether those reports are actually faithful to sources. Turning this into a benchmark is a useful contribution to the agent-eval community. 2. The data construction pipeline is clear and defensible: start from human survey papers, parse LaTeX to get the ground-truth reference list, reverse-prompt the task, and explicitly constrain the

Weaknesses

1. The current definition of “report quality” is relatively narrow. The benchmark mostly measures faithfulness-to-sources and citation correctness, but does not touch discourse-level aspects that are also important for survey-like reports (organization into sections, synthesis of competing lines of work, articulation of open problems, or taxonomic clarity). Since the task itself is framed as “academic survey–style reporting”, it would help to state more explicitly that the benchmark only targets

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The data construction methodology is a clear strength of the paper. By restricting the corpus to post-2020, peer-reviewed, and officially published arXiv survey papers, the authors effectively filter out noise and ensure a high-quality baseline for both content and references. The use of GPT-4o for binary classification of survey papers and the robust extraction of references directly from LaTeX sources are technically sound choices. The prompt design stage also reflects careful consi

Weaknesses

1. The paper does not adequately discuss the potential limitations of using arXiv survey papers as gold standards. Survey papers inherently represent synthesis and interpretation rather than primary research, and different surveys on the same topic may emphasize different aspects or reach different conclusions, which may yield different 'gold ground-truth citation' lists. Evaluations based on matching cited articles are therefore inevitably biased. Additionally, the reverse prompt engineering appr

Code & Models

Datasets

ByteDance-BandAI/ReportBench
dataset· 78 dl
