ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks
Minghao Li, Ying Zeng, Zhihao Cheng, Cong Ma, Kai Jia

TL;DR
ReportBench is a comprehensive benchmark for evaluating the quality, relevance, and factual accuracy of research reports generated by large language models, using domain-specific prompts and automated analysis based on high-quality survey papers.
Contribution
This paper introduces ReportBench, a systematic benchmark with an automated framework for assessing the quality of LLM-generated research reports, focusing on citation relevance and factual faithfulness.
Findings
Commercial research agents outperform standalone LLMs in report quality.
There is significant room for improvement in factual consistency and coverage.
The automated evaluation framework effectively analyzes citation and statement accuracy.
Abstract
The advent of Deep Research agents has substantially reduced the time required for conducting extensive research tasks. However, these tasks inherently demand rigorous standards of factual accuracy and comprehensiveness, necessitating thorough evaluation before widespread adoption. In this paper, we propose ReportBench, a systematic benchmark designed to evaluate the content quality of research reports generated by large language models (LLMs). Our evaluation focuses on two critical dimensions: (1) the quality and relevance of cited literature, and (2) the faithfulness and veracity of the statements within the generated reports. ReportBench leverages high-quality published survey papers available on arXiv as gold-standard references, from which we apply reverse prompt engineering to derive domain-specific prompts and establish a comprehensive evaluation corpus. Furthermore, we develop…
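The first evaluation dimension, the quality and relevance of cited literature, can be illustrated by comparing a generated report's citations against a survey's reference list. The sketch below is only an illustration under stated assumptions: matching by normalized title, the function names `normalize_title` and `citation_overlap`, and the placeholder titles are hypothetical and are not taken from the paper, whose exact matching procedure is not given in this excerpt.

```python
import re


def normalize_title(title: str) -> str:
    """Lowercase a title and collapse punctuation/whitespace (assumed matching rule)."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def citation_overlap(generated: list[str], gold: list[str]) -> dict[str, float]:
    """Compute precision and recall of generated citations against a gold reference list.

    precision = matched / number of distinct generated citations
    recall    = matched / number of distinct gold references
    """
    gen = {normalize_title(t) for t in generated if t.strip()}
    ref = {normalize_title(t) for t in gold if t.strip()}
    matched = len(gen & ref)
    return {
        "precision": matched / len(gen) if gen else 0.0,
        "recall": matched / len(ref) if ref else 0.0,
    }


# Placeholder titles, purely illustrative.
gold_refs = [
    "Placeholder Survey Reference One",
    "Placeholder Survey Reference Two",
    "Placeholder Survey Reference Three",
    "Placeholder Survey Reference Four",
]
report_refs = [
    "placeholder survey reference one.",   # matches after normalization
    "Placeholder Survey Reference Three",  # exact match
    "An Unrelated Placeholder Title",      # not in the gold list
]

scores = citation_overlap(report_refs, gold_refs)
print(scores)
```

Here two of the three generated citations appear in the gold list (precision 2/3) and two of the four gold references are recovered (recall 1/2). Title matching is a deliberately simple choice; a real pipeline would likely need more robust identifiers, since titles can vary across versions of the same paper.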
Peer Reviews
Decision: Submitted to ICLR 2026
1. The paper focuses on a practical use case of deep research agents, namely writing survey papers. This task is easier to evaluate and, given the current limitations of deep research agents, could be a major use case of such agents in scientific research. Without such tools, completing a survey paper would take significant manual effort (days or weeks). 2. Using LLMs to convert each paper back into a research question made data collection efficient, more so than experts summarizing the
1. While the paper includes temporal cutoff dates in prompts (Section 2.1.2) to prevent post-publication leakage, the authors acknowledge that 'the model disregards the imposed temporal constraints' during evaluation. Even with stronger wording, this is still essentially a suggestion rather than an enforced constraint. 2. Regarding the analysis of low recall: low recall is not necessarily bad, as a large amount of similar and redundant research exists nowadays. Even if the deep research a
1. The problem addressed is real and timely: current deep research agents can produce long reports, but we lack an automatic and reasonably objective way to check whether those reports are actually faithful to sources. Turning this into a benchmark is a useful contribution to the agent-eval community. 2. The data construction pipeline is clear and defensible: start from human survey papers, parse LaTeX to get the ground-truth reference list, reverse-prompt the task, and explicitly constrain the
1. The current definition of “report quality” is relatively narrow. The benchmark mostly measures faithfulness-to-sources and citation correctness, but does not touch discourse-level aspects that are also important for survey-like reports (organization into sections, synthesis of competing lines of work, articulation of open problems, or taxonomic clarity). Since the task itself is framed as “academic survey–style reporting”, it would help to state more explicitly that the benchmark only targets
1. The methodology of the data construction process is a clear strength of the paper. By restricting the corpus to post-2020, peer-reviewed, and officially published arXiv survey papers, they effectively filter out noise and ensure a high-quality baseline for both content and references. The use of GPT-4o for binary classification of survey papers and the robust extraction of references directly from LaTeX sources are technically sound choices. The prompt design stage also reflects careful consi
1. The paper does not adequately discuss potential limitations of using arXiv survey papers as gold standards. Survey papers inherently represent synthesis and interpretation rather than primary research, and different surveys on the same topic may emphasize different aspects or reach different conclusions, which may lead to different 'gold' ground-truth citation lists. Evaluations based on matching cited articles are therefore inevitably biased. Additionally, the reverse prompt engineering appr
