DREAM: Deep Research Evaluation with Agentic Metrics

Elad Ben Avraham; Changhao Li; Ron Dorfman; Roy Ganz; Oren Nuriel; Amir Dudai; Aviad Aberdam; Noah Flynn; Elman Mansimov; Adi Kalyanpur; Ron Litman

arXiv:2602.18940 · cs.AI · February 24, 2026

Open Access

TL;DR

DREAM introduces an agentic evaluation framework for research agents that improves assessment of factual accuracy and temporal validity through adaptive, tool-enabled metrics, surpassing existing static benchmarks.

Contribution

The paper presents DREAM, a novel agentic evaluation framework that enhances research report assessment by incorporating tool-use capabilities for better factual and temporal validation.

Findings

01. DREAM is more sensitive to factual decay than existing benchmarks.

02. The framework enables temporally aware and grounded evaluation.

03. DREAM offers a scalable, reference-free assessment method.

Abstract

Deep Research Agents generate analyst-grade reports, yet evaluating them remains challenging due to the absence of a single ground truth and the multidimensional nature of research quality. Recent benchmarks propose distinct methodologies, yet they suffer from the Mirage of Synthesis, where strong surface-level fluency and citation alignment can obscure underlying factual and reasoning defects. We characterize this gap by introducing a taxonomy across four verticals that exposes a critical capability mismatch: static evaluators inherently lack the tool-use capabilities required to assess temporal validity and factual correctness. To address this, we propose DREAM (Deep Research Evaluation with Agentic Metrics), a framework that instantiates the principle of capability parity by making evaluation itself agentic. DREAM structures assessment through an evaluation protocol combining…

Taxonomy

Topics: Scientific Computing and Data Management · Topic Modeling · Machine Learning in Materials Science