Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

Leyao Wang; Yanan He; Peng Chen; Asaf Yehudai; Yixin Liu; Rex Ying; Michal Shmueli-Scheuer; Arman Cohan

arXiv:2605.19196·cs.CL·May 20, 2026

TL;DR

This paper introduces REFLECT, a benchmark for evaluating the reliability of LLM-based judges in assessing complex research agents, revealing significant limitations in current models' accuracy.

Contribution

The paper presents a detailed taxonomy and a controlled evaluation framework for assessing the fine-grained failure detection capabilities of LLM judges in open-ended research tasks.

Findings

01 Current LLM judges achieve below 55% accuracy in failure detection.

02 Judges perform poorly on evidence verification tasks.

03 Systematic limitations in judge reliability are identified.
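
As a rough illustration of what a failure-detection accuracy figure measures, the sketch below scores a judge's predicted failure labels against gold labels, overall and per category. It is a minimal sketch under stated assumptions, not the paper's evaluation protocol: the function names and the label strings (`unsupported_claim`, `wrong_citation`, `none`) are hypothetical and are not taken from the REFLECT taxonomy, which this summary does not spell out.

```python
from collections import defaultdict


def failure_detection_accuracy(gold, predicted):
    """Fraction of items where the judge's label matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("at least one labeled item is required")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def per_category_accuracy(gold, predicted):
    """Accuracy broken down by gold failure category."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for g, p in zip(gold, predicted):
        totals[g] += 1
        hits[g] += int(g == p)
    return {category: hits[category] / totals[category] for category in totals}


# Hypothetical labels for four agent reports (illustrative only).
gold = ["unsupported_claim", "none", "wrong_citation", "none"]
predicted = ["unsupported_claim", "none", "none", "wrong_citation"]

overall = failure_detection_accuracy(gold, predicted)
by_category = per_category_accuracy(gold, predicted)
print(overall)       # 0.5
print(by_category)
```

Splitting accuracy by category is one way a weakness on a specific skill, such as evidence verification, could become visible even when the overall figure looks acceptable.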

Abstract

Deep research agents increasingly automate complex information-seeking tasks, producing evidence-grounded reports via multi-step reasoning, tool use, and synthesis. Their growing role demands scalable, reliable evaluation, positioning LLM-as-judge as a supervision paradigm for assessing factual accuracy, evidence use, and reasoning quality. Yet the reliability of these judges for deep research agents remains poorly understood, posing a critical meta-evaluation problem: before deploying LLM judges to supervise research agents, we must first evaluate the judges themselves. Existing meta-evaluations fall short in two ways: (1) reliance on coarse, subjective human-preference agreement; (2) focus on instruction-following or verifiable tasks, leaving open-ended agent executions unexplored. To address these gaps, we introduce REFLECT (REliable Fine-grained LLM judge Evaluation via Controlled…
