Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?
Leyao Wang, Yanan He, Peng Chen, Asaf Yehudai, Yixin Liu, Rex Ying, Michal Shmueli-Scheuer, Arman Cohan

TL;DR
This paper introduces REFLECT, a benchmark for evaluating how reliably LLM-based judges assess complex research agents. It finds that current judges detect agent failures with below 55% accuracy.
Contribution
The paper presents a detailed taxonomy and a controlled evaluation framework for assessing the fine-grained failure detection capabilities of LLM judges in open-ended research tasks.
Findings
Current LLM judges achieve below 55% accuracy in failure detection.
Judges perform poorly on evidence-verification tasks.
The limitations in judge reliability are systematic rather than isolated.
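To make the accuracy figure above concrete, here is a minimal sketch of how failure-detection accuracy could be scored when each agent trace carries a known gold label. It is an illustration only, not REFLECT's actual scoring code: the record layout, the `detection_accuracy` helper, and the category names `evidence_verification` and `reasoning` are all assumptions, and the toy data is invented.

```python
from collections import defaultdict


def detection_accuracy(records):
    """Score judge verdicts against gold failure labels.

    `records` is an iterable of (category, gold_has_failure, judge_flagged)
    tuples. This layout is hypothetical, not the benchmark's real format.
    Returns (overall_accuracy, {category: accuracy}).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, gold, predicted in records:
        totals[category] += 1
        if gold == predicted:
            correct[category] += 1
    if not totals:
        raise ValueError("no records to score")
    per_category = {c: correct[c] / totals[c] for c in totals}
    overall = sum(correct.values()) / sum(totals.values())
    return overall, per_category


# Invented toy data: the judge misses one evidence-verification failure.
records = [
    ("evidence_verification", True, False),
    ("evidence_verification", True, True),
    ("reasoning", True, True),
    ("reasoning", False, False),
]
overall, per_category = detection_accuracy(records)
print(overall, per_category)
```

Breaking accuracy down by category, as this sketch does, is one way a weakness specific to evidence verification would show up separately from the overall number.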
Abstract
Deep research agents increasingly automate complex information-seeking tasks, producing evidence-grounded reports via multi-step reasoning, tool use, and synthesis. Their growing role demands scalable, reliable evaluation, positioning LLM-as-judge as a supervision paradigm for assessing factual accuracy, evidence use, and reasoning quality. Yet the reliability of these judges for deep research agents remains poorly understood, posing a critical meta-evaluation problem: before deploying LLM judges to supervise research agents, we must first evaluate the judges themselves. Existing meta-evaluations fall short in two ways: (1) reliance on coarse, subjective human-preference agreement; (2) focus on instruction-following or verifiable tasks, leaving open-ended agent executions unexplored. To address these gaps, we introduce REFLECT (REliable Fine-grained LLM judge Evaluation via Controlled…
