Agent-as-a-Judge: Evaluate Agents with Agents
Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber

TL;DR
The paper introduces the Agent-as-a-Judge framework, enabling agentic systems to evaluate each other for more reliable and scalable self-improvement, demonstrated through a new code generation benchmark called DevAI.
Contribution
The paper presents the Agent-as-a-Judge framework, which extends earlier LLM-as-a-Judge approaches with agentic features for evaluation, and introduces the DevAI benchmark as a testbed.
Findings
Agent-as-a-Judge outperforms LLM-as-a-Judge as an evaluator of agentic systems.
It is as reliable as the human evaluation baseline.
Its effectiveness is demonstrated on DevAI's 55 AI development tasks.
Abstract
Contemporary evaluation techniques are inadequate for agentic systems. These approaches either focus exclusively on final outcomes -- ignoring the step-by-step nature of agentic systems, or require excessive manual labour. To address this, we introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems. This is an organic extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process. We apply the Agent-as-a-Judge to the task of code generation. To overcome issues with existing benchmarks and provide a proof-of-concept testbed for Agent-as-a-Judge, we present DevAI, a new benchmark of 55 realistic automated AI development tasks. It includes rich manual annotations, like a total of 365 hierarchical user requirements. We benchmark three of the popular agentic systems…
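The abstract describes tasks annotated with hierarchical user requirements. As a rough illustration only, the sketch below shows one way per-requirement verdicts from a judge could be aggregated when requirements depend on one another. The `Requirement` class, the `satisfaction_rates` function, the example requirements, and the two rates are all hypothetical names and assumptions made for this sketch; they are not taken from the paper or its code.

```python
from dataclasses import dataclass, field


@dataclass
class Requirement:
    """One user requirement (hypothetical structure, not the DevAI schema)."""
    rid: str
    text: str
    depends_on: list = field(default_factory=list)  # ids of prerequisite requirements


def satisfaction_rates(requirements, verdicts):
    """Return (independent_rate, dependency_aware_rate).

    independent: a requirement counts if its own verdict is True.
    dependency-aware: it counts only if it and every (transitive)
    prerequisite were judged True. Assumes the dependency graph is acyclic.
    """
    if not requirements:
        return 0.0, 0.0
    by_id = {r.rid: r for r in requirements}
    cache = {}

    def met_with_deps(rid):
        if rid not in cache:
            cache[rid] = verdicts.get(rid, False) and all(
                met_with_deps(dep) for dep in by_id[rid].depends_on
            )
        return cache[rid]

    n = len(requirements)
    independent = sum(bool(verdicts.get(r.rid, False)) for r in requirements) / n
    dependent = sum(met_with_deps(r.rid) for r in requirements) / n
    return independent, dependent


# Made-up example: three chained requirements and made-up judge verdicts.
reqs = [
    Requirement("R0", "Load the dataset"),
    Requirement("R1", "Train the model", depends_on=["R0"]),
    Requirement("R2", "Save evaluation metrics", depends_on=["R1"]),
]
verdicts = {"R0": True, "R1": False, "R2": True}

independent, dependent = satisfaction_rates(reqs, verdicts)
print(f"{independent:.2f} {dependent:.2f}")  # prints "0.67 0.33"
```

In this made-up case, R2 is judged satisfied on its own, but it does not count toward the dependency-aware rate because its prerequisite R1 failed. That gap is the kind of intermediate signal a final-outcome-only evaluation would miss.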
Taxonomy
Topics: Multi-Agent Systems and Negotiation
Methods: Focus
