Agent-as-a-Judge: Evaluate Agents with Agents

Mingchen Zhuge; Changsheng Zhao; Dylan Ashley; Wenyi Wang; Dmitrii Khizbullin; Yunyang Xiong; Zechun Liu; Ernie Chang; Raghuraman Krishnamoorthi; Yuandong Tian; Yangyang Shi; Vikas Chandra; Jürgen Schmidhuber

arXiv:2410.10934·cs.AI·October 18, 2024·3 cites

Open Access · 1 Repo · 1 Dataset

TL;DR

The paper introduces the Agent-as-a-Judge framework, in which agentic systems evaluate other agentic systems to support more reliable and scalable self-improvement, and demonstrates it on DevAI, a new code generation benchmark.

Contribution

It presents the Agent-as-a-Judge framework, which extends earlier LLM-as-a-Judge approaches with agentic features for evaluation, and introduces the DevAI benchmark as a testbed.

Findings

01

Agent-as-a-Judge outperforms LLM-as-a-Judge at evaluating agentic systems.

02

It is as reliable as the human evaluation baseline.

03

The framework is demonstrated on the 55 AI development tasks of DevAI.

Abstract

Contemporary evaluation techniques are inadequate for agentic systems. These approaches either focus exclusively on final outcomes -- ignoring the step-by-step nature of agentic systems, or require excessive manual labour. To address this, we introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems. This is an organic extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process. We apply the Agent-as-a-Judge to the task of code generation. To overcome issues with existing benchmarks and provide a proof-of-concept testbed for Agent-as-a-Judge, we present DevAI, a new benchmark of 55 realistic automated AI development tasks. It includes rich manual annotations, like a total of 365 hierarchical user requirements. We benchmark three of the popular agentic systems…
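
The abstract describes DevAI's annotations as hierarchical user requirements. One plausible reading, and it is only an assumption here, is that a requirement can depend on earlier ones, so a judge's verdict on it should count only if its prerequisites were also met. The Python sketch below illustrates that idea. The `Requirement` class, its field names, the example requirements, and the verdicts are all hypothetical and are not taken from the DevAI dataset or the official repository.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Requirement:
    """One user requirement (hypothetical schema, not DevAI's actual format)."""
    req_id: str
    description: str
    prerequisites: List[str] = field(default_factory=list)


def satisfied_with_dependencies(
    requirements: List[Requirement], verdicts: Dict[str, bool]
) -> Set[str]:
    """Return ids judged satisfied whose prerequisites are all satisfied too.

    Assumes the prerequisite graph is acyclic.
    """
    by_id = {r.req_id: r for r in requirements}
    cache: Dict[str, bool] = {}

    def ok(req_id: str) -> bool:
        if req_id not in cache:
            req = by_id[req_id]
            # A missing verdict is treated as "not satisfied".
            cache[req_id] = verdicts.get(req_id, False) and all(
                ok(p) for p in req.prerequisites
            )
        return cache[req_id]

    return {r.req_id for r in requirements if ok(r.req_id)}


# Illustrative task with four made-up requirements.
requirements = [
    Requirement("R0", "Load the dataset"),
    Requirement("R1", "Train the model", prerequisites=["R0"]),
    Requirement("R2", "Save a loss plot", prerequisites=["R1"]),
    Requirement("R3", "Write a README"),
]

# Made-up per-requirement verdicts, as a judge might emit them.
verdicts = {"R0": True, "R1": False, "R2": True, "R3": True}

independent_rate = sum(verdicts.values()) / len(requirements)
dependent_ids = satisfied_with_dependencies(requirements, verdicts)
dependent_rate = len(dependent_ids) / len(requirements)

print(f"independent: {independent_rate:.2f}, with dependencies: {dependent_rate:.2f}")
```

In this made-up example, R2 is judged satisfied on its own but depends on the failed R1, so the dependency-aware rate is lower than the independent rate. Scoring intermediate requirements in this way is one way to reflect the step-by-step nature of agentic work that the abstract says outcome-only evaluation ignores.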

Code & Models

Repositories

metauto-ai/agent-as-a-judge
Official

Datasets

DEVAI-benchmark/DEVAI
dataset · 243 dl

Taxonomy

Topics: Multi-Agent Systems and Negotiation

Methods: Focus