Auto-Arena: Automating LLM Evaluations with Agent Peer Battles and Committee Discussions

Ruochen Zhao; Wenxuan Zhang; Yew Ken Chia; Weiwen Xu; Deli Zhao; Lidong Bing

arXiv:2405.20267·cs.CL·October 8, 2024·1 cites

TL;DR

Auto-Arena is an automated evaluation framework for LLMs that uses agent peer battles and committee discussions, achieving high correlation with human preferences and reducing manual effort.

Contribution

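The paper introduces a fully automated LLM evaluation method that combines multi-round peer battles between candidate models with collaborative judging by a committee of LLMs, aiming to avoid both the unreliability of static benchmarks and the manual effort of human voting.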

Findings

01 92.14% correlation with human preferences
02 Outperforms previous expert-annotated benchmarks
03 Reduces manual evaluation effort

Abstract

As LLMs continuously evolve, there is an urgent need for a reliable evaluation method that delivers trustworthy results promptly. Currently, static benchmarks suffer from inflexibility and unreliability, leading users to prefer human voting platforms like Chatbot Arena. However, human evaluations require significant manual effort. To address this, we propose the Auto-Arena, an innovative framework that automates the entire evaluation process using LLM-powered agents. Firstly, an LLM examiner generates questions. Then, two LLM candidates engage in a multi-round peer battle based on individual questions, aiming at revealing their true performance differences. Finally, a committee of LLM judges collaboratively discusses and decides the winner, reducing bias and enhancing fairness. During the peer battles, we observe intriguing scenarios where the LLM candidates display competitive…
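The three stages described in the abstract (question generation, peer battle, committee decision) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `LLM` callable type, the function names, the prompt strings, the single round of discussion, and the majority vote are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical type: any function mapping a prompt string to a reply string.
LLM = Callable[[str], str]


def peer_battle(question: str, a: LLM, b: LLM, rounds: int = 2) -> List[str]:
    """Let two candidates alternate turns on one question; return the transcript."""
    transcript = [f"Question: {question}"]
    for _ in range(rounds):
        for name, model in (("A", a), ("B", b)):
            # Each candidate sees the full transcript so far before replying.
            reply = model("\n".join(transcript))
            transcript.append(f"{name}: {reply}")
    return transcript


def committee_verdict(transcript: List[str], judges: List[LLM]) -> str:
    """Judges vote, see each other's votes once, then vote again.

    The majority vote is an assumption of this sketch; on a tie,
    Counter.most_common returns the label that was counted first.
    """
    context = "\n".join(transcript)
    initial_votes = [judge(context) for judge in judges]
    discussion = context + "\nInitial votes: " + ", ".join(initial_votes)
    final_votes = [judge(discussion) for judge in judges]
    return Counter(final_votes).most_common(1)[0][0]


def auto_arena(examiner: LLM, a: LLM, b: LLM, judges: List[LLM], topic: str) -> str:
    """Run one question through examiner, peer battle, and committee."""
    question = examiner(f"Write one question about {topic}.")
    transcript = peer_battle(question, a, b)
    return committee_verdict(transcript, judges)
```

In practice each `LLM` callable would wrap a call to a real model; for a dry run, simple stub functions that return fixed strings are enough to exercise the control flow. Using an odd number of judges avoids ties in the final vote.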

Code & Models

Repositories

Auto-Arena/Auto-Arena-LLMs (Official)

Videos

Auto-Arena: Automating LLM Evaluations with Agent Peer Battles and Committee Discussions

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Artificial Intelligence in Law