Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng; Wei-Lin Chiang; Ying Sheng; Siyuan Zhuang; Zhanghao Wu; Yonghao Zhuang; Zi Lin; Zhuohan Li; Dacheng Li; Eric P. Xing; Hao Zhang; Joseph E. Gonzalez; Ion Stoica

arXiv:2306.05685 · cs.CL · December 27, 2023 · 442 cites

TL;DR

This paper investigates using large language models like GPT-4 as scalable, explainable judges for evaluating chat assistants, demonstrating high agreement with human preferences through new benchmarks and addressing limitations of LLM-based evaluation.

Contribution

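The paper introduces two benchmarks, MT-bench (a multi-turn question set) and Chatbot Arena (a crowdsourced battle platform), to verify how well LLM judges agree with human preferences. It shows that strong LLM judges can match those preferences and complement traditional evaluation methods.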

Findings

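01 GPT-4 as a judge agrees with human preferences in over 80% of cases, the same level of agreement observed between humans.

02 LLM judges can evaluate answers to open-ended questions and compare model variants.

03 The proposed solutions mitigate some of the biases and limitations of LLM-based evaluation (position, verbosity, and self-enhancement biases; limited reasoning ability).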
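The agreement figure in the first finding can be read as the fraction of pairwise comparisons on which the judge and a human chose the same outcome. The sketch below illustrates that reading only. The function name `agreement_rate`, the vote labels, and the sample votes are hypothetical and are not taken from the paper, which may count ties or aggregate votes differently.

```python
from typing import Sequence


def agreement_rate(judge_votes: Sequence[str], human_votes: Sequence[str]) -> float:
    """Return the fraction of comparisons where both voters chose the same outcome."""
    if len(judge_votes) != len(human_votes):
        raise ValueError("vote lists must have the same length")
    if not judge_votes:
        raise ValueError("need at least one comparison")
    matches = sum(j == h for j, h in zip(judge_votes, human_votes))
    return matches / len(judge_votes)


# Hypothetical votes: "A" or "B" names the preferred response, "tie" means no preference.
judge = ["A", "B", "A", "tie", "B"]
human = ["A", "B", "B", "tie", "B"]

rate = agreement_rate(judge, human)
print(f"{rate:.2f}")  # 4 of 5 votes match -> 0.80
```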

Abstract

Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set; and Chatbot Arena, a crowdsourced battle platform. Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans. Hence, LLM-as-a-judge…
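The abstract names position bias but does not spell out the mitigations. One common approach is to query the judge twice with the two answers swapped and to accept a winner only when both passes agree. The sketch below assumes that approach. The `Judge` signature, the verdict labels, and the two toy judges are hypothetical stand-ins for a real LLM call.

```python
from typing import Callable

# Hypothetical judge signature: (question, first_answer, second_answer)
# returns "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def judge_with_swap(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Query the judge in both answer orders; return "A", "B", or "tie"."""
    pass_ab = judge(question, answer_a, answer_b)  # A is shown first
    pass_ba = judge(question, answer_b, answer_a)  # B is shown first

    # Translate each positional verdict back to the answer it refers to.
    verdict_ab = {"first": "A", "second": "B", "tie": "tie"}[pass_ab]
    verdict_ba = {"first": "B", "second": "A", "tie": "tie"}[pass_ba]

    # Conservative rule: an inconsistent pair of verdicts counts as a tie.
    return verdict_ab if verdict_ab == verdict_ba else "tie"


def always_first(question: str, first: str, second: str) -> str:
    """Toy judge with extreme position bias: always prefers the first slot."""
    return "first"


def longer_wins(question: str, first: str, second: str) -> str:
    """Toy judge that ignores position and prefers the longer answer."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


biased_result = judge_with_swap(always_first, "q", "short", "a much longer answer")
stable_result = judge_with_swap(longer_wins, "q", "short", "a much longer answer")
print(biased_result, stable_result)  # tie B
```

With this rule, a judge whose verdict depends only on position never produces a winner, while a judge whose verdict is independent of position is unaffected.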

Code & Models

Videos

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena · SlidesLive

Taxonomy

Topics: Topic Modeling · Speech and dialogue systems · AI in Service Interactions

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Residual Connection · Dense Connections · Dropout · Byte Pair Encoding · Softmax