Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica

TL;DR
This paper investigates using large language models like GPT-4 as scalable, explainable judges for evaluating chat assistants, demonstrating high agreement with human preferences through new benchmarks and addressing limitations of LLM-based evaluation.
Contribution
It introduces MT-bench and Chatbot Arena benchmarks to validate LLMs as judges, showing they can reliably match human preferences and complement traditional evaluation methods.
Findings
GPT-4 as a judge agrees with human preferences more than 80% of the time, comparable to the level of agreement between humans.
LLM judges effectively evaluate open-ended questions and model variants.
Proposed solutions mitigate biases and limitations of LLM-based evaluation.
Abstract
Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set; and Chatbot Arena, a crowdsourced battle platform. Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans. Hence, LLM-as-a-judge…
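The two ideas in the abstract that lend themselves to a concrete illustration are the judge-human agreement rate and the mitigation of position bias. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: agreement is taken to be the fraction of pairwise comparisons on which two voters pick the same outcome, and position bias is handled by querying the judge twice with the answer order swapped and recording a tie when the two verdicts conflict. The names `agreement_rate`, `swap_consistent_vote`, and the `judge` callable signature are hypothetical.

```python
from typing import Callable, Sequence

# A vote on a pairwise comparison: "A" wins, "B" wins, or "tie".
Vote = str

# Hypothetical judge signature: (question, first_answer, second_answer) -> Vote,
# where "A" means the first-shown answer wins and "B" the second-shown one.
Judge = Callable[[str, str, str], Vote]


def agreement_rate(judge_votes: Sequence[Vote], human_votes: Sequence[Vote]) -> float:
    """Fraction of comparisons on which the two voters chose the same outcome."""
    if len(judge_votes) != len(human_votes):
        raise ValueError("vote lists must have the same length")
    if not judge_votes:
        raise ValueError("at least one comparison is required")
    matches = sum(j == h for j, h in zip(judge_votes, human_votes))
    return matches / len(judge_votes)


def swap_consistent_vote(judge: Judge, question: str, answer_a: str, answer_b: str) -> Vote:
    """Query the judge in both answer orders; keep the verdict only if it is consistent."""
    first = judge(question, answer_a, answer_b)
    second = judge(question, answer_b, answer_a)
    # Translate the second verdict back into the original A/B labelling.
    second_relabelled = {"A": "B", "B": "A", "tie": "tie"}[second]
    return first if first == second_relabelled else "tie"


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    judge_votes = ["A", "B", "tie", "A", "B"]
    human_votes = ["A", "B", "A", "A", "B"]
    print(agreement_rate(judge_votes, human_votes))

    # A judge that always prefers whichever answer is shown first is position-biased.
    def first_position_judge(question: str, first: str, second: str) -> Vote:
        return "A"

    print(swap_consistent_vote(first_position_judge, "q", "answer one", "answer two"))
```

A judge whose verdict depends only on presentation order ends up with a tie under this scheme, so the bias cannot inflate either model's win count. The cost is two judge calls per comparison.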
Code & Models
- 🤗 google/shieldgemma-2b (model) · 7.2k downloads · ♡ 111
- 🤗 Sahabat-AI/Llama-Sahabat-AI-v2-70B-IT (model) · 110 downloads · ♡ 13
- 🤗 ChiKoi7/stablelm-zephyr-3b-Heretic-GGUF (model) · 170 downloads · ♡ 2
- 🤗 lmsys/vicuna-13b-delta-v0 (model) · 530 downloads · ♡ 452
- 🤗 lmsys/vicuna-7b-delta-v0 (model) · 231 downloads · ♡ 165
- 🤗 lmsys/vicuna-7b-delta-v1.1 (model) · 1.0k downloads · ♡ 200
- 🤗 lmsys/vicuna-13b-delta-v1.1 (model) · 906 downloads · ♡ 409
- 🤗 lmsys/vicuna-13b-v1.1 (model) · 1.1k downloads · ♡ 100
- 🤗 lmsys/vicuna-7b-v1.1 (model) · 5.3k downloads · ♡ 77
- 🤗 lmsys/vicuna-7b-v1.3 (model) · 12k downloads · ♡ 140
Taxonomy
Topics: Topic Modeling · Speech and dialogue systems · AI in Service Interactions
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Residual Connection · Dense Connections · Dropout · Byte Pair Encoding · Softmax
