A Survey on LLM-as-a-Judge

Jiawei Gu; Xuhui Jiang; Zhichao Shi; Hexiang Tan; Xuehao Zhai; Chengjin Xu; Wei Li; Yinghan Shen; Shengjie Ma; Honghao Liu; Saizhuo Wang; Kun Zhang; Yuanzhuo Wang; Wen Gao; Lionel Ni; Jian Guo

arXiv:2411.15594 · cs.CL · October 21, 2025 · 23 cites

TL;DR

This survey reviews the emerging use of Large Language Models as evaluators, discussing strategies to improve their reliability, standardization, and applicability across diverse assessment scenarios.

Contribution

It provides a comprehensive overview of methods to enhance LLM-as-a-Judge systems, introduces a novel benchmark for reliability evaluation, and discusses practical deployment challenges.

Findings

01 Proposed methodologies for assessing LLM reliability.

02 Identified key challenges in standardization and bias mitigation.

03 Introduced a new benchmark for evaluating LLM-as-a-Judge systems.

Abstract

Accurate and consistent evaluation is crucial for decision-making across numerous fields, yet it remains a challenging task due to inherent subjectivity, variability, and scale. Large Language Models (LLMs) have achieved remarkable success across diverse domains, leading to the emergence of "LLM-as-a-Judge," where LLMs are employed as evaluators for complex tasks. With their ability to process diverse data types and provide scalable, cost-effective, and consistent assessments, LLMs present a compelling alternative to traditional expert-driven evaluations. However, ensuring the reliability of LLM-as-a-Judge systems remains a significant challenge that requires careful design and standardization. This paper provides a comprehensive survey of LLM-as-a-Judge, addressing the core question: How can reliable LLM-as-a-Judge systems be built? We explore strategies to enhance reliability,…

Code & Models

Datasets

BAAI/SurveyScope
dataset · 6 dl

Taxonomy

Topics: Dispute Resolution and Class Actions · Artificial Intelligence in Law · Legal Education and Practice Innovations