Neither Valid nor Reliable? Investigating the Use of LLMs as Judges

Khaoula Chehbouni; Mohammed Haddou; Jackie Chi Kit Cheung; Golnoosh Farnadi

arXiv:2508.18076 · cs.CL · August 29, 2025

TL;DR

This paper critically examines the use of large language models as evaluators in natural language generation, highlighting their limitations and the need for more rigorous validation to ensure reliable and valid assessments.

Contribution

The paper provides a measurement-theoretic critique of large language models as judges (LLJs), identifying key assumptions and challenges in their use for natural language generation (NLG) evaluation across multiple applications.

Findings

01 LLJs may not reliably proxy human judgment

02 Current LLJ capabilities as evaluators are limited

03 More rigorous validation is necessary for responsible use

Abstract

Evaluating natural language generation (NLG) systems remains a core challenge of natural language processing (NLP), further complicated by the rise of large language models (LLMs) that aim to be general-purpose. Recently, large language models as judges (LLJs) have emerged as a promising alternative to traditional metrics, but their validity remains underexplored. This position paper argues that the current enthusiasm around LLJs may be premature, as their adoption has outpaced rigorous scrutiny of their reliability and validity as evaluators. Drawing on measurement theory from the social sciences, we identify and critically assess four core assumptions underlying the use of LLJs: their ability to act as proxies for human judgment, their capabilities as evaluators, their scalability, and their cost-effectiveness. We examine how each of these assumptions may be challenged by the…

Taxonomy

Topics: Legal Education and Practice Innovations · Law, Economics, and Judicial Systems · Legal Systems and Judicial Processes