Neither Valid nor Reliable? Investigating the Use of LLMs as Judges
Khaoula Chehbouni, Mohammed Haddou, Jackie Chi Kit Cheung, Golnoosh Farnadi

TL;DR
This paper critically examines the use of large language models as evaluators in natural language generation, highlighting their limitations and the need for more rigorous validation to ensure reliable and valid assessments.
Contribution
It provides a measurement-theoretic critique of large language models as judges (LLJs), identifying the key assumptions behind their use for NLG evaluation and the challenges to those assumptions across multiple applications.
Findings
LLJs may not reliably proxy human judgment
Current LLJ capabilities as evaluators are limited
More rigorous validation is necessary for responsible use
Abstract
Evaluating natural language generation (NLG) systems remains a core challenge of natural language processing (NLP), further complicated by the rise of large language models (LLMs) that aim to be general-purpose. Recently, large language models as judges (LLJs) have emerged as a promising alternative to traditional metrics, but their validity remains underexplored. This position paper argues that the current enthusiasm around LLJs may be premature, as their adoption has outpaced rigorous scrutiny of their reliability and validity as evaluators. Drawing on measurement theory from the social sciences, we identify and critically assess four core assumptions underlying the use of LLJs: their ability to act as proxies for human judgment, their capabilities as evaluators, their scalability, and their cost-effectiveness. We examine how each of these assumptions may be challenged by the…
Taxonomy
Topics: Legal Education and Practice Innovations · Law, Economics, and Judicial Systems · Legal Systems and Judicial Processes
