Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

Aman Singh Thakur; Kartik Choudhary; Venkat Srinik Ramayapally; Sankaran Vaidyanathan; Dieuwke Hupkes

arXiv:2406.12624·cs.CL·August 19, 2025·6 cites

TL;DR

This study evaluates the effectiveness and vulnerabilities of large language models acting as judges for other models' outputs, revealing their limitations in alignment with human judgment and highlighting potential biases and weaknesses.

Contribution

It provides a comprehensive analysis of LLMs as judges, identifying their strengths, weaknesses, and vulnerabilities, and emphasizes the need for careful use in complex evaluation scenarios.

Findings

01

Only the largest models achieve reasonable alignment with humans.

02

Even the best judge models may assign scores that differ from human-assigned scores by up to 5 points.

03

Vulnerabilities include sensitivity to prompt complexity and a tendency toward leniency.

Abstract

Offering a promising solution to the scalability challenges associated with human evaluation, the LLM-as-a-judge paradigm is rapidly gaining traction as an approach to evaluating large language models (LLMs). However, there are still many open questions about the strengths and weaknesses of this paradigm, and what potential biases it may hold. In this paper, we present a comprehensive study of the performance of various LLMs acting as judges, focusing on a clean scenario in which inter-human agreement is high. Investigating thirteen judge models of different model sizes and families, judging answers of nine different 'exam-taker models' - both base and instruction-tuned - we find that only the best (and largest) models achieve reasonable alignment with humans. However, they are still quite far behind inter-human agreement and their assigned scores may still differ by up to 5 points…

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 3 · Confidence 4

Strengths

1. The provided analysis of the LLM judges' sensitivity to prompts, error types, lack of robustness, and leniency bias is interesting and valuable to future studies.
2. The paper is well-written and the findings are clearly presented.

Weaknesses

It appears that some of the main findings in this work are either not well-supported, may lack generalizability, or have been discussed in previous work.
1. Lack of generalizability: The task setting of the LLM judges selected in this work is reference-based evaluation of QA, which differs from the common application scenario where LLM judges evaluate various tasks without a gold reference (e.g., AlpacaEval, Arena Hard). Access to gold references makes the evaluation task significantly easier.

Reviewer 02 · Rating 3 · Confidence 3

Strengths

- They focus on a specific scenario with high inter-human agreement which is an attempt to isolate the judge model behavior from task ambiguity.
- Several dimensions are explored: 1) model sizes and families, 2) multiple metrics, 3) error analysis provided.
- Insights such as "smaller models can rank exam-takers as effectively as larger ones", and the attempted explanation that "chat models may 'unlearn' some knowledge during alignment";
- The work also provides some recommendations for pra
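To make the quoted insight about ranking concrete, here is a minimal sketch of how a judge can rank exam-takers perfectly while still misjudging their absolute scores. It assumes Spearman's rank correlation as the ranking measure, which this excerpt does not confirm, and all scores are hypothetical values invented for illustration.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Hypothetical accuracy of four exam-taker models (not from the paper).
human_scores = [0.80, 0.65, 0.50, 0.30]
# A hypothetical lenient judge inflates every score but keeps the order.
judge_scores = [0.90, 0.80, 0.70, 0.55]

rho = spearman(human_scores, judge_scores)
max_gap = max(abs(h - j) for h, j in zip(human_scores, judge_scores))
```

In this toy case `rho` is 1.0 while the largest per-model score gap is 0.25, so a high ranking correlation on its own says little about how closely the judge's scores track human scores.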

Weaknesses

- The scope remains limited to TriviaQA. For "short, factual" answers, consider adding the "LLMBar" datasets, which have high human agreement rates > 90%. Sufficient examples can be used according to your dataset selection criteria [1]. Without the inclusion of additional datasets [1], it remains unclear how well the ranking ability would transfer.
- The original claim (lines 316-318) about judge performance being worse at identifying correct answers could be an artifact of including metrics tha

Reviewer 03 · Rating 5 · Confidence 4

Strengths

* The explanation of why Scott's Pi should be used instead of Kappa in judging-the-judges scenarios is a significant contribution that will benefit future researchers.
* The comprehensive analysis across multiple dimensions (alignment metrics, ranking correlation, error analysis) provides valuable insights into the strengths and limitations of different judge models.
* The comparison between LLM judges and lexical judges (EM, Contains) offers a novel and important perspective. This insight becom
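The distinction between Scott's Pi and Cohen's Kappa mentioned above can be illustrated with a small sketch. Both correct observed agreement for chance, but Kappa estimates chance agreement from each rater's own label distribution, whereas Scott's Pi uses the pooled distribution of both raters. The labels below are hypothetical, and this excerpt does not state the paper's own argument for preferring Pi, so the example only shows how the two statistics diverge when a judge is more lenient than the human annotator.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which the two raters give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Chance agreement uses each rater's own marginal label frequencies."""
    n = len(a)
    po = percent_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    pe = sum((ca[k] / n) * (cb[k] / n) for k in set(a) | set(b))
    return (po - pe) / (1 - pe)


def scotts_pi(a, b):
    """Chance agreement uses the pooled label frequencies of both raters."""
    n = len(a)
    po = percent_agreement(a, b)
    pooled = Counter(a) + Counter(b)
    pe = sum((count / (2 * n)) ** 2 for count in pooled.values())
    return (po - pe) / (1 - pe)


# Hypothetical labels: 1 = "correct", 0 = "incorrect".
human = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
# A hypothetical lenient judge accepts three answers the human rejected.
judge = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

po = percent_agreement(human, judge)   # 0.7
kappa = cohens_kappa(human, judge)     # (0.7 - 0.5) / 0.5
pi = scotts_pi(human, judge)           # (0.7 - 0.545) / 0.455
```

With these labels Kappa is 0.4 and Pi is roughly 0.34. Pi comes out lower because the judge's label distribution (80% "correct") differs from the human's (50% "correct"), and pooling the two raises the estimated chance agreement.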

Weaknesses

* The evaluation relies solely on TriviaQA, making it difficult to deconfound the root cause: whether the best model's performance stems from better alignment, knowledge of TriviaQA content, or simply being favored by other LLMs. Other unusual findings may also be specific to TriviaQA: in Figure 1.a, EM's instability compared to Contains likely results from references providing multiple correct answers.
* The paper lacks sufficient content for an ICLR long paper. I suggest expanding the scope by
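For readers unfamiliar with the two lexical judges named in this review, the following is a minimal sketch of exact match (EM) and Contains against a list of reference answers. The normalization (lowercasing, stripping punctuation and English articles) is an assumption borrowed from common QA evaluation practice, not the paper's confirmed procedure, and the example strings are hypothetical.

```python
import re
import string


def normalize(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(answer, references):
    """True if the normalized answer equals any normalized reference."""
    ans = normalize(answer)
    return any(ans == normalize(ref) for ref in references)


def contains(answer, references):
    """True if any non-empty normalized reference is a substring of the answer."""
    ans = normalize(answer)
    normalized_refs = (normalize(ref) for ref in references)
    return any(ref and ref in ans for ref in normalized_refs)


# Hypothetical question with several acceptable reference answers.
references = ["Paris", "Paris, France"]

verbose = "The capital is Paris."
em_verbose = exact_match(verbose, references)        # False: extra words
contains_verbose = contains(verbose, references)     # True: reference is inside

# Substring matching can also accept answers that merely embed a reference.
contains_spurious = contains("Parisian suburbs", references)  # True
```

EM rejects a correct but verbose answer, while Contains accepts it and also accepts an answer that only happens to embed a reference string, so the two lexical judges fail in opposite directions.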

Reviewer 04 · Rating 5 · Confidence 4

Strengths

* The paper is clear and nicely visualizes the relevant findings
* The authors explore a dozen models as judges
* The authors use manual annotation to carefully unpack the judge behavior, especially the observation about judge leniency

Weaknesses

* As a reader I'm having difficulties understanding the overarching goals of the paper. Typically researchers use LLMs as a judge for longer form, more subjective questions, where answer coherence, style, and correctness are all part of the judgement. But for TriviaQA, the chosen dataset, the questions have clear, short answers with reference documents, meaning Exact Match is already a strong metric. Here, humans are simply reporting the binary value “correct” vs “incorrect” on the model answers

Reviewer 05 · Rating 8 · Confidence 4

Strengths

* Thorough and timely study
* Several interesting experiments

Weaknesses

* I would have liked to see more datasets; in fact, I would suggest reducing the number of exam-takers (e.g., I find the base models less interesting) and using different datasets

Code & Models

Repositories

UMass-Meta-LLM-Eval/llm_eval · Official

Videos

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

Taxonomy

Topics: Law, Economics, and Judicial Systems · Legal Education and Practice Innovations · Legal Systems and Judicial Processes

Methods: Residual Connection · Softmax · Balanced Selection · Layer Normalization · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Linear Layer · Multi-Head Attention