Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, Dieuwke Hupkes

TL;DR
This study evaluates the effectiveness and vulnerabilities of large language models acting as judges for other models' outputs, revealing their limitations in alignment with human judgment and highlighting potential biases and weaknesses.
Contribution
It provides a comprehensive analysis of LLMs as judges, identifying their strengths, weaknesses, and vulnerabilities, and emphasizes the need for careful use in complex evaluation scenarios.
Findings
Only the largest models achieve reasonable alignment with humans.
Scores assigned by judge models may differ from human-assigned scores by up to 5 points.
Vulnerabilities include sensitivity to prompt complexity and a tendency toward leniency.
Abstract
Offering a promising solution to the scalability challenges associated with human evaluation, the LLM-as-a-judge paradigm is rapidly gaining traction as an approach to evaluating large language models (LLMs). However, there are still many open questions about the strengths and weaknesses of this paradigm, and what potential biases it may hold. In this paper, we present a comprehensive study of the performance of various LLMs acting as judges, focusing on a clean scenario in which inter-human agreement is high. Investigating thirteen judge models of different model sizes and families, judging answers of nine different 'exam-taker models' - both base and instruction-tuned - we find that only the best (and largest) models achieve reasonable alignment with humans. However, they are still quite far behind inter-human agreement and their assigned scores may still differ by up to 5 points…
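The two quantities the abstract contrasts, agreement with human verdicts and the gap between judge-assigned and human-assigned scores, can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names and the toy verdicts are assumptions made up for the example, and each verdict is taken to be a binary correct/incorrect label on one exam-taker answer.

```python
from typing import Sequence


def percent_agreement(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Percentage of answers on which the judge's verdict equals the human's."""
    if len(human) != len(judge) or not human:
        raise ValueError("verdict lists must be non-empty and of equal length")
    matches = sum(h == j for h, j in zip(human, judge))
    return 100.0 * matches / len(human)


def score(verdicts: Sequence[bool]) -> float:
    """Exam-taker score: percentage of answers marked correct."""
    if not verdicts:
        raise ValueError("verdict list must be non-empty")
    return 100.0 * sum(verdicts) / len(verdicts)


# Toy verdicts (invented for illustration): the judge marks two answers
# correct that the human annotator marked incorrect.
human = [True, True, False, True, False, False, True, False, True, True]
judge = [True, True, True, True, False, False, True, True, True, True]

agreement = percent_agreement(human, judge)
score_gap = score(judge) - score(human)
print(f"agreement: {agreement:.1f}%  score gap: {score_gap:+.1f} points")
```

A positive score gap means the judge awards the exam-taker more points than the human annotator does, which is the direction a lenient judge would err in.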
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. The analyses of the LLM judges' sensitivity to prompts, error types, lack of robustness, and leniency bias are interesting and valuable to future studies.
2. The paper is well-written and the findings are clearly presented.
It appears that some of the main findings in this work are not well supported, may lack generalizability, or have been discussed in previous work.
1. Lack of generalizability: The task setting of the LLM judges selected in this work is reference-based evaluation of QA, which differs from the common application scenario where LLM judges evaluate various tasks without a gold reference (e.g., AlpacaEval, Arena Hard). Access to gold references makes the evaluation task significantly easier.
- They focus on a specific scenario with high inter-human agreement, which is an attempt to isolate judge-model behavior from task ambiguity.
- Several dimensions are explored: 1) model sizes and families, 2) multiple metrics, 3) error analysis.
- Insights such as "smaller models can rank exam-takers as effectively as larger ones", and the attempted explanation that "chat models may 'unlearn' some knowledge during alignment".
- The work also provides some recommendations for pra
- The scope remains limited to TriviaQA. For "short, factual" answers, consider adding the "LLMBar" datasets, which have high human agreement rates (> 90%). Sufficient examples can be used according to your dataset selection criteria [1]. Without the inclusion of additional datasets [1], it remains unclear how well the ranking ability would transfer.
- The original claim (lines 316-318) about judge performance being worse at identifying correct answers could be an artifact of including metrics tha
* The explanation of why Scott's Pi should be used instead of Kappa in judging-the-judges scenarios is a significant contribution that will benefit future researchers.
* The comprehensive analysis across multiple dimensions (alignment metrics, ranking correlation, error analysis) provides valuable insights into the strengths and limitations of different judge models.
* The comparison between LLM judges and lexical judges (EM, Contains) offers a novel and important perspective. This insight becom
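For readers unfamiliar with the two statistics this review contrasts, the sketch below computes both on the same pair of label lists. It is an illustration under stated assumptions, not the paper's implementation: the function names and toy labels are made up, and Kappa is taken to mean Cohen's kappa. Both statistics have the form (p_o - p_e) / (1 - p_e); they differ only in how the chance agreement p_e is estimated. Cohen's kappa multiplies each rater's own label frequencies, while Scott's pi squares the label frequencies pooled across both raters.

```python
from collections import Counter
from typing import Hashable, Sequence


def _observed_agreement(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def _chance_corrected(p_o: float, p_e: float) -> float:
    if p_e == 1.0:
        raise ValueError("undefined when expected agreement is 1")
    return (p_o - p_e) / (1.0 - p_e)


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Chance agreement from each rater's own label distribution."""
    p_o = _observed_agreement(a, b)
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    labels = set(count_a) | set(count_b)
    p_e = sum((count_a[k] / n) * (count_b[k] / n) for k in labels)
    return _chance_corrected(p_o, p_e)


def scotts_pi(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Chance agreement from the label distribution pooled over both raters."""
    p_o = _observed_agreement(a, b)
    n = len(a)
    pooled = Counter(a) + Counter(b)
    p_e = sum((count / (2 * n)) ** 2 for count in pooled.values())
    return _chance_corrected(p_o, p_e)


# Toy labels (invented): the judge says "correct" more often than the human.
human = ["correct"] * 6 + ["incorrect"] * 4
judge = ["correct"] * 8 + ["incorrect"] * 2

kappa = cohens_kappa(human, judge)
pi = scotts_pi(human, judge)
print(f"Cohen's kappa: {kappa:.3f}  Scott's pi: {pi:.3f}")
```

When the two raters use the labels at different rates, as in the toy data, the pooled estimate of chance agreement is larger, so Scott's pi comes out lower than Cohen's kappa for the same observed agreement.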
* The evaluation relies solely on TriviaQA, making it difficult to deconfound the root cause: whether the best model's performance stems from better alignment, knowledge of TriviaQA content, or simply being favored by other LLMs. Other unusual findings may also be specific to TriviaQA: in Figure 1.a, EM's instability compared to Contains likely results from references providing multiple correct answers.
* The paper lacks sufficient content for an ICLR long paper. I suggest expanding the scope by
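The two lexical judges named in this review, EM (exact match) and Contains, can be sketched against a list of reference answers as below. This is a hedged illustration only: the normalization steps (lowercasing, stripping punctuation, collapsing whitespace) and the function names are assumptions, and the paper's own matching rules may differ.

```python
import string
from typing import Sequence

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace (assumed rules)."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def exact_match(answer: str, references: Sequence[str]) -> bool:
    """True if the normalized answer equals any normalized reference."""
    normalized_answer = normalize(answer)
    return any(normalized_answer == normalize(ref) for ref in references)


def contains(answer: str, references: Sequence[str]) -> bool:
    """True if any non-empty normalized reference is a substring of the answer."""
    normalized_answer = normalize(answer)
    normalized_refs = (normalize(ref) for ref in references)
    return any(ref and ref in normalized_answer for ref in normalized_refs)


# Toy example (invented): a verbose answer fails EM but passes Contains.
references = ["Paris", "Paris, France"]
verbose_answer = "The capital is Paris."
short_answer = "paris"

print(exact_match(verbose_answer, references), contains(verbose_answer, references))
print(exact_match(short_answer, references), contains(short_answer, references))
```

Under these assumptions EM accepts an answer only when it coincides with one of the listed references after normalization, so its verdict depends on how many reference variants are provided, whereas Contains accepts any answer that embeds a reference.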
* The paper is clear and nicely visualizes the relevant findings.
* The authors explore a dozen models as judges.
* The authors use manual annotation to carefully unpack the judge behavior, especially the observation about judge leniency.
* As a reader, I have difficulty understanding the overarching goals of the paper. Typically, researchers use LLMs as judges for longer-form, more subjective questions, where answer coherence, style, and correctness are all part of the judgement. But for TriviaQA, the chosen dataset, the questions have clear, short answers with reference documents, meaning Exact Match is already a strong metric. Here, humans are simply reporting the binary value “correct” vs “incorrect” on the model answers
* Thorough and timely study.
* Several interesting experiments.
* I would have liked to see more datasets; in fact, I would suggest reducing the number of exam-takers (e.g., I find the base models less interesting) and using different datasets.
Taxonomy
Topics: Law, Economics, and Judicial Systems · Legal Education and Practice Innovations · Legal Systems and Judicial Processes
Methods: Residual Connection · Softmax · Balanced Selection · Layer Normalization · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Linear Layer · Multi-Head Attention
