Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge
Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, Nitesh V Chawla, Xiangliang Zhang

TL;DR
This paper systematically quantifies biases in LLM-as-a-Judge using a new framework, revealing persistent biases in advanced models and emphasizing the need for improved reliability and cautious application.
Contribution
Introduces CALM, an automated, principle-guided framework for quantifying and analyzing biases in LLM-as-a-Judge, addressing a critical gap in evaluation reliability.
Findings
Advanced models still exhibit significant biases in specific tasks.
The CALM framework effectively identifies and measures multiple bias types.
Biases impact the reliability and fairness of LLM-as-a-Judge applications.
Abstract
LLM-as-a-Judge has been widely utilized as an evaluation method in various benchmarks and has served as supervised rewards in model training. However, despite its excellence in many domains, its potential issues are under-explored, undermining its reliability and the scope of its utility. Therefore, we identify 12 key potential biases and propose a new automated bias quantification framework, CALM, which systematically quantifies and analyzes each type of bias in LLM-as-a-Judge by using automated and principle-guided modification. Our experiments cover multiple popular language models, and the results indicate that while advanced models have achieved commendable overall performance, significant biases persist in certain specific tasks. Empirical results suggest that there remains room for improvement in the reliability of LLM-as-a-Judge. Moreover, we also discuss the explicit and implicit…
Peer Reviews
Decision·ICLR 2025 Poster
- Originality: The authors expand upon existing work by identifying and categorizing 12 distinct types of biases.
- Quality: The paper presents a thorough evaluation of the identified biases across multiple LLMs, using diverse datasets and specific metrics tailored for judging tasks. This rigorous experimental design ensures the reliability and validity of the findings.
- Clarity: Table 1 provides concrete examples of how biases manifest in LLM judgments, making the abstract conce
- I think this paper is more of a toolkit paper than a novel research paper, as it just integrates 12 types of existing biases in LLM-as-a-Judge. Appendix B shows that each of the 12 types can be traced to a previous paper.
- The paper primarily relies on automated metrics to assess bias, but human evaluation could provide a valuable additional perspective. Incorporating a human evaluation benchmark would strengthen the validation of the findings.
1. A comprehensive delineation and classification of twelve specific biases that can compromise the dependability and credibility of LLM-as-a-Judge.
2. The proposal of the CALM framework for assessing biases within LLM-as-a-Judge systems, which enhances the rigor of the assessment process in a person-independent manner.
3. An in-depth analysis of six widely-used LLMs through the lens of the CALM framework.
- Lack of Transparency in Assessment Criteria: The source of the basis for the assessments of Robustness Rate (RR) and Consistency Rate (CR) is unclear.
- Incomplete Consideration of Popular Models: The evaluation does not include some well-known LLM-as-a-Judge models, such as PandaLM and Prometheus. This omission suggests a lack of thoroughness and may lead to biased or incomplete conclusions.
- Questionable Data Selection Process: The method for selecting data is not well-defined. For instance, in th
I believe the paper is strong, well-written, and highly comprehensive. The topic is both timely and important, and the NLP/LLM community would greatly benefit from its publication. In my opinion, this paper should be accepted. The reason I initially rated it a 6 instead of an 8 is to encourage the authors to consider revising the metrics (as discussed in the weaknesses section).
**Metrics**: I have a few suggestions regarding the metrics used in this paper and how they are presented in the results. First, for the RR and CR metrics, I recommend making the CR metric more robust by sampling multiple generations when computing the CR for a given instance. Additionally, I propose adjusting the RR metric with the CR, as the authors note that LLMs are non-deterministic, and a low RR score might reflect this rather than a genuine lack of robustness. To adjust the score for each
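The reviewer's first suggestion, sampling several generations per instance when computing CR, can be illustrated with a small sketch. It is not the paper's implementation. It assumes that RR is the share of instances whose verdict is unchanged after a bias-inducing modification, and that CR is the agreement among repeated verdicts on the same unmodified input; the function names and the majority-vote agreement measure are hypothetical choices made for illustration. The RR adjustment the reviewer goes on to propose is cut off in this excerpt, so it is deliberately not implemented here.

```python
from collections import Counter
from typing import Sequence


def robustness_rate(original: Sequence[str], perturbed: Sequence[str]) -> float:
    """Share of instances whose verdict is unchanged after the modification.

    Assumed definition of RR, for illustration only.
    """
    if len(original) != len(perturbed):
        raise ValueError("verdict lists must have equal length")
    if not original:
        raise ValueError("need at least one instance")
    unchanged = sum(o == p for o, p in zip(original, perturbed))
    return unchanged / len(original)


def consistency_rate(samples_per_instance: Sequence[Sequence[str]]) -> float:
    """Mean share of repeated verdicts that match the majority verdict.

    Each inner sequence holds several sampled verdicts for one unmodified
    instance (assumed multi-sample variant of CR, for illustration only).
    """
    if not samples_per_instance:
        raise ValueError("need at least one instance")
    total = 0.0
    for samples in samples_per_instance:
        if not samples:
            raise ValueError("each instance needs at least one sample")
        majority_count = Counter(samples).most_common(1)[0][1]
        total += majority_count / len(samples)
    return total / len(samples_per_instance)


# Toy verdicts ("A" or "B" wins) for four pairwise comparisons.
original_verdicts = ["A", "B", "A", "A"]
perturbed_verdicts = ["A", "A", "A", "B"]
repeated_verdicts = [
    ["A", "A", "A"],
    ["B", "A", "B"],
    ["A", "A", "A"],
    ["A", "B", "A"],
]

rr = robustness_rate(original_verdicts, perturbed_verdicts)
cr = consistency_rate(repeated_verdicts)
print(f"RR = {rr:.3f}, CR = {cr:.3f}")
```

Reporting the two numbers side by side makes the reviewer's concern concrete: when CR is well below 1, part of a low RR may come from sampling noise in the judge rather than from the bias being tested.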
Taxonomy
Topics: Law, Economics, and Judicial Systems · Judicial and Constitutional Studies · Dispute Resolution and Class Actions
