Self-Preference Bias in LLM-as-a-Judge
Koki Wataoka, Tsubasa Takahashi, Ryokan Ri

TL;DR
This paper introduces a new metric to quantify self-preference bias in LLM evaluators, revealing that models like GPT-4 favor familiar, low-perplexity outputs, which can skew automated dialogue system assessments.
Contribution
The paper presents a novel quantitative measure of self-preference bias and analyzes a possible root cause: LLM evaluators' preference for outputs with lower perplexity.
Findings
GPT-4 exhibits significant self-preference bias.
LLMs prefer outputs with lower perplexity, regardless of origin.
Bias can skew automated evaluation results.
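Since the findings hinge on perplexity, a minimal sketch of how it is conventionally computed may help. It uses the standard definition (the exponential of the mean negative log-likelihood per token). The function name and the log-probability values are hypothetical, chosen for illustration; they are not taken from the paper.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities.

    Standard definition: exp of the mean negative log-likelihood.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# Hypothetical log-probabilities an evaluator model might assign
# to the tokens of two candidate responses (illustrative values only).
familiar = [-0.2, -0.1, -0.3, -0.2]
unfamiliar = [-1.5, -2.0, -0.9, -1.6]

ppl_familiar = perplexity(familiar)
ppl_unfamiliar = perplexity(unfamiliar)

# The response the model finds more predictable has the lower perplexity.
print(ppl_familiar < ppl_unfamiliar)
```

Under the paper's hypothesis, the evaluator would tend to favor the first response, whichever model produced it.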
Abstract
Automated evaluation leveraging large language models (LLMs), commonly referred to as LLM evaluators or LLM-as-a-judge, has been widely used in measuring the performance of dialogue systems. However, the self-preference bias in LLMs has posed significant risks, including promoting specific styles or policies intrinsic to the LLMs. Despite the importance of this issue, there is a lack of established methods to measure the self-preference bias quantitatively, and its underlying causes are poorly understood. In this paper, we introduce a novel quantitative metric to measure the self-preference bias. Our experimental results demonstrate that GPT-4 exhibits a significant degree of self-preference bias. To explore the causes, we hypothesize that LLMs may favor outputs that are more familiar to them, as indicated by lower perplexity. We analyze the relationship between LLM evaluations and the…
Peer Reviews
Decision·Submitted to ICLR 2025
1. The research introduces a new metric for quantifying self-preference bias in LLMs, providing a tool for evaluating the performance of language models in a systematic manner.
2. The study assesses the self-preference bias across eight different LLMs, offering a broad perspective on the prevalence of this bias within various models, particularly highlighting the significant findings related to GPT-4.
3. The paper explores the relationship between LLM evaluations and output perplexity, reveali
1. The formula used to quantify this bias is unreasonable. Using a static probabilistic model may fail to capture the dynamic characteristics of model behavior, affecting the applicability and utility of the findings. Additionally, as shown in Figure 2, GPT-4 correctly identifies a significant number of cases in both True and Predicted values. This leads to a large value in the first term of the formula. GPT-4’s considerably stronger performance compared to other models impacts the bias result.
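The metric's formula is not reproduced on this page. The reviews describe it as having two terms, each a percentage computed over one true-label class of a confusion matrix. One reading consistent with that description is a difference of per-class agreement rates, sketched below. This is an assumption made for illustration, not the authors' stated definition, and the function name, parameter names, and counts are all hypothetical.

```python
def self_preference_bias(own_agree, own_total, other_agree, other_total):
    """Assumed two-term form: difference of per-class agreement rates.

    own_agree / own_total:     cases where humans preferred the evaluator's
                               own response, and how often the evaluator agreed.
    other_agree / other_total: cases where humans preferred the other
                               response, and how often the evaluator agreed.
    A positive value means the evaluator agrees with humans more often
    when the human verdict favors its own output.
    """
    if own_total <= 0 or other_total <= 0:
        raise ValueError("both classes need at least one comparison")
    first_term = own_agree / own_total
    second_term = other_agree / other_total
    return first_term - second_term

# Hypothetical counts, not taken from the paper's figures.
bias = self_preference_bias(own_agree=90, own_total=100,
                            other_agree=60, other_total=100)
print(round(bias, 3))
```

Under this reading, a high agreement rate in the first class raises the first term by itself, which is the effect the reviewer points to.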
- The writing is good in general; the content is a bit short, but the overall impression is positive.
- Although the idea of self-preference bias is not new, this paper proposes an approach to quantify it.
- The paper disentangles positional bias from self-preference bias during quantification.
- The observation that LLM evaluators favor candidates with lower perplexity is insightful and interesting.
- The formula for the metric is valid, but there are some concerns when calculating it:
  - The main concern for the metric is class imbalance. For example, in Fig. 2 there are 108+1852 = 1960 comparisons where GPT-4 wins, but far fewer, 118+160 = 278, where GPT-4 loses. Comparing percentages calculated on these two groups raises the concern that the bias estimate is less accurate when the true-label classes are more imbalanced. This means if an LLM is more preferred compared to others
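The imbalance concern can be made concrete with the standard error of a proportion. The sketch below uses only the two class totals quoted in the review (1960 and 278). The worst-case proportion of 0.5 and the independence of the two rates are assumptions, and the helper name is hypothetical.

```python
import math

def proportion_standard_error(p, n):
    """Standard error of a sample proportion p estimated from n trials."""
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(p * (1.0 - p) / n)

# Class totals quoted in the review for Fig. 2.
n_wins = 108 + 1852    # comparisons where GPT-4 wins
n_losses = 118 + 160   # comparisons where GPT-4 loses

# Assumption: p = 0.5 gives the largest possible standard error.
p_assumed = 0.5
se_wins = proportion_standard_error(p_assumed, n_wins)
se_losses = proportion_standard_error(p_assumed, n_losses)

# Assumption: the two rates are independent, so variances add.
se_difference = math.sqrt(se_wins ** 2 + se_losses ** 2)

print(f"{se_wins:.4f} {se_losses:.4f} {se_difference:.4f}")
```

With these totals, the rate computed on the smaller class is roughly 2.7 times noisier than the rate computed on the larger one, so the uncertainty of any difference between them is dominated by the smaller class.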
The study of LLMs as evaluation metrics is critical, and it is important to keep examining the pros and cons of these metrics. We already know that these LLMs are good at a variety of tasks, but if we continue to rely on them as metrics, then we are self-reinforcing that the responses they select are good and injecting the metric's bias into our system. Therefore, the authors' explanation of where the bias comes from (perplexity) can help the community then think
One concern I have is how novel the conclusions are and how this differs from previous work.
1. Deutsch et al. (2022) gets cited as work that looked at bias within these LLMs (https://aclanthology.org/2022.emnlp-main.753/). In that work it is mentioned "Not only do they favor the underlying models’ outputs, but they are also biased toward outputs from models which are similar to their own". What sets this paper apart if Deutsch et al. (2022) is already looking at this same type of b
Taxonomy
Topics: Law, Economics, and Judicial Systems · Dispute Resolution and Class Actions · Judicial and Constitutional Studies
Methods: Attention Is All You Need · Linear Layer · Dense Connections · Label Smoothing · Layer Normalization · Residual Connection · Byte Pair Encoding · Absolute Position Encodings · Multi-Head Attention · Softmax
