Self-Preference Bias in LLM-as-a-Judge

Koki Wataoka; Tsubasa Takahashi; Ryokan Ri

arXiv:2410.21819·cs.CL·June 24, 2025·3 cites

Open Access · 3 Reviews

TL;DR

This paper introduces a new metric to quantify self-preference bias in LLM evaluators, revealing that models like GPT-4 favor familiar, low-perplexity outputs, which can skew automated dialogue system assessments.

Contribution

The paper presents a novel quantitative measure of self-preference bias and analyzes output perplexity as a possible root cause of that bias in LLM evaluations.

Findings

1. GPT-4 exhibits significant self-preference bias.
2. LLMs prefer outputs with lower perplexity, regardless of origin.
3. Bias can skew automated evaluation results.

Abstract

Automated evaluation leveraging large language models (LLMs), commonly referred to as LLM evaluators or LLM-as-a-judge, has been widely used in measuring the performance of dialogue systems. However, the self-preference bias in LLMs has posed significant risks, including promoting specific styles or policies intrinsic to the LLMs. Despite the importance of this issue, there is a lack of established methods to measure the self-preference bias quantitatively, and its underlying causes are poorly understood. In this paper, we introduce a novel quantitative metric to measure the self-preference bias. Our experimental results demonstrate that GPT-4 exhibits a significant degree of self-preference bias. To explore the causes, we hypothesize that LLMs may favor outputs that are more familiar to them, as indicated by lower perplexity. We analyze the relationship between LLM evaluations and the…
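The perplexity mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard definition (the exponential of the negative mean per-token log-probability) and uses made-up log-probabilities; the function name `perplexity` and both example sequences are hypothetical, and the paper's own computation may differ in detail.

```python
import math


def perplexity(token_logprobs):
    """Return exp(-mean(log p)) for a list of natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical per-token log-probabilities assigned by an evaluator model.
# A "familiar" response gets higher token probabilities than an "unfamiliar" one.
familiar = [math.log(0.5)] * 4
unfamiliar = [math.log(0.1)] * 4

ppl_familiar = perplexity(familiar)      # approximately 2
ppl_unfamiliar = perplexity(unfamiliar)  # approximately 10

print(f"familiar:   {ppl_familiar:.2f}")
print(f"unfamiliar: {ppl_unfamiliar:.2f}")
```

Under the abstract's hypothesis, the evaluator would tend to favor the response with the lower value here, independent of which model produced it.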

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 4

Strengths

1. The research introduces a new metric for quantifying self-preference bias in LLMs, providing a tool for evaluating the performance of language models in a systematic manner.
2. The study assesses the self-preference bias across eight different LLMs, offering a broad perspective on the prevalence of this bias within various models, particularly highlighting the significant findings related to GPT-4.
3. The paper explores the relationship between LLM evaluations and output perplexity, reveali

Weaknesses

1. The formula used to quantify this bias is unreasonable. Using a static probabilistic model may fail to capture the dynamic characteristics of model behavior, affecting the applicability and utility of the findings. Additionally, as shown in Figure 2, GPT-4 correctly identifies a significant number of cases in both True and Predicted values. This leads to a large value in the first term of the formula. GPT-4’s considerably stronger performance compared to other models impacts the bias result.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- The writing is good in general; the content is a bit short, but the overall impression is positive.
- Although the idea of self-preference bias is not new, this paper proposes an approach to quantify it.
- The paper disentangles positional bias from self-preference bias during quantification.
- The observation that LLM evaluators favor candidates with lower perplexity is insightful and interesting.

Weaknesses

- The formula for the metric is valid, but there are some concerns when calculating it:
  - The main concern for the metric is class imbalance. For example, in Fig. 2 there are 108+1852 = 1960 comparisons where GPT-4 wins, but far fewer, 118+160 = 278, where GPT-4 fails. Comparing percentages computed on these two groups raises the concern that the bias estimate may be less accurate when the true-label classes are more imbalanced. This means if an LLM is more preferred compared to others
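The class-imbalance concern can be made concrete with a rough sketch. It uses only the two class totals quoted in the review (1960 and 278) and the textbook worst-case standard error of a proportion, 0.5 / sqrt(n). The function name is hypothetical, and this is an illustration of why a rate estimated on the smaller class is noisier, not the paper's metric or analysis.

```python
import math


def worst_case_standard_error(n):
    """Upper bound on the standard error of a proportion from n samples (p = 0.5)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return 0.5 / math.sqrt(n)


# Class totals quoted by the reviewer from Fig. 2.
n_wins = 108 + 1852   # comparisons where GPT-4 wins
n_fails = 118 + 160   # comparisons where GPT-4 fails

se_wins = worst_case_standard_error(n_wins)
se_fails = worst_case_standard_error(n_fails)

# How much noisier a percentage on the smaller class can be.
ratio = se_fails / se_wins

print(f"n_wins={n_wins},  worst-case SE={se_wins:.4f}")
print(f"n_fails={n_fails}, worst-case SE={se_fails:.4f}")
print(f"SE ratio (fails / wins) = {ratio:.2f}")
```

Because the standard error scales with 1 / sqrt(n), a percentage computed on the 278 comparisons carries a larger worst-case error than one computed on the 1960, so a difference between the two is dominated by the uncertainty of the smaller class.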

Reviewer 03 · Rating 6 · Confidence 4

Strengths

The continued study of LLMs as evaluation metrics is critical and it is important to continue to study the pros/cons of these metrics. We already know that these LLMs are good at a variety of tasks but if we continue to rely on them as metrics then we are self-reinforcing that the responses they select are good and are then injecting the metric's bias into our system. Therefore the authors providing an explanation as to where the bias is coming from (perplexity) can help the community then think

Weaknesses

One concern I have is how novel the conclusions are and how this differs from previous work. 1. Deutsch et al. (2022) is cited as work that looked at bias within these LLMs (https://aclanthology.org/2022.emnlp-main.753/). That work states: "Not only do they favor the underlying models’ outputs, but they are also biased toward outputs from models which are similar to their own". What sets this paper apart if Deutsch et al. (2022) is already looking at this same type of b

Taxonomy

Topics: Law, Economics, and Judicial Systems · Dispute Resolution and Class Actions · Judicial and Constitutional Studies

Methods: Attention Is All You Need · Linear Layer · Dense Connections · Label Smoothing · Layer Normalization · Residual Connection · Byte Pair Encoding · Absolute Position Encodings · Multi-Head Attention · Softmax