# Quantifying Label-Induced Bias in Large Language Model Self- and Cross-Evaluations

**Authors:** Muskan Saraf, Sajjad Rezvani Boroujeni, Justin Beaudry, Hossein Abedi, Tom Bush

arXiv: 2508.21164 · 2025-10-13

## TL;DR

This paper investigates how perceived model identity biases LLM evaluations of LLM-generated text. It finds pronounced label-driven asymmetries in both preference votes and quality ratings, and argues for blind, multi-model evaluation protocols to keep judgments fair and valid.

## Contribution

The paper systematically quantifies label-induced bias in LLM self- and cross-evaluations, shows that attribution labels shift judgments independent of content quality, and recommends blind evaluation with diverse multi-model validation.

## Key findings

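- The "Claude" label consistently elevates scores regardless of actual authorship, and Claude shows intensified self-preference under true labels.
- The "Gemini" label systematically depresses scores, and Gemini shows severe self-deprecation under true labels.
- False attribution frequently reverses preference rankings, shifting voting outcomes by up to 50 percentage points and quality ratings by up to 12 percentage points.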

## Abstract

Large language models (LLMs) are increasingly deployed as evaluators of text quality, yet the validity of their judgments remains underexplored. This study investigates systematic bias in self- and cross-model evaluations across three prominent LLMs: ChatGPT, Gemini, and Claude. We designed a controlled experiment in which blog posts authored by each model were evaluated by all three models under four labeling conditions: no attribution, true attribution, and two false-attribution scenarios. Evaluations employed both holistic preference voting and granular quality ratings across three dimensions (Coherence, Informativeness, and Conciseness), with all scores normalized to percentages for direct comparison. Our findings reveal pronounced asymmetries in model judgments: the "Claude" label consistently elevated scores regardless of actual authorship, while the "Gemini" label systematically depressed them. False attribution frequently reversed preference rankings, producing shifts of up to 50 percentage points in voting outcomes and up to 12 percentage points in quality ratings. Notably, Gemini exhibited severe self-deprecation under true labels, while Claude demonstrated intensified self-preference. These results demonstrate that perceived model identity can substantially distort both high-level judgments and fine-grained quality assessments, independent of content quality. Our findings challenge the reliability of LLM-as-judge paradigms and underscore the critical need for blind evaluation protocols and diverse multi-model validation frameworks to ensure fairness and validity in automated text evaluation and LLM benchmarking.
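
The experimental design in the abstract (three authors, three evaluators, four labeling conditions, scores normalized to percentages) can be illustrated with a minimal sketch. This is not the authors' code. It assumes that the two false-attribution scenarios show the names of the two non-author models, and that normalization is a simple min-max rescaling of a rating scale whose bounds are passed in; the function names (`label_conditions`, `design_grid`, `to_percent`, `shift_pp`) are hypothetical.

```python
from itertools import product

# The three models named in the abstract.
MODELS = ["ChatGPT", "Gemini", "Claude"]


def label_conditions(author):
    """Return the four labeling conditions for a post written by `author`.

    Assumption: each false-attribution scenario shows one of the two
    other model names. None means no attribution is shown (blind).
    """
    others = [m for m in MODELS if m != author]
    return {
        "none": None,
        "true": author,
        "false_1": others[0],
        "false_2": others[1],
    }


def design_grid():
    """Enumerate every (author, evaluator, condition, shown_label) cell."""
    return [
        (author, evaluator, condition, label)
        for author, evaluator in product(MODELS, MODELS)
        for condition, label in label_conditions(author).items()
    ]


def to_percent(score, lo, hi):
    """Rescale a raw rating on [lo, hi] to a percentage.

    Assumption: min-max rescaling; the paper's exact formula is not
    given in this summary.
    """
    return 100.0 * (score - lo) / (hi - lo)


def shift_pp(labeled_pct, blind_pct):
    """Label-induced shift in percentage points relative to the blind condition."""
    return labeled_pct - blind_pct
```

Under these assumptions `design_grid()` yields 3 authors × 3 evaluators × 4 conditions, and comparing each labeled cell against its blind counterpart with `shift_pp` gives the kind of percentage-point shift the abstract reports.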

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.21164/full.md

## Figures

11 figures with captions in the complete paper: https://tomesphere.com/paper/2508.21164/full.md

## References

15 references — full list in the complete paper: https://tomesphere.com/paper/2508.21164/full.md

---
Source: https://tomesphere.com/paper/2508.21164