Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines

Sadman Kabir Soumik

arXiv:2604.23178·cs.AI·April 28, 2026

TL;DR

This study systematically evaluates bias mitigation strategies for LLM judges across multiple models and benchmarks. It finds that style bias far exceeds position bias and that debiasing helps, though the benefit is model-dependent.

Contribution

It provides an empirical comparison of nine debiasing strategies across five judge models from four provider families (Google, Anthropic, OpenAI, Meta), three benchmarks, and four bias types, highlighting key bias patterns and how well each mitigation works.

Findings

01

Style bias is the dominant bias (0.76-0.92 across all models), far exceeding position bias (<= 0.04).

02

Debiasing is beneficial but model-dependent, with a notable +11.2 pp gain in evaluation agreement for Claude Sonnet 4.

03

All models prefer the more concise response on expansion pairs, but truncation controls show they correctly distinguish quality from length (0.92-1.00 accuracy).

Abstract

LLM-as-a-Judge has become the dominant paradigm for evaluating language model outputs, yet LLM judges exhibit systematic biases that compromise evaluation reliability. We present a comprehensive empirical study comparing nine debiasing strategies across five judge models from four provider families (Google, Anthropic, OpenAI, Meta), three benchmarks (MT-Bench n=400, LLMBar n=200, custom n=225), and four bias types. Our key findings: (1) Style bias is the dominant bias (0.76-0.92 across all models), far exceeding position bias (<= 0.04), yet has received minimal research attention. (2) All models show a conciseness preference on expansion pairs, but truncation controls confirm they correctly distinguish quality from length (0.92-1.00 accuracy), suggesting quality-sensitive evaluation rather than a simple length bias. (3) Debiasing is beneficial but model-dependent: the combined budget…
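The abstract reports a position bias score but this excerpt does not define how it is computed. One common way to probe position bias in pairwise judging is a swap-consistency check: judge each pair twice with the response order reversed and count how often the preferred response changes. The following is a minimal sketch under that assumption, not the paper's metric. The `Judge` signature, the two toy judges, and `TOY_PAIRS` are hypothetical stand-ins for a real LLM call and real benchmark data.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge signature (an assumption, not from the paper):
# (prompt, response_a, response_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]


def position_flip_rate(judge: Judge, pairs: Sequence[Tuple[str, str, str]]) -> float:
    """Fraction of pairs whose preferred response changes when the order is swapped."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    flips = 0
    for prompt, first, second in pairs:
        original = judge(prompt, first, second)
        swapped = judge(prompt, second, first)
        # Map the swapped verdict back onto the original labelling:
        # "A" in the swapped call refers to what was originally "B".
        swapped_mapped = "B" if swapped == "A" else "A"
        if original != swapped_mapped:
            flips += 1
    return flips / len(pairs)


# Toy judges standing in for an LLM call (illustrative only).
def always_first_judge(prompt: str, a: str, b: str) -> str:
    """Maximally position-biased: always prefers whichever response is shown first."""
    return "A"


def longer_is_better_judge(prompt: str, a: str, b: str) -> str:
    """Position-independent when lengths differ: prefers the longer response."""
    return "A" if len(a) >= len(b) else "B"


# Made-up pairs with responses of different lengths (not benchmark data).
TOY_PAIRS = [
    ("q1", "short", "a much longer answer"),
    ("q2", "a detailed reply here", "ok"),
]

if __name__ == "__main__":
    print(position_flip_rate(always_first_judge, TOY_PAIRS))
    print(position_flip_rate(longer_is_better_judge, TOY_PAIRS))
```

A judge that always picks the first slot flips on every pair, while a judge whose choice depends only on the content never flips. Note that the second toy judge is still biased, just toward length rather than position, which is why a low flip rate alone says nothing about style or length bias.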

Code & Models

Repositories

sksoumik/llm-as-judge
github
