TL;DR
This study compares nine debiasing strategies for LLM judges across five judge models and three benchmarks. It finds that style bias far exceeds position bias and that debiasing helps, though the benefit depends on the model.
Contribution
A comprehensive empirical comparison of nine debiasing strategies across five LLM judges from four provider families (Google, Anthropic, OpenAI, Meta), three benchmarks, and four bias types, highlighting the main bias patterns and how well each mitigation works.
Findings
Style bias is the dominant bias (0.76-0.92 across all models), far exceeding position bias (<= 0.04).
Debiasing improves evaluation agreement, notably +11.2 pp for Claude Sonnet 4.
All models prefer the more concise response on expansion pairs, but truncation controls confirm they distinguish quality from length (0.92-1.00 accuracy).
Abstract
LLM-as-a-Judge has become the dominant paradigm for evaluating language model outputs, yet LLM judges exhibit systematic biases that compromise evaluation reliability. We present a comprehensive empirical study comparing nine debiasing strategies across five judge models from four provider families (Google, Anthropic, OpenAI, Meta), three benchmarks (MT-Bench n=400, LLMBar n=200, custom n=225), and four bias types. Our key findings: (1) Style bias is the dominant bias (0.76-0.92 across all models), far exceeding position bias (<= 0.04), yet has received minimal research attention. (2) All models show a conciseness preference on expansion pairs, but truncation controls confirm they correctly distinguish quality from length (0.92-1.00 accuracy), suggesting quality-sensitive evaluation rather than a simple length bias. (3) Debiasing is beneficial but model-dependent: the combined budget…
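As an illustration of how a position-bias score for a pairwise judge might be computed, here is a minimal sketch. It assumes the score is the fraction of pairs whose verdict changes when the two responses are presented in swapped order; the abstract does not state the paper's exact metric, so this definition is an assumption. The names `position_flip_rate`, `always_first`, and `longer_wins` are hypothetical, and the toy judges stand in for real LLM calls.

```python
from typing import Callable, Sequence, Tuple

# Assumption: a judge takes (prompt, response_a, response_b) and returns "A" or "B".
Judge = Callable[[str, str, str], str]


def position_flip_rate(judge: Judge, pairs: Sequence[Tuple[str, str, str]]) -> float:
    """Fraction of pairs whose verdict changes when the response order is swapped.

    This is an assumed definition of position bias, not necessarily the
    metric used in the paper.
    """
    if not pairs:
        raise ValueError("pairs must be non-empty")
    flips = 0
    for prompt, first, second in pairs:
        original = judge(prompt, first, second)
        swapped = judge(prompt, second, first)
        # Map the swapped verdict back onto the original labelling.
        swapped_mapped = "A" if swapped == "B" else "B"
        if original != swapped_mapped:
            flips += 1
    return flips / len(pairs)


# Toy judges standing in for LLM calls (hypothetical).
def always_first(prompt: str, a: str, b: str) -> str:
    # Maximally position-biased: always picks whatever is shown first.
    return "A"


def longer_wins(prompt: str, a: str, b: str) -> str:
    # Order-independent when lengths differ: picks the longer response.
    return "A" if len(a) >= len(b) else "B"


toy_pairs = [
    ("q1", "short", "a much longer answer"),
    ("q2", "a fairly detailed reply", "brief"),
]
biased_rate = position_flip_rate(always_first, toy_pairs)
consistent_rate = position_flip_rate(longer_wins, toy_pairs)
```

Under this definition, a judge that always picks the first slot flips on every pair, while a judge whose verdict depends only on content never flips. A style-bias score could be sketched in a similar way by comparing verdicts before and after restyling one response, though that too would be an assumption about the paper's method.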
