Bias and Uncertainty in LLM-as-a-Judge Estimation

James Fiedler

arXiv:2605.06939·cs.LG·May 11, 2026

TL;DR

This paper shows that raw LLM-judge outputs give systematically biased estimates of model performance, analyzes when bias-corrected estimators fail (notably when calibration is shared across compared models), and proposes the diagnostics J and ΔJ for deciding when corrected estimates can be trusted.

Contribution

It analyzes failure modes of bias correction methods in LLM-based evaluation, introduces diagnostics for judge quality and calibration stability, and offers reporting guidance.

Findings

01

Naive judge outputs are systematically biased.

02

Shared calibration can cause severe bias and sign reversal.

03

Proposed diagnostics J and ΔJ help assess estimate reliability.

Abstract

LLM-as-a-Judge evaluation has become a standard tool for assessing base model performance. However, characterizing performance via the naive estimator, i.e., raw judge outputs, is systematically biased. Recent work has proposed estimators to correct this bias, but their reliability depends critically on judge quality and, for model comparisons, on calibration stability. Sharing calibration across compared models is practically attractive but can introduce severe bias, including cases where the comparison estimate points in the wrong direction with high apparent confidence. We study these failure modes through analytical results, simulations over judge quality ($J$) and cross-model calibration instability ($\Delta J$), and a real-data MMLU-Pro case study with sign reversal. We propose $J$ and $\Delta J$ as diagnostics for when corrected estimates, especially shared-calibration…
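
The mechanism behind the sign reversal can be illustrated with a small, purely analytical sketch. It is not the paper's estimator: it assumes a binary judge described by a sensitivity and a specificity, uses a standard Rogan-Gladen style misclassification correction, and assumes that $J$ is Youden's index (sensitivity + specificity - 1) and that $\Delta J$ is the difference in $J$ between the two compared models. The excerpt confirms none of these definitions, and all numbers and function names below are hypothetical.

```python
def naive_rate(p, sens, spec):
    """Expected share of 'correct' verdicts when the true accuracy is p."""
    return p * sens + (1 - p) * (1 - spec)


def corrected_rate(q, sens, spec):
    """Invert naive_rate using calibration values (sens, spec)."""
    j = sens + spec - 1  # assumed definition of judge quality J
    if j <= 0:
        raise ValueError("judge is no better than chance; correction undefined")
    return (q - (1 - spec)) / j


# Hypothetical numbers, chosen only to illustrate the mechanism.
p_a, p_b = 0.70, 0.72      # true accuracies: model B is actually better
cal_a = (0.90, 0.80)       # judge (sens, spec) on model A outputs, J = 0.70
cal_b = (0.85, 0.80)       # judge (sens, spec) on model B outputs, J = 0.65

q_a = naive_rate(p_a, *cal_a)  # expected raw judge score for A
q_b = naive_rate(p_b, *cal_b)  # expected raw judge score for B

# Naive comparison: difference of raw judge scores.
naive_diff = q_b - q_a
# Shared calibration: correct both models with A's calibration values.
shared_diff = corrected_rate(q_b, *cal_a) - corrected_rate(q_a, *cal_a)
# Per-model calibration: correct each model with its own values.
separate_diff = corrected_rate(q_b, *cal_b) - corrected_rate(q_a, *cal_a)
# Assumed instability diagnostic: J on model B minus J on model A.
delta_j = (sum(cal_b) - 1) - (sum(cal_a) - 1)

print(f"true diff      = {p_b - p_a:+.4f}")
print(f"naive diff     = {naive_diff:+.4f}")
print(f"shared-cal     = {shared_diff:+.4f}")
print(f"per-model cal  = {separate_diff:+.4f}")
print(f"delta J        = {delta_j:+.4f}")
```

Under these assumed numbers, the per-model correction recovers the true positive difference, while both the naive and the shared-calibration comparisons come out negative, even though the judge's assumed quality differs by only 0.05 between the two models. The sketch works with expectations only, so it says nothing about sampling uncertainty or confidence intervals.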
