Bias and Uncertainty in LLM-as-a-Judge Estimation
James Fiedler

TL;DR
This paper investigates biases and reliability issues in using large language models as judges for model evaluation, highlighting failure modes and proposing diagnostics for improved assessment accuracy.
Contribution
It analyzes failure modes of bias correction methods in LLM-based evaluation, introduces diagnostics for judge quality and calibration stability, and offers reporting guidance.
Findings
Naive judge outputs are systematically biased.
Shared calibration can cause severe bias and sign reversal.
Proposed diagnostics J and ΔJ help assess estimate reliability.
Abstract
LLM-as-a-Judge evaluation has become a standard tool for assessing base model performance. However, characterizing performance via the naive estimator, i.e., raw judge outputs, is systematically biased. Recent work has proposed estimators to correct this bias, but their reliability depends critically on judge quality and, for model comparisons, on calibration stability. Sharing calibration across compared models is practically attractive but can introduce severe bias, including cases where the comparison estimate points in the wrong direction with high apparent confidence. We study these failure modes through analytical results, simulations over judge quality (J) and cross-model calibration instability (ΔJ), and a real-data MMLU-Pro case study with sign reversal. We propose J and ΔJ as diagnostics for when corrected estimates, especially shared-calibration…
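The sign-reversal failure mode can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the judge is treated as a binary classifier with a sensitivity and a specificity, J is read as the Youden-style index sensitivity + specificity - 1, ΔJ as the difference in J between the two compared models, and the correction is a Rogan-Gladen-style estimator standing in for the corrected estimators the abstract mentions. The accuracies and judge rates are hypothetical numbers, and the code works with expectations rather than sampled data, so it shows bias only, not uncertainty.

```python
# Illustrative sketch only; the definitions of J and delta_j below are
# assumptions, not the paper's confirmed definitions.

def naive_expectation(p, sens, spec):
    """Expected raw judge pass rate when the true accuracy is p."""
    return p * sens + (1.0 - p) * (1.0 - spec)


def youden_j(sens, spec):
    """Assumed judge-quality index: sensitivity + specificity - 1."""
    return sens + spec - 1.0


def corrected_estimate(p_naive, sens, spec):
    """Rogan-Gladen-style correction of a raw judge pass rate."""
    j = youden_j(sens, spec)
    if j <= 0.0:
        raise ValueError("judge is no better than chance; correction undefined")
    p = (p_naive - (1.0 - spec)) / j
    return min(1.0, max(0.0, p))  # clip to the valid range [0, 1]


# Hypothetical true accuracies: model B is slightly better than model A.
acc_a, acc_b = 0.70, 0.72

# Hypothetical judge behaviour as (sensitivity, specificity); the judge
# is less sensitive on model B's outputs than on model A's.
judge_on_a = (0.95, 0.80)
judge_on_b = (0.85, 0.80)

naive_a = naive_expectation(acc_a, *judge_on_a)
naive_b = naive_expectation(acc_b, *judge_on_b)

true_diff = acc_b - acc_a
naive_diff = naive_b - naive_a

# Shared calibration: model A's judge rates are reused for model B.
shared_diff = (corrected_estimate(naive_b, *judge_on_a)
               - corrected_estimate(naive_a, *judge_on_a))

# Per-model calibration: each model is corrected with its own rates.
own_diff = (corrected_estimate(naive_b, *judge_on_b)
            - corrected_estimate(naive_a, *judge_on_a))

# Assumed instability diagnostic: change in J between the two models.
delta_j = youden_j(*judge_on_b) - youden_j(*judge_on_a)

print(f"true diff (B - A):         {true_diff:+.3f}")
print(f"naive diff:                {naive_diff:+.3f}")
print(f"shared-calibration diff:   {shared_diff:+.3f}")
print(f"per-model calibration diff:{own_diff:+.3f}")
print(f"delta J:                   {delta_j:+.3f}")
```

Under these hypothetical numbers, the shared-calibration difference comes out negative even though model B is truly better, while per-model calibration recovers the true difference. A nonzero ΔJ is what separates the two cases in this sketch.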
