Efficient Inference for Noisy LLM-as-a-Judge Evaluation

Yiqun T Chen; Sizhu Lu; Sijia Li; Moran Guo; Shengyi Li

arXiv:2601.05420·cs.LG·January 12, 2026

Open Access

TL;DR

This paper analyzes methods to improve the accuracy of LLM-based evaluations by comparing measurement-error correction and prediction-powered inference, providing theoretical insights, simulations, and real-data demonstrations.

Contribution

It unifies and compares two correction approaches using semiparametric efficiency theory, identifying conditions where PPI outperforms measurement-error correction.

Findings

01

PPI-style estimators can have lower asymptotic variance than measurement-error corrections.

02

The paper derives efficient estimators theoretically, using influence functions.

03

The methods are validated through simulations and real-data examples.

Abstract

Large language models (LLMs) are increasingly used as automatic evaluators of generative AI outputs, a paradigm often referred to as "LLM-as-a-judge." In practice, LLM judges are imperfect predictions for the underlying truth and can exhibit systematic, non-random errors. Two main approaches have recently been proposed to address this issue: (i) direct measurement-error correction based on misclassification models such as Rogan-Gladen-style estimators, and (ii) surrogate-outcome approaches such as prediction-powered inference (PPI), which correct bias by calibrating prediction residuals on a small set of gold-standard human labels. In this paper, we systematically study the performance of these two approaches for estimating mean parameters (e.g., average benchmark scores or pairwise win rates). Leveraging tools from semiparametric efficiency theory, we unify the two classes of estimators…
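To make the two approaches named in the abstract concrete, here is a minimal sketch of their textbook forms for a binary outcome (for example, "judge says the answer is correct"). It is not taken from the paper, whose estimators may differ in detail. The function names (`estimate_sens_spec`, `rogan_gladen`, `ppi_mean`) and the data layout are assumptions made for illustration: a small labeled set with both human and judge labels, and a large set with judge labels only.

```python
def estimate_sens_spec(human_labeled, judge_labeled):
    """Estimate the judge's sensitivity and specificity from the labeled set.

    Hypothetical helper: assumes 0/1 labels and at least one human-labeled
    positive and one human-labeled negative.
    """
    on_pos = [j for h, j in zip(human_labeled, judge_labeled) if h == 1]
    on_neg = [j for h, j in zip(human_labeled, judge_labeled) if h == 0]
    sensitivity = sum(on_pos) / len(on_pos)        # P(judge = 1 | truth = 1)
    specificity = 1.0 - sum(on_neg) / len(on_neg)  # P(judge = 0 | truth = 0)
    return sensitivity, specificity


def rogan_gladen(judge_unlabeled, sensitivity, specificity):
    """Classical Rogan-Gladen correction of the judge's raw positive rate."""
    p_hat = sum(judge_unlabeled) / len(judge_unlabeled)
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("judge must be better than chance (sens + spec > 1)")
    theta = (p_hat + specificity - 1.0) / denom
    # Clip to [0, 1], since the raw correction can fall outside the unit interval.
    return min(1.0, max(0.0, theta))


def ppi_mean(judge_unlabeled, human_labeled, judge_labeled):
    """Basic PPI-style mean: judge mean plus the average labeled residual."""
    judge_mean = sum(judge_unlabeled) / len(judge_unlabeled)
    residuals = [h - j for h, j in zip(human_labeled, judge_labeled)]
    rectifier = sum(residuals) / len(residuals)
    return judge_mean + rectifier
```

A usage sketch on toy data (the numbers are made up): with `human = [1, 1, 1, 1, 0, 0, 0, 0]`, `judge_l = [1, 1, 1, 0, 0, 0, 0, 1]`, and a judge-only set `judge_u` of six ones and four zeros, call `estimate_sens_spec(human, judge_l)` and pass the result to `rogan_gladen(judge_u, sens, spec)`, then compare with `ppi_mean(judge_u, human, judge_l)`. Both estimators draw on the same small gold-standard set but use it differently: Rogan-Gladen inverts an estimated misclassification model, while the PPI-style estimator adds the average human-minus-judge residual to the judge's mean.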

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Ethics and Social Impacts of AI