Efficient Inference for Noisy LLM-as-a-Judge Evaluation
Yiqun T Chen, Sizhu Lu, Sijia Li, Moran Guo, Shengyi Li

TL;DR
This paper compares two ways to correct for errors in LLM-as-a-judge evaluations, measurement-error correction and prediction-powered inference (PPI), through theoretical analysis, simulations, and real-data demonstrations.
Contribution
It unifies and compares two correction approaches using semiparametric efficiency theory, identifying conditions where PPI outperforms measurement-error correction.
Findings
PPI-style estimators can have lower asymptotic variance than measurement-error corrections.
Efficient estimators are derived theoretically using influence functions.
The methods are validated through simulations and real-data examples.
Abstract
Large language models (LLMs) are increasingly used as automatic evaluators of generative AI outputs, a paradigm often referred to as "LLM-as-a-judge." In practice, LLM judges are imperfect predictions for the underlying truth and can exhibit systematic, non-random errors. Two main approaches have recently been proposed to address this issue: (i) direct measurement-error correction based on misclassification models such as Rogan-Gladen-style estimators, and (ii) surrogate-outcome approaches such as prediction-powered inference (PPI), which correct bias by calibrating prediction residuals on a small set of gold-standard human labels. In this paper, we systematically study the performance of these two approaches for estimating mean parameters (e.g., average benchmark scores or pairwise win rates). Leveraging tools from semiparametric efficiency theory, we unify the two classes of estimators…
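To make the two approaches named in the abstract concrete, here is a minimal Python sketch for a binary outcome (for example, "the response is correct"). It uses the standard textbook forms of the estimators, not necessarily the exact versions analyzed in the paper: the Rogan-Gladen estimate is (p + specificity - 1) / (sensitivity + specificity - 1), where p is the judge's positive rate on the unlabeled pool, and the basic PPI estimate is the judge's mean on the unlabeled pool plus the mean residual (human minus judge) on the labeled set. The function names, argument names, and toy arrays are illustrative assumptions and do not come from the paper.

```python
import numpy as np


def rogan_gladen(judge_unlabeled, judge_labeled, human_labeled):
    """Rogan-Gladen-style corrected mean for a binary outcome (textbook form).

    Sensitivity and specificity of the judge are estimated on the small
    human-labeled set, then used to correct the judge's positive rate on
    the large unlabeled pool. All names here are illustrative.
    """
    judge_unlabeled = np.asarray(judge_unlabeled, dtype=float)
    judge_labeled = np.asarray(judge_labeled, dtype=float)
    human_labeled = np.asarray(human_labeled, dtype=float)

    pos = human_labeled == 1
    neg = human_labeled == 0
    if not pos.any() or not neg.any():
        raise ValueError("labeled set must contain both classes")

    sens = judge_labeled[pos].mean()        # P(judge = 1 | truth = 1)
    spec = 1.0 - judge_labeled[neg].mean()  # P(judge = 0 | truth = 0)
    denom = sens + spec - 1.0
    if abs(denom) < 1e-12:
        raise ValueError("judge is uninformative: sens + spec - 1 is zero")

    p_hat = judge_unlabeled.mean()          # naive judge positive rate
    # Clip because the raw ratio can fall outside [0, 1] in small samples.
    return float(np.clip((p_hat + spec - 1.0) / denom, 0.0, 1.0))


def ppi_mean(judge_unlabeled, judge_labeled, human_labeled):
    """Basic PPI mean: judge mean on the pool plus a residual correction."""
    judge_unlabeled = np.asarray(judge_unlabeled, dtype=float)
    judge_labeled = np.asarray(judge_labeled, dtype=float)
    human_labeled = np.asarray(human_labeled, dtype=float)
    rectifier = (human_labeled - judge_labeled).mean()
    return float(judge_unlabeled.mean() + rectifier)


# Toy data (made up for illustration only).
pool_judge = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]   # judge labels, no human labels
gold_human = [1, 1, 1, 1, 0, 0, 0, 0]         # human labels on a small set
gold_judge = [1, 1, 1, 0, 0, 0, 0, 1]         # judge labels on the same set

rg_estimate = rogan_gladen(pool_judge, gold_judge, gold_human)
ppi_estimate = ppi_mean(pool_judge, gold_judge, gold_human)
print("Rogan-Gladen:", rg_estimate)
print("PPI:", ppi_estimate)
```

The two estimators use the labeled set differently: Rogan-Gladen inverts an estimated misclassification model and becomes unstable when sensitivity plus specificity is close to 1, while PPI adds an average residual and needs no class-conditional error rates. This sketch returns point estimates only; it does not compute the variances or confidence intervals that the paper's efficiency comparison concerns.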
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Ethics and Social Impacts of AI
