Noisy but Valid: Robust Statistical Evaluation of LLMs with Imperfect Judges

Chen Feng; Minghe Shen; Ananth Balashankar; Carsten Gerner-Beuerle; Miguel R. D. Rodrigues

arXiv:2601.20913·cs.LG·January 30, 2026

Open Access · 3 Reviews

TL;DR

This paper develops a statistically valid framework for evaluating large language models using imperfect judges, ensuring error control and quantifying the impact of judge noise on evaluation power.

Contribution

It introduces a variance-corrected hypothesis testing method with theoretical guarantees, validated empirically, and analyzes the performance gap due to judge imperfections.

Findings

01

Finite-sample Type-I error control is guaranteed despite judge noise.

02

Practical methods have significantly lower power than the ideal Oracle.

03

Evaluation power depends on judge quality, dataset size, and certification thresholds.

Abstract

Reliable certification of Large Language Models (LLMs), that is, verifying that failure rates are below a safety threshold, is critical yet challenging. While "LLM-as-a-Judge" offers scalability, judge imperfections, noise, and bias can invalidate statistical guarantees. We introduce a "Noisy but Valid" hypothesis testing framework to address this. By leveraging a small human-labelled calibration set to estimate the judge's True Positive and False Positive Rates (TPR/FPR), we derive a variance-corrected critical threshold applied to a large judge-labelled dataset. Crucially, our framework theoretically guarantees finite-sample Type-I error control (validity) despite calibration uncertainty. This distinguishes our work from Prediction-Powered Inference (PPI), positioning our method as a diagnostic tool that explicitly models judge behavior rather than a black-box estimator. Our contributions…
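The calibrate-then-test idea in the abstract can be sketched roughly as follows. This is not the paper's Algorithm 1: the exact critical value and its finite-sample correction are not given in this summary, so the sketch assumes a plain delta-method normal approximation, which carries no finite-sample guarantee. The function names, the label convention (1 = failure), and the null hypothesis "true failure rate is at least `p0`" are all illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def estimate_judge_rates(human_labels, judge_labels):
    """Estimate the judge's TPR and FPR on a human-labelled calibration set.

    Labels are 1 for "failure" and 0 for "no failure" (assumed convention).
    """
    pos = [j for h, j in zip(human_labels, judge_labels) if h == 1]
    neg = [j for h, j in zip(human_labels, judge_labels) if h == 0]
    if not pos or not neg:
        raise ValueError("calibration set needs both failures and non-failures")
    return sum(pos) / len(pos), sum(neg) / len(neg), len(pos), len(neg)


def noisy_certification_test(judge_flags, tpr, fpr, n_pos, n_neg, p0, level=0.05):
    """Test H0: true failure rate >= p0 using only judge labels.

    If the true failure rate is p, the judge flags a fraction
    q = fpr + (tpr - fpr) * p of outputs. We reject H0 (certify) when the
    observed flag rate falls below the flag rate implied by p0, minus a
    margin covering sampling noise and TPR/FPR estimation noise.
    """
    n = len(judge_flags)
    q_hat = sum(judge_flags) / n
    q0 = fpr + (tpr - fpr) * p0  # flag rate implied by the boundary of H0
    variance = (
        q0 * (1 - q0) / n                        # noise in the judge-labelled set
        + p0 ** 2 * tpr * (1 - tpr) / n_pos      # uncertainty in estimated TPR
        + (1 - p0) ** 2 * fpr * (1 - fpr) / n_neg  # uncertainty in estimated FPR
    )
    z = NormalDist().inv_cdf(1 - level)
    critical = q0 - z * sqrt(variance)
    return q_hat < critical, q_hat, critical


# Toy data: 100 calibration items, 10,000 judge-labelled items.
human = [1] * 20 + [0] * 80
judge = [1] * 18 + [0] * 2 + [1] * 8 + [0] * 72
tpr, fpr, n_pos, n_neg = estimate_judge_rates(human, judge)
flags = [1] * 1100 + [0] * 8900
certified, q_hat, critical = noisy_certification_test(
    flags, tpr, fpr, n_pos, n_neg, p0=0.10
)
print(certified, round(q_hat, 4), round(critical, 4))
```

In this toy setup the two calibration terms dominate the variance, which illustrates why a small human-labelled set limits power even when the judge-labelled set is large.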

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 2 · Confidence 3

Strengths

- The paper addresses an important challenge in evaluating large models, leveraging statistical frameworks such as hypothesis testing.
- The method is compared against a prediction-powered inference approach, and the paper notes that PPI often outperforms both oracle-noisy upper bounds. These are good findings.

Weaknesses

- I believe that relying on a small dataset to calibrate the automatic judges will depend a lot on the task, model, and the quality of the collected data.
- This paper is very challenging to read. I miss very basic motivations and illustrations of the core ideas of the work. For example, none of the figures make the paper accessible.
- The theoretical insights should have made the paper's contributions relevant. On the contrary, these insights are full of jargon and formulations that are not

Reviewer 02 · Rating 8 · Confidence 3

Strengths

1. This paper tackles a highly relevant problem of determining the reliability of LLMs. Benchmarks do not reveal the true capabilities of an LLM and the bias of an LLM-as-a-Judge also does not provide reliable insights. I thus feel using such a hypothesis testing framework is very useful and the need of the hour where there are so many options for which LLM can be used.
2. The approach is grounded in reality. It uses a mix of small-scale human data (which is expensive to get) and large-scale LLM

Weaknesses

1. I think discussion around the takeaway of the procedure could be clearer. The findings highlight that noisy hypothesis testing outperforms direct hypothesis testing in certain regimes where the TPR is higher and FPR is lower; what does this mean for the takeaway? An overview of the scenarios in which the procedure shines would be very helpful. E.g., the oracle also outperforms all, but how realistic is this oracle setting? It would be nice to incorporate this explicitly.
2. Related to the pre

Reviewer 03 · Rating 8 · Confidence 3

Strengths

1. Algorithm 1 includes an explicit critical value with variance terms from judge-parameter estimation; type-I control and type-II bounds are proved (Berry–Esseen based).
2. Experiments cover both classification and generative settings with multiple judges; qualitative alignment with theory increases confidence in the claims.

Weaknesses

1. Critical values use normal/Berry–Esseen approximations. There’s no comparison to Wilson/Clopper-Pearson style bounds when $n_M$ is small or failures are rare, precisely when certification is most needed.
2. The judge prompts and aggregation choices can materially change TPR/FPR. The paper doesn’t investigate prompt variants, majority-vote vs. single-judge sensitivity, or robustness to minor instruction changes, despite known judge prompt sensitivity.
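For readers unfamiliar with the Wilson and Clopper-Pearson bounds named in Reviewer 03's first weakness, the sketch below computes one-sided upper confidence bounds on a failure rate from `k` observed failures in `n` trials. It uses only the Python standard library and is meant for small `n`. It is a generic illustration of those two bounds, not something taken from the paper or the review, and it ignores judge noise entirely.

```python
from math import comb, sqrt
from statistics import NormalDist


def wilson_upper(k, n, level=0.05):
    """One-sided Wilson score upper bound on a binomial proportion."""
    z = NormalDist().inv_cdf(1 - level)
    p_hat = k / n
    denom = 1 + z * z / n
    centre = p_hat + z * z / (2 * n)
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return (centre + half) / denom


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); direct sum, fine for small n."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson_upper(k, n, level=0.05):
    """One-sided exact upper bound: the p at which P(X <= k) equals `level`.

    Found by bisection, since the binomial CDF is decreasing in p.
    """
    if k == n:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > level:
            lo = mid  # mid is still plausible, so the bound lies above it
        else:
            hi = mid
    return hi


# Rare-failure regime: zero failures observed in 50 human-labelled samples.
print(wilson_upper(0, 50), clopper_pearson_upper(0, 50))
```

With zero observed failures a plain normal (Wald) interval collapses to zero width, whereas both bounds above stay strictly positive, which is the regime the reviewer highlights.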

Taxonomy

Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Natural Language Processing Techniques