Noisy but Valid: Robust Statistical Evaluation of LLMs with Imperfect Judges
Chen Feng, Minghe Shen, Ananth Balashankar, Carsten Gerner-Beuerle, Miguel R. D. Rodrigues

TL;DR
This paper develops a statistically valid framework for evaluating large language models using imperfect judges, ensuring error control and quantifying the impact of judge noise on evaluation power.
Contribution
It introduces a variance-corrected hypothesis testing method with theoretical guarantees, validated empirically, and analyzes the performance gap due to judge imperfections.
Findings
Finite-sample Type-I error control is guaranteed despite judge noise.
Practical methods have significantly lower power than the ideal Oracle.
Evaluation power depends on judge quality, dataset size, and certification thresholds.
Abstract
Reliable certification of Large Language Models (LLMs), that is, verifying that failure rates are below a safety threshold, is critical yet challenging. While "LLM-as-a-Judge" offers scalability, judge imperfections, noise, and bias can invalidate statistical guarantees. We introduce a "Noisy but Valid" hypothesis testing framework to address this. By leveraging a small human-labelled calibration set to estimate the judge's True Positive and False Positive Rates (TPR/FPR), we derive a variance-corrected critical threshold applied to a large judge-labelled dataset. Crucially, our framework theoretically guarantees finite-sample Type-I error control (validity) despite calibration uncertainty. This distinguishes our work from Prediction-Powered Inference (PPI), positioning our method as a diagnostic tool that explicitly models judge behavior rather than a black-box estimator. Our contributions…
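The abstract's recipe (estimate the judge's TPR/FPR on a small calibration set, then test a large judge-labelled set against a variance-corrected threshold) can be illustrated with a minimal sketch. This is not the paper's Algorithm 1: it assumes a standard Rogan-Gladen style correction with a delta-method variance and a normal quantile, so it offers only approximate (asymptotic) error control rather than the finite-sample guarantee the paper claims. The function name, argument names, and the example counts are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def noisy_certification_test(k_flagged, n_judged, tp, n_fail, fp, n_ok,
                             threshold, alpha=0.05):
    """One-sided test of H0: true failure rate >= threshold (illustrative).

    k_flagged / n_judged : items the judge flagged as failures, large dataset
    tp / n_fail          : calibration failures the judge also flagged (TPR)
    fp / n_ok            : calibration non-failures the judge flagged (FPR)
    """
    q_hat = k_flagged / n_judged          # judge-observed failure rate
    tpr = tp / n_fail
    fpr = fp / n_ok
    gap = tpr - fpr
    if gap <= 0:
        raise ValueError("judge must beat chance: need TPR > FPR")

    # Invert q = FPR + (TPR - FPR) * p to recover the true failure rate p,
    # clipped to the unit interval.
    p_hat = min(1.0, max(0.0, (q_hat - fpr) / gap))

    # Delta-method variance: judge sampling noise plus calibration noise.
    var_q = q_hat * (1 - q_hat) / n_judged
    var_tpr = tpr * (1 - tpr) / n_fail
    var_fpr = fpr * (1 - fpr) / n_ok
    var_p = (var_q + (1 - p_hat) ** 2 * var_fpr + p_hat ** 2 * var_tpr) / gap ** 2
    std_err = sqrt(var_p)

    # Variance-corrected critical value: certify only if p_hat falls below it.
    z = NormalDist().inv_cdf(1 - alpha)
    critical = threshold - z * std_err
    return {"p_hat": p_hat, "std_err": std_err,
            "critical": critical, "certified": p_hat < critical}


# Hypothetical counts: 10,000 judged items, 500 human-labelled calibration items.
result = noisy_certification_test(k_flagged=600, n_judged=10_000,
                                  tp=90, n_fail=100, fp=20, n_ok=400,
                                  threshold=0.10)
print(result)
```

In this sketch the calibration terms dominate the variance when the calibration set is small, which is one way to see why evaluation power depends on both judge quality and calibration size.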
Peer Reviews
Decision: ICLR 2026 Poster
- The paper addresses an important challenge in evaluating large models, leveraging statistical frameworks such as hypothesis testing.
- The method is compared against a prediction-powered inference approach, and the paper notes that PPI often outperforms both oracle-noisy upper bounds. These are good findings.
- I believe that relying on a small dataset to calibrate the automatic judges will depend a lot on the task, model, and the quality of the collected data.
- This paper is very challenging to read. I miss very basic motivations and illustrations of the core ideas of the work. For example, none of the figures make the paper accessible.
- The theoretical insights should have made the paper's contributions relevant. On the contrary, these insights are full of jargon and formulations that are not
1. This paper tackles a highly relevant problem of determining the reliability of LLMs. Benchmarks do not reveal the true capabilities of an LLM and the bias of an LLM-as-a-Judge also does not provide reliable insights. I thus feel using such a hypothesis testing framework is very useful and the need of the hour where there are so many options for which LLM can be used.
2. The approach is grounded in reality. It uses a mix of small-scale human data (which is expensive to get) and large-scale LLM
1. I think the discussion around the takeaway of the procedure could be clearer. The findings highlight that noisy hypothesis testing outperforms direct hypothesis testing in certain regimes where the TPR is higher and the FPR is lower; what does this mean for the takeaway? An overview of the scenarios in which the procedure shines would be very helpful. For example, the oracle outperforms all other methods, but how realistic is the oracle setting? It would be nice to address this explicitly.
2. Related to the pre
1. Algorithm 1 includes an explicit critical value with variance terms from judge-parameter estimation; type-I control and type-II bounds are proved (Berry–Esseen based).
2. Experiments cover both classification and generative settings with multiple judges; qualitative alignment with theory increases confidence in the claims.
1. Critical values use normal/Berry–Esseen approximations. There’s no comparison to Wilson/Clopper-Pearson style bounds when $n_M$ is small or failures are rare, precisely when certification is most needed.
2. The judge prompts and aggregation choices can materially change TPR/FPR. The paper doesn’t investigate prompt variants, majority-vote vs. single-judge sensitivity, or robustness to minor instruction changes, despite known judge prompt sensitivity.
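For readers unfamiliar with the Wilson and Clopper-Pearson bounds the last review mentions, the following is a minimal sketch of one-sided upper confidence bounds on a failure rate from a directly observed binomial count. It is only a reference point for the rare-failure, small-sample regime the reviewer describes: it assumes noise-free labels and does not incorporate judge TPR/FPR. The function names are hypothetical, and the Clopper-Pearson bound is found by bisection on the binomial CDF rather than through a library call.

```python
from math import exp, lgamma, log, log1p, sqrt
from statistics import NormalDist


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p) with 0 < p < 1, summed in log space."""
    total = 0.0
    for i in range(k + 1):
        log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log(p) + (n - i) * log1p(-p))
        total += exp(log_pmf)
    return min(1.0, total)


def clopper_pearson_upper(k, n, alpha=0.05):
    """Exact one-sided upper bound: the p at which P(X <= k) drops to alpha."""
    if k >= n:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(60):                   # bisection; the CDF decreases in p
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > alpha:
            lo = mid                      # mid is still plausible, move up
        else:
            hi = mid
    return hi                             # conservative end of the bracket


def wilson_upper(k, n, alpha=0.05):
    """One-sided Wilson score upper bound (normal approximation, closed form)."""
    z = NormalDist().inv_cdf(1 - alpha)
    p_hat = k / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre + half


# Zero observed failures in 100 trials: the two bounds differ noticeably.
print(clopper_pearson_upper(0, 100), wilson_upper(0, 100))
```

With zero observed failures the exact bound reduces to 1 - alpha^(1/n), which is larger than the Wilson bound here; a plain Wald interval would collapse to zero width in the same case.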
Taxonomy
Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Natural Language Processing Techniques
