Measuring all the noises of LLM Evals

Sida Wang

arXiv:2512.21326 · cs.LG · March 31, 2026

TL;DR

This paper defines and measures three types of noise in LLM evaluations (prediction noise, data noise, and their combined total noise) and proposes an all-pairs paired analysis method to increase statistical power.

Contribution

The paper defines and quantifies prediction noise, data noise, and total noise in LLM evals, and introduces the all-pairs paired method, which applies paired analysis to every pair of LLMs across many evals and settings.

Findings

01. Total noise levels are characteristic and predictable for each eval.

02. Prediction noise often exceeds data noise, highlighting the benefit of averaging over multiple predictions per question.

03. Measuring all noise types together enhances the accuracy of empirical conclusions.
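The second finding can be illustrated with a short calculation. The sketch below is not taken from the paper: it uses the textbook standard error of a mean difference over independently sampled questions, where averaging `n_samples` independent predictions per question divides the prediction variance by `n_samples` and leaves the data variance unchanged. The function name `se_mean_difference` and the example variances are hypothetical.

```python
import math


def se_mean_difference(data_var, prediction_var, n_questions, n_samples=1):
    """Standard error of the mean paired difference between two models.

    Assumes questions are sampled independently and that the n_samples
    predictions per question are independent draws (an assumption of this
    sketch, not a claim about the paper's exact estimator).
    """
    if n_questions <= 0 or n_samples <= 0:
        raise ValueError("n_questions and n_samples must be positive")
    per_question_var = data_var + prediction_var / n_samples
    return math.sqrt(per_question_var / n_questions)


# Hypothetical variances in which prediction noise exceeds data noise.
data_var, prediction_var, n_questions = 0.05, 0.15, 500

se_single = se_mean_difference(data_var, prediction_var, n_questions, n_samples=1)
se_averaged = se_mean_difference(data_var, prediction_var, n_questions, n_samples=4)

print(f"1 sample per question:  SE = {se_single:.4f}")
print(f"4 samples per question: SE = {se_averaged:.4f}")
```

Under these assumptions, extra samples shrink only the prediction term, so the standard error approaches `sqrt(data_var / n_questions)` as `n_samples` grows. The larger the share of prediction noise, the more averaging helps.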

Abstract

Separating signal from noise is central to experiments. Applying well-established statistical methods effectively to LLM evals requires consideration of their unique noise characteristics. We clearly define and measure three types of noise: prediction noise from generating different answers on a given question, data noise from sampling questions, and their combined total noise following the law of total variance. To emphasize relative comparisons and gain statistical power, we propose the all-pairs paired method, which applies the paired analysis to all pairs of LLMs and measures all the noise components based on millions of question-level predictions across many evals and settings, revealing clear patterns. First, each eval exhibits a characteristic and highly predictable total noise level across all model pairs. Second, paired prediction noise typically exceeds paired data noise,…
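The decomposition described in the abstract can be sketched on synthetic data. The code below is a minimal illustration, not the paper's implementation. It assumes each model has the same number of sampled answers per question, scored 0 or 1, and that sample `k` of model A is paired with sample `k` of model B. The function name `paired_noise_components` and the simulated accuracies are hypothetical. It uses plug-in (population) variances, for which the law of total variance holds exactly on the observed scores.

```python
import numpy as np


def paired_noise_components(a, b):
    """Split the variance of paired score differences into components.

    a, b: arrays of shape (n_questions, n_samples) holding per-sample
    scores for two models on the same questions.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.ndim != 2 or a.shape != b.shape:
        raise ValueError("a and b must be 2-D arrays of the same shape")

    d = a - b                            # paired differences
    per_question_mean = d.mean(axis=1)   # mean difference per question

    # Prediction noise: variance across samples within a question,
    # averaged over questions.
    prediction_var = d.var(axis=1).mean()
    # Data noise: variance across questions of the per-question mean.
    # With few samples this plug-in value still contains some
    # prediction noise.
    data_var = per_question_mean.var()
    # Total noise: variance of all paired differences.
    total_var = d.var()
    return {"prediction": prediction_var, "data": data_var, "total": total_var}


# Synthetic example: hypothetical per-question accuracies for two models.
rng = np.random.default_rng(0)
n_questions, n_samples = 200, 8
p_a = rng.uniform(0.2, 0.9, size=n_questions)
p_b = np.clip(p_a + rng.normal(0.0, 0.1, size=n_questions), 0.0, 1.0)
a = rng.binomial(1, p_a[:, None], size=(n_questions, n_samples))
b = rng.binomial(1, p_b[:, None], size=(n_questions, n_samples))

noise = paired_noise_components(a, b)
print(noise)
```

With equal sample counts per question, `noise["total"]` equals `noise["prediction"] + noise["data"]` up to floating-point error, which is the law of total variance applied to the observed differences. Applying the same function to every pair of models gives an all-pairs view in the spirit of the abstract.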
