Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation

David Heineman; Valentin Hofmann; Ian Magnusson; Yuling Gu; Noah A. Smith; Hannaneh Hajishirzi; Kyle Lo; Jesse Dodge

arXiv:2508.13144·cs.CL·August 19, 2025

TL;DR

This paper introduces metrics for evaluating the reliability of language model benchmarks, emphasizes the importance of a high signal-to-noise ratio, and proposes interventions that improve benchmark quality for better model assessment.

Contribution

The paper defines signal and noise metrics for benchmarks, analyzes their impact on model evaluation reliability, and proposes practical interventions to enhance benchmark quality.

Findings

01. Benchmarks with higher signal-to-noise ratios are more reliable for small-scale decisions.

02. Filtering noisy subtasks improves the overall signal-to-noise ratio in evaluations.

03. Averaging intermediate checkpoints reduces noise and enhances evaluation consistency.
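
The third finding can be illustrated with a small sketch. It assumes that "averaging intermediate checkpoints" means averaging benchmark scores over neighbouring checkpoints (this page does not say whether scores or weights are averaged), and it uses made-up scores. The helper name `rolling_mean` and the window size `k` are hypothetical, not taken from the paper.

```python
from statistics import mean, pstdev


def rolling_mean(scores, k):
    """Average each score with the k-1 checkpoints before it (hypothetical helper)."""
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of checkpoints")
    return [mean(scores[i - k + 1 : i + 1]) for i in range(k - 1, len(scores))]


# Made-up benchmark scores for six consecutive checkpoints of one model.
scores = [0.50, 0.54, 0.50, 0.54, 0.50, 0.54]

# Step-to-step variability before and after averaging pairs of checkpoints.
raw_spread = pstdev(scores)
smoothed = rolling_mean(scores, 2)
smoothed_spread = pstdev(smoothed)

print(f"raw spread: {raw_spread:.4f}, smoothed spread: {smoothed_spread:.4f}")
```

In this toy series the scores oscillate between two values, so averaging adjacent checkpoints removes nearly all of the step-to-step variation; real training curves would show a smaller reduction.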

Abstract

Developing large language models is expensive and involves making decisions with small experiments, typically by evaluating on large, multi-task evaluation suites. In this work, we analyze specific properties which make a benchmark more reliable for such decisions, and interventions to design higher-quality evaluation benchmarks. We introduce two key metrics that show differences in current benchmarks: signal, a benchmark's ability to separate better models from worse models, and noise, a benchmark's sensitivity to random variability between training steps. We demonstrate that benchmarks with a better signal-to-noise ratio are more reliable when making decisions at small scale, and those with less noise have lower scaling law prediction error. These results suggest that improving signal or noise will lead to more useful benchmarks, so we introduce three interventions designed to…
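
The abstract describes signal and noise only in words, so the following is a minimal sketch of one way the two quantities might be computed, not the paper's own formulas. It assumes signal is the relative spread of final scores across models and noise is the relative standard deviation of one model's scores over its last few checkpoints. All function names and numbers are hypothetical.

```python
from statistics import mean, pstdev


def signal(final_scores):
    """Relative spread of final scores across models (assumed definition)."""
    return (max(final_scores) - min(final_scores)) / mean(final_scores)


def noise(checkpoint_scores):
    """Relative std of one model's scores over recent checkpoints (assumed definition)."""
    return pstdev(checkpoint_scores) / mean(checkpoint_scores)


def signal_to_noise(final_scores, checkpoint_scores):
    """Ratio of the two quantities above; higher suggests a more decisive benchmark."""
    return signal(final_scores) / noise(checkpoint_scores)


# Made-up data: final scores of three models, and the last three
# checkpoint scores of a single model on the same benchmark.
model_scores = [0.40, 0.50, 0.60]
last_checkpoints = [0.58, 0.60, 0.62]

snr = signal_to_noise(model_scores, last_checkpoints)
print(f"signal={signal(model_scores):.3f} noise={noise(last_checkpoints):.4f} SNR={snr:.1f}")
```

Under these assumptions, a benchmark on which models differ widely but each model's score barely moves between training steps would receive a high ratio.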

Videos

Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques