Hidden Measurement Error in LLM Pipelines Distorts Annotation, Evaluation, and Benchmarking
Solomon Messing

TL;DR
This paper shows that standard confidence intervals in LLM evaluation understate true variability, which makes benchmarks unreliable, and proposes methods that account for this hidden measurement error.
Contribution
The paper decomposes LLM pipeline uncertainty into its sources, distinguishes variance that shrinks with more data from sensitivity to researcher design choices, and uses design-study projections to reduce total evaluation error (TEE) and improve benchmarking reliability.
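As a rough illustration of why the two kinds of variance behave differently, consider a simplified two-component model. The model and the symbols $\mu$, $\delta_k$, $\varepsilon_{ik}$, $\sigma_\delta^2$, $\sigma_\varepsilon^2$, $n$, and $K$ are assumptions made here for exposition and are not the paper's notation. Item $i$ scored under design configuration $k$ (a particular judge, temperature, and prompt) has a configuration-level shift plus item-level noise, and the estimate averages $n$ items in each of $K$ independently drawn configurations:

```latex
\begin{align}
  y_{ik} &= \mu + \delta_k + \varepsilon_{ik},
  \qquad \delta_k \sim (0, \sigma_\delta^2),
  \quad \varepsilon_{ik} \sim (0, \sigma_\varepsilon^2), \\
  \bar{y} &= \frac{1}{nK} \sum_{k=1}^{K} \sum_{i=1}^{n} y_{ik}, \\
  \operatorname{Var}(\bar{y}) &= \frac{\sigma_\delta^2}{K}
    + \frac{\sigma_\varepsilon^2}{nK}.
\end{align}
```

Under these assumptions, a naive standard error computed from a single configuration ($K = 1$) estimates only $\sqrt{\sigma_\varepsilon^2 / n}$. That term goes to zero as $n$ grows, while the design term $\sigma_\delta^2 / K$ does not shrink unless more configurations are averaged over.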
Findings
Naive standard errors are 40-60% smaller than TEE-corrected standard errors.
Coverage of TEE-corrected intervals remains at 95% as sample size grows, whereas naive CIs under-cover and the shortfall worsens with more data.
Design-guided pipelines reduce benchmark gaming and improve estimation accuracy.
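The coverage pattern in these findings can be reproduced in a toy simulation. The sketch below is not the paper's estimator: it assumes each evaluation run draws one random design shift (standing in for judge, temperature, and prompt choice) plus independent item noise, and the "corrected" interval simply adds a design variance that is assumed known. The function name `simulate_coverage` and the parameters `sigma_design` and `sigma_item` are hypothetical.

```python
import numpy as np


def simulate_coverage(n_items, n_reps=1000, sigma_design=0.05,
                      sigma_item=0.5, true_mean=0.6, seed=0):
    """Return (naive, corrected) empirical coverage of nominal 95% intervals.

    Assumed toy model: score = true_mean + design shift + item noise,
    with one design shift drawn per simulated evaluation run.
    """
    rng = np.random.default_rng(seed)
    z = 1.96  # normal critical value for a 95% interval

    # One design shift per replication, shared by all items in that run.
    delta = rng.normal(0.0, sigma_design, size=(n_reps, 1))
    noise = rng.normal(0.0, sigma_item, size=(n_reps, n_items))
    scores = true_mean + delta + noise

    est = scores.mean(axis=1)
    # Naive SE: treats item noise as the only source of variability.
    naive_se = scores.std(axis=1, ddof=1) / np.sqrt(n_items)
    # Corrected SE: adds the design variance (assumed known in this sketch).
    corrected_se = np.sqrt(naive_se ** 2 + sigma_design ** 2)

    err = np.abs(est - true_mean)
    naive_cov = float((err <= z * naive_se).mean())
    corrected_cov = float((err <= z * corrected_se).mean())
    return naive_cov, corrected_cov


if __name__ == "__main__":
    for n in (100, 500, 2000):
        naive, corrected = simulate_coverage(n)
        print(f"n={n:5d}  naive={naive:.3f}  corrected={corrected:.3f}")
```

In this setup the naive interval narrows as `n_items` grows while the design shift stays the same size, so naive coverage falls with more data and the corrected interval stays close to its nominal level.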
Abstract
LLM evaluations drive which models get deployed, what safety standards get adopted, which research conclusions get published, and how projections of AI's labor-market impact get made. Yet standard confidence intervals ignore variability from judge model choice, model temperature, and prompt phrasing, producing under-coverage that worsens with more data. The omitted variance can shift results enough to reverse conclusions \citep{baumann2025llmhacking, huang2026dropping}; pipelines that fail to average over it leave the surface that "benchmark hacking" exploits \citep{singh2025leaderboard}. This paper decomposes LLM pipeline uncertainty into its sources, distinguishes variance that shrinks with more data from sensitivity to researcher design choices, and uses design-study projections to reduce total evaluation error (TEE). Across the demonstrations, naive standard errors are 40-60%…
