Towards Reliable LLM Evaluation: Correcting the Winner's Curse in Adaptive Benchmarking

Yang Xu; Jiefu Zhang; Haixiang Sun; Zihan Zhou; Tianyu Cao; Vaneet Aggarwal

arXiv:2605.05973·stat.ML·May 8, 2026


TL;DR

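This paper introduces SIREN, an evaluation protocol for large language models that corrects the optimistic bias (the winner's curse) that arises when benchmark items are reused during adaptive prompt or program search. It reports procedure-level performance estimates with uncertainty under explicit tuning budgets.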

Contribution

The paper proposes SIREN, a selection-aware reporting protocol that improves the reliability of LLM evaluation by separating selection from evaluation and quantifying uncertainty.

Findings

01. SIREN provides valid confidence intervals for procedure performance.

02. Winner-based reporting can be overly optimistic and misleading.

03. SIREN closely matches finite-sample reporting targets in experiments.
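
The optimism of winner-based reporting can be illustrated with a small simulation. This is a minimal sketch, not the paper's experiment: the function name `winner_gap`, the number of candidates, the item count, and the shared true accuracy of 0.7 are all assumptions chosen for illustration. Every candidate has the same true accuracy, so any gap between the winner's score on reused items and its score on fresh items comes purely from selection.

```python
import random


def winner_gap(n_candidates=20, n_items=200, true_acc=0.7, n_trials=300, seed=0):
    """Return (mean winner score on reused items, mean winner score on fresh items).

    All names and default values are hypothetical, chosen only for illustration.
    """
    rng = random.Random(seed)
    reused_total, fresh_total = 0.0, 0.0
    for _ in range(n_trials):
        # Every candidate shares the same true accuracy, so the "winner" is pure noise.
        scores = [
            sum(rng.random() < true_acc for _ in range(n_items)) / n_items
            for _ in range(n_candidates)
        ]
        # Winner-based reporting: quote the best score on the items used for selection.
        reused_total += max(scores)
        # Fresh-data check: re-score the winner on new items from the same distribution.
        fresh_total += sum(rng.random() < true_acc for _ in range(n_items)) / n_items
    return reused_total / n_trials, fresh_total / n_trials


reused, fresh = winner_gap()
print(f"winner on reused items: {reused:.3f}, winner on fresh items: {fresh:.3f}")
```

Under these assumptions the reused-item score sits above the fresh-item score even though no candidate is truly better, which is the kind of gap a selection-aware protocol is meant to remove.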

Abstract

Adaptive prompt and program search makes LLM evaluation selection-sensitive. Once benchmark items are reused inside tuning, the observed winner's score need not estimate the fresh-data performance of the full tune-then-deploy procedure. We study inference for this procedure-level target under explicit tuning budgets. We propose SIREN, a selection-aware repeated-split reporting protocol that freezes the post-search shortlist, separates splitwise selection from held-out evaluation, and uses an item-level Gaussian multiplier bootstrap for uncertainty quantification. In a fixed-shortlist regime with smooth stabilized selection, the estimator admits a first-order item-level representation, and the bootstrap yields valid simultaneous inference on a finite budget grid. This supports confidence intervals for procedure-performance curves and pre-specified equal-budget and cross-budget…
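
The repeated-split idea in the abstract can be sketched in a few lines. This is a heavily simplified illustration under stated assumptions, not the authors' estimator: it uses a hard argmax for selection rather than the smooth stabilized selection the abstract mentions, it treats the per-item held-out averages as the item-level terms, and it produces a single interval rather than simultaneous inference over a budget grid. The function name `repeated_split_ci` and the `scores[k][i]` layout are hypothetical.

```python
import random
import statistics


def repeated_split_ci(scores, n_splits=20, n_boot=500, alpha=0.05, seed=0):
    """scores[k][i] is the score of frozen shortlist candidate k on item i.

    Returns (estimate, (lower, upper)). Hypothetical helper for illustration.
    """
    rng = random.Random(seed)
    n_items = len(scores[0])
    held_sum = [0.0] * n_items
    held_cnt = [0] * n_items
    for _ in range(n_splits):
        idx = list(range(n_items))
        rng.shuffle(idx)
        select, evaluate = idx[: n_items // 2], idx[n_items // 2:]
        # Selection sees only the selection half (hard argmax is an assumption here).
        best = max(range(len(scores)), key=lambda k: sum(scores[k][i] for i in select))
        # The selected candidate is scored only on the held-out half.
        for i in evaluate:
            held_sum[i] += scores[best][i]
            held_cnt[i] += 1
    # Item-level terms: each item's average held-out score across splits.
    psi = [s / c for s, c in zip(held_sum, held_cnt) if c > 0]
    estimate = statistics.fmean(psi)
    centered = [p - estimate for p in psi]
    # Gaussian multiplier bootstrap: reweight centered item terms with N(0, 1) draws.
    draws = sorted(
        abs(sum(rng.gauss(0.0, 1.0) * c for c in centered)) / len(psi)
        for _ in range(n_boot)
    )
    half_width = draws[min(n_boot - 1, int((1 - alpha) * n_boot))]
    return estimate, (estimate - half_width, estimate + half_width)


# Synthetic shortlist: 5 candidates, 200 binary item scores each (made-up data).
demo_rng = random.Random(1)
demo = [[float(demo_rng.random() < 0.7) for _ in range(200)] for _ in range(5)]
est, (lo, hi) = repeated_split_ci(demo)
print(f"estimate {est:.3f}, interval ({lo:.3f}, {hi:.3f})")
```

The design point the sketch tries to convey is that no item contributes to both choosing and scoring a candidate within the same split, and that uncertainty is resampled at the item level rather than by rerunning the search.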
