Towards Reliable LLM Evaluation: Correcting the Winner's Curse in Adaptive Benchmarking
Yang Xu, Jiefu Zhang, Haixiang Sun, Zihan Zhou, Tianyu Cao, Vaneet Aggarwal

TL;DR
This paper introduces SIREN, an evaluation protocol for large language models that corrects the selection bias that arises when benchmark items are reused during adaptive prompt and program search, giving more reliable performance estimates under explicit tuning budgets.
Contribution
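The paper proposes SIREN, a selection-aware repeated-split reporting protocol that makes LLM evaluation more reliable by freezing the post-search shortlist, separating splitwise selection from held-out evaluation, and quantifying uncertainty with an item-level Gaussian multiplier bootstrap.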
Findings
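In a fixed-shortlist regime with smooth stabilized selection, SIREN's bootstrap yields valid simultaneous confidence intervals for procedure performance on a finite budget grid.
Winner-based reporting can be overly optimistic: once benchmark items are reused in tuning, the observed winner's score need not estimate fresh-data performance.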
SIREN closely matches finite-sample reporting targets in experiments.
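The optimism of winner-based reporting can be illustrated with a small simulation. This is a minimal sketch, not the paper's experiment: the function name `winners_curse_gap` and all parameters (20 candidates, 200 items, a common true accuracy of 0.7) are assumptions chosen for illustration. Every candidate is equally good, so any gap between the winner's score on the reused items and its score on fresh items comes from selection alone.

```python
import numpy as np

def winners_curse_gap(n_candidates=20, n_items=200, true_acc=0.7,
                      n_trials=500, seed=0):
    """Average winner score on reused items vs. on fresh items.

    Hypothetical setup: all candidates share the same true accuracy,
    so the reused-item score of the winner is inflated only by selection.
    """
    rng = np.random.default_rng(seed)
    reused, fresh = [], []
    for _ in range(n_trials):
        # Per-item correctness (True/False) of each candidate on the tuning items.
        tune = rng.random((n_candidates, n_items)) < true_acc
        # Pick the candidate with the highest score on those same items.
        winner = int(np.argmax(tune.mean(axis=1)))
        reused.append(tune[winner].mean())
        # Score the chosen candidate on items it was not selected on.
        fresh_items = rng.random(n_items) < true_acc
        fresh.append(fresh_items.mean())
    return float(np.mean(reused)), float(np.mean(fresh))

reused_score, fresh_score = winners_curse_gap()
print(f"winner on reused items: {reused_score:.3f}")
print(f"winner on fresh items:  {fresh_score:.3f}")
```

In this toy setting the reused-item score sits above the true accuracy while the fresh-item score does not, which is the gap a selection-aware protocol is meant to account for.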
Abstract
Adaptive prompt and program search makes LLM evaluation selection-sensitive. Once benchmark items are reused inside tuning, the observed winner's score need not estimate the fresh-data performance of the full tune-then-deploy procedure. We study inference for this procedure-level target under explicit tuning budgets. We propose SIREN, a selection-aware repeated-split reporting protocol that freezes the post-search shortlist, separates splitwise selection from held-out evaluation, and uses an item-level Gaussian multiplier bootstrap for uncertainty quantification. In a fixed-shortlist regime with smooth stabilized selection, the estimator admits a first-order item-level representation, and the bootstrap yields valid simultaneous inference on a finite budget grid. This supports confidence intervals for procedure-performance curves and pre-specified equal-budget and cross-budget…
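The protocol described in the abstract can be sketched roughly as follows. This is an illustrative approximation under stated assumptions, not the authors' implementation: the function name `siren_like_estimate`, the softmax temperature `tau` used as a stand-in for "smooth stabilized selection", the half-and-half splits, and the use of held-out contributions only as the item-level terms are all assumptions. It handles a single budget and a single target, whereas the paper describes simultaneous inference over a budget grid.

```python
import numpy as np

def siren_like_estimate(scores, n_splits=50, tau=0.05, n_boot=2000,
                        alpha=0.05, seed=0):
    """Repeated-split estimate with a Gaussian multiplier bootstrap interval.

    scores: (K, n) array of per-item scores for a frozen shortlist of
    K candidates. All other parameters are hypothetical defaults.
    """
    rng = np.random.default_rng(seed)
    K, n = scores.shape
    contrib_sum = np.zeros(n)  # held-out contribution accumulated per item
    contrib_cnt = np.zeros(n)  # number of splits in which the item was held out
    for _ in range(n_splits):
        perm = rng.permutation(n)
        sel, held = perm[: n // 2], perm[n // 2:]
        # Smooth selection (assumed form): softmax over selection-half means.
        m = scores[:, sel].mean(axis=1)
        w = np.exp((m - m.max()) / tau)
        w /= w.sum()
        # Evaluate the selection-weighted candidate on held-out items only.
        contrib_sum[held] += w @ scores[:, held]
        contrib_cnt[held] += 1
    seen = contrib_cnt > 0
    psi = contrib_sum[seen] / contrib_cnt[seen]  # item-level terms
    est = psi.mean()
    # Gaussian multiplier bootstrap: perturb centered item-level terms.
    xi = rng.standard_normal((n_boot, psi.size))
    draws = xi @ (psi - est) / psi.size
    half = np.quantile(np.abs(draws), 1 - alpha)
    return float(est), (float(est - half), float(est + half))

# Usage on synthetic 0/1 scores for a hypothetical 5-candidate shortlist.
rng = np.random.default_rng(1)
acc = np.array([0.60, 0.65, 0.70, 0.72, 0.75])
demo_scores = (rng.random((5, 400)) < acc[:, None]).astype(float)
est, (lo, hi) = siren_like_estimate(demo_scores)
print(f"estimate {est:.3f}, interval [{lo:.3f}, {hi:.3f}]")
```

Because selection and evaluation never share items within a split, the reported number targets the performance of the select-then-evaluate procedure rather than the score of whichever candidate happened to win on reused items. The sketch ignores each item's influence on the selection weights, which a full first-order representation would need to include.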
