Valid Best-Model Identification for LLM Evaluation via Low-Rank Factorization
Elad Tolochinsky, Yaniv Tenzer, Yaniv Romano

TL;DR
This paper introduces a statistically valid framework that combines multi-armed bandit algorithms with low-rank score predictions to identify the best large language model on a fixed benchmark using fewer evaluations.
Contribution
It develops a doubly robust estimator that integrates low-rank predictions into bandit algorithms, ensuring valid confidence intervals and reducing evaluation costs.
Findings
Reduces the number of LLM evaluations needed to identify the best model.
Provides statistically valid confidence intervals despite using biased score predictions.
Achieves cost savings while maintaining accurate model selection.
Abstract
Selecting the best large language model (LLM) for a fixed benchmark is often expensive, since exhaustive evaluation requires running every model on every example. Multi-armed bandit (MAB) algorithms can reduce the number of LLM calls by sequentially selecting the next model-example pair to evaluate, thereby avoiding wasted evaluations on clearly underperforming models. Further savings can be achieved by predicting model scores from the partially observed model-example score matrix using low-rank factorization. However, such predictions are not ground truth: they can be biased and may therefore lead to incorrect identification of the best model. In this work, we propose a principled framework that combines MAB with cheap predicted scores without compromising statistical validity. Specifically, we derive doubly robust estimators of each model's performance that use the low-rank…
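To illustrate how biased predictions can be corrected with a small number of true evaluations, here is a minimal Python sketch of a generic doubly robust (inverse-propensity-corrected) mean estimate for a single model. It is not the authors' estimator. The function name `doubly_robust_mean`, the known per-example sampling probability `prop`, and the fixed-sample normal-approximation interval are all assumptions made for illustration. The paper's setting uses adaptive bandit sampling, which calls for intervals built for that setting.

```python
import numpy as np

def doubly_robust_mean(pred, obs, observed, prop):
    """Generic doubly robust estimate of a model's mean score (illustrative only).

    pred     : predicted score for every example (e.g. from a low-rank fit)
    obs      : true scores; values at unobserved positions are ignored
    observed : boolean mask, True where the model was actually evaluated
    prop     : probability with which each example was sampled (assumed known)
    """
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    prop = np.asarray(prop, dtype=float)

    # Start from the cheap prediction, then add an inverse-propensity-weighted
    # residual on the evaluated examples to cancel the prediction bias.
    correction = np.where(observed, (obs - pred) / prop, 0.0)
    terms = pred + correction

    est = terms.mean()
    se = terms.std(ddof=1) / np.sqrt(len(terms))  # fixed-sample standard error
    return est, se

# Hypothetical simulation: true accuracy near 0.7, predictions stuck at 0.5.
rng = np.random.default_rng(0)
n = 20_000
true_scores = rng.binomial(1, 0.7, size=n).astype(float)
biased_pred = np.full(n, 0.5)            # deliberately biased predictor
sampled = rng.random(n) < 0.2            # evaluate about 20% of examples
est, se = doubly_robust_mean(biased_pred, true_scores, sampled, np.full(n, 0.2))

print("naive prediction mean:", biased_pred.mean())
print("doubly robust estimate:", est)
print("approx. 95% interval:", (est - 1.96 * se, est + 1.96 * se))
```

In this sketch the estimate stays unbiased even though the predictions are off, because the residual term is reweighted by the sampling probability. Better predictions shrink the residuals and therefore the interval width, which is the intuition behind using predicted scores to save evaluations.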
