How Benchmark Prediction from Fewer Data Misses the Mark
Guanhua Zhang, Florian E. Dorner, Moritz Hardt

TL;DR
This paper critically evaluates 11 benchmark prediction methods for LLM evaluation. It shows that these methods depend on model similarity and introduces a new method that modestly improves extrapolation performance, while highlighting fundamental limitations at the evaluation frontier.
Contribution
It systematically assesses existing benchmark prediction methods, identifies a strong simple baseline, and proposes a new weighting method to improve extrapolation, exposing key limitations.
Findings
Random sampling with regression is a strong baseline.
Existing methods depend heavily on model similarity.
Benchmark prediction struggles with extrapolating to new, more capable models.
Abstract
Large language model (LLM) evaluation is increasingly costly, prompting interest in methods that speed up evaluation by shrinking benchmark datasets. Benchmark prediction (also called efficient LLM evaluation) aims to select a small subset of evaluation points and predict overall benchmark performance from that subset. In this paper, we systematically assess the strengths and limitations of 11 benchmark prediction methods across 19 diverse benchmarks. First, we identify a highly competitive baseline: Take a random sample and fit a regression model on the sample to predict missing entries. Outperforming most existing methods, this baseline challenges the assumption that careful subset selection is necessary for benchmark prediction. Second, we discover that all existing methods crucially depend on model similarity. They work best when interpolating scores among similar models. The…
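The random-sample-plus-regression baseline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the excerpt does not specify the regression model, so ridge regression is used here as an assumption, and the function names (`fit_random_sample_regression`, `predict_full_score`), the `ridge` parameter, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np


def fit_random_sample_regression(source_scores, k, ridge=1.0, seed=0):
    """Fit a ridge regressor that maps scores on a random item subset to the
    full-benchmark mean score.

    source_scores: (n_models, n_items) array of per-item scores for models
    that were already evaluated on every benchmark item.
    NOTE: ridge regression is an assumption of this sketch, not a detail
    taken from the paper.
    """
    rng = np.random.default_rng(seed)
    n_items = source_scores.shape[1]
    # Random subset selection: no careful item selection is involved.
    subset = rng.choice(n_items, size=k, replace=False)

    X = source_scores[:, subset]      # subset scores of the source models
    y = source_scores.mean(axis=1)    # their full-benchmark mean scores

    # Center features and target so that no explicit intercept is needed.
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean

    # Closed-form ridge solution: (Xc^T Xc + ridge * I) w = Xc^T yc
    w = np.linalg.solve(Xc.T @ Xc + ridge * np.eye(k), Xc.T @ yc)
    return subset, w, x_mean, y_mean


def predict_full_score(target_subset_scores, w, x_mean, y_mean):
    """Predict a new model's full-benchmark score from its subset scores."""
    return float((np.asarray(target_subset_scores) - x_mean) @ w + y_mean)


if __name__ == "__main__":
    # Synthetic data for illustration only: a simple logistic response model.
    rng = np.random.default_rng(42)
    ability = rng.normal(size=(100, 1))
    difficulty = rng.normal(size=(1, 500))
    p_correct = 1.0 / (1.0 + np.exp(-(ability - difficulty)))
    scores = (rng.random((100, 500)) < p_correct).astype(float)

    # Hold out the last model as the "new" model to be evaluated cheaply.
    subset, w, x_mean, y_mean = fit_random_sample_regression(scores[:-1], k=50)
    pred = predict_full_score(scores[-1, subset], w, x_mean, y_mean)
    print(f"predicted: {pred:.3f}  actual: {scores[-1].mean():.3f}")
```

In this synthetic setup the held-out model is drawn from the same distribution as the source models, which corresponds to the interpolation regime. It says nothing about the extrapolation setting that the paper identifies as the hard case.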
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
