How Benchmark Prediction from Fewer Data Misses the Mark

Guanhua Zhang; Florian E. Dorner; Moritz Hardt

arXiv:2506.07673 · cs.LG · June 10, 2025

TL;DR

This paper critically evaluates 11 benchmark prediction methods for LLM evaluation and shows that they depend on model similarity. It introduces a new method that modestly improves extrapolation performance, but finds fundamental limitations at the evaluation frontier.

Contribution

The paper systematically assesses existing benchmark prediction methods, identifies a simple but strong baseline (random sampling with regression), and proposes a new weighting method to improve extrapolation, while exposing the limits of benchmark prediction for new, more capable models.

Findings

01. Random sampling with regression is a strong baseline.

02. Existing methods depend heavily on model similarity.

03. Benchmark prediction struggles with extrapolating to new, more capable models.
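
The difference between interpolating among similar models and extrapolating to more capable ones can be made concrete by how models are split into training and test sets. The sketch below is illustrative only: the function names, the toy accuracies, and the rule "hold out the most accurate models" are assumptions made for this example, not the paper's exact protocol.

```python
import numpy as np


def interpolation_split(full_acc, n_test, rng):
    """Random split: test models come from the same pool as training models."""
    perm = rng.permutation(len(full_acc))
    return np.sort(perm[n_test:]), np.sort(perm[:n_test])


def extrapolation_split(full_acc, n_test):
    """Frontier split: the n_test most accurate models are held out for testing."""
    order = np.argsort(full_acc, kind="stable")  # ascending accuracy
    return np.sort(order[:-n_test]), np.sort(order[-n_test:])


# Toy full-benchmark accuracies for six hypothetical models.
full_acc = np.array([0.31, 0.72, 0.45, 0.88, 0.53, 0.64])

train_idx, test_idx = extrapolation_split(full_acc, n_test=2)
rand_train_idx, rand_test_idx = interpolation_split(
    full_acc, n_test=2, rng=np.random.default_rng(0)
)

print("extrapolation test models:", test_idx.tolist())
print("interpolation test models:", rand_test_idx.tolist())
```

Under the frontier split, every test model is more accurate than any training model, so a predictor cannot lean on similar models it has already seen.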

Abstract

Large language model (LLM) evaluation is increasingly costly, prompting interest in methods that speed up evaluation by shrinking benchmark datasets. Benchmark prediction (also called efficient LLM evaluation) aims to select a small subset of evaluation points and predict overall benchmark performance from that subset. In this paper, we systematically assess the strengths and limitations of 11 benchmark prediction methods across 19 diverse benchmarks. First, we identify a highly competitive baseline: Take a random sample and fit a regression model on the sample to predict missing entries. Outperforming most existing methods, this baseline challenges the assumption that careful subset selection is necessary for benchmark prediction. Second, we discover that all existing methods crucially depend on model similarity. They work best when interpolating scores among similar models. The…
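
The baseline described in the abstract (take a random sample of evaluation points, then fit a regression model to predict the rest) might look roughly like the following. This is a minimal sketch under stated assumptions: the correctness data are synthetic, ridge regression stands in for the unspecified regression model, and `fit_ridge` and all sizes are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np


def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return w, y_mean - x_mean @ w


# Synthetic stand-in for real results (assumption): each model has a latent
# ability, each item a latent difficulty, and correctness is a Bernoulli draw.
rng = np.random.default_rng(0)
n_train, n_target, n_items, n_sampled = 300, 50, 500, 40
ability = rng.normal(size=(n_train + n_target, 1))
difficulty = rng.normal(size=(1, n_items))
p_correct = 1.0 / (1.0 + np.exp(-(ability - difficulty)))
correct = (rng.random(p_correct.shape) < p_correct).astype(float)

# Ground-truth benchmark score: mean correctness over all items.
full_acc = correct.mean(axis=1)

# Step 1: draw a random subset of items (no careful selection).
item_idx = rng.choice(n_items, size=n_sampled, replace=False)

# Step 2: on fully evaluated training models, regress full accuracy on subset results.
w, b = fit_ridge(correct[:n_train][:, item_idx], full_acc[:n_train])

# Step 3: predict full accuracy of target models from the sampled items only.
preds = correct[n_train:][:, item_idx] @ w + b
mae = float(np.abs(preds - full_acc[n_train:]).mean())
print(f"mean absolute error on held-out models: {mae:.3f}")
```

Training and target models here are drawn from the same distribution, so this sketch corresponds to the interpolation setting rather than to extrapolation toward more capable models.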

Code & Models

Repositories

socialfoundations/benchmark-prediction (official repository)

Videos

How Benchmark Prediction from Fewer Data Misses the Mark · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods
