TL;DR
This paper introduces Scales++, an item-centric method for selecting small, representative benchmark subsets based on task item properties, significantly reducing evaluation costs while maintaining accuracy.
Contribution
It proposes a novel item-centric approach for benchmark subset selection, focusing on task item properties rather than model failure patterns, with a practical implementation called Scales++.
Findings
Scales++ reduces selection cost by over 18x.
Predicts full benchmark scores with 3.2% MAE using only 0.25% of the data.
Achieves 2.9% MAE on Humanity's Last Exam using 2.0% of the data.
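To make the MAE figures above concrete, here is a minimal sketch of how such an error could be computed. It assumes the simplest possible estimator, the unweighted accuracy on the selected items, which is not necessarily the predictor Scales++ uses. The function name `subset_mae` and the 0/1 correctness matrix are hypothetical, introduced only for illustration.

```python
def subset_mae(correct, subset_idx):
    """Mean absolute error, in percentage points, between each model's
    accuracy on a subset of items and its accuracy on the full benchmark.

    correct:    list of rows, one per model; each row holds 0/1 per-item
                correctness (hypothetical input format).
    subset_idx: indices of the items kept in the small benchmark.
    """
    errors = []
    for row in correct:
        full_acc = sum(row) / len(row)
        subset_acc = sum(row[i] for i in subset_idx) / len(subset_idx)
        errors.append(abs(full_acc - subset_acc))
    return 100 * sum(errors) / len(errors)


# Two toy models evaluated on four items; keep items 0 and 2.
toy_results = [
    [1, 0, 1, 1],  # full accuracy 0.75, subset accuracy 1.00
    [0, 0, 1, 0],  # full accuracy 0.25, subset accuracy 0.50
]
toy_mae = subset_mae(toy_results, [0, 2])
print(toy_mae)
```

Under this reading, a 3.2% MAE means the subset-based estimate is off by 3.2 percentage points on average across the evaluated models.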
Abstract
The prohibitive cost of evaluating large language models (LLMs) on comprehensive benchmarks necessitates the creation of small yet representative data subsets (i.e., tiny benchmarks) that enable efficient assessment while retaining predictive fidelity. Current methods for this task operate under a model-centric paradigm, selecting benchmarking items based on the collective performance of existing models. Such approaches are limited by large upfront costs, an inability to immediately handle new benchmarks ("cold-start"), and the fragile assumption that future models will share the failure patterns of their predecessors. In this work, we propose a new item-centric approach to benchmark subset selection, arguing that selection should be based on the intrinsic properties of the task items themselves, rather than on model-specific failure patterns. We instantiate this item-centric efficient…
Peer Reviews
Decision·Submitted to ICLR 2026
- The paper is well presented and easy to follow.
- It highlights important issues with current efficient evaluation techniques, some of which make sense to me.
- The contribution is incremental. The approach basically ensembles AnchorPoints, TinyBenchmarks, and GeneralScale. AnchorPoints already evaluated an embedding-based baseline (see "Pretrained" in Table 2). This paper builds on that idea by replacing pretrained embeddings with GPT-4o annotations (General Scales). Following TinyBenchmarks, it also uses a weighted average between two estimators.
- The proposed method does not necessarily solve the generalization problem. Previous approaches can in
* An example-centric view is under-explored for this problem, and this work studies it in an innovative way. The method is also shown to be effective.
* The paper also does a good job of introducing prior work and contrasting it with the proposed method.
* Several remaining confusions about methods and experiment settings. See questions below.
- Using item characteristics to predict benchmark performance is a promising direction.
- The empirical performance of the proposed method appears to be strong.
- It is quite difficult to understand the experimental setup for most of the experiments in Section 5, making it difficult to evaluate (or reproduce) these experiments.
- I suspect that the evaluation of the random baseline in Figure 2 is bugged: MAE should decrease approximately as 1/sqrt(n) in the sample size n, but it essentially stays constant going from 0.5% to 2% of items. It is unclear whether the cause of this could affect the evaluation of the more complicated alternatives as well.
- Th
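The 1/sqrt(n) expectation raised in the review above can be checked with a small simulation. This is a sketch under stated assumptions, not a reproduction of the paper's Figure 2: it uses a single synthetic model with roughly 60% accuracy on 10,000 items, uniformly random subsets, and the plain subset mean as the estimator. All names and numbers here are hypothetical.

```python
import random


def random_subset_mae(items, n, trials, rng):
    """Average |subset accuracy - full accuracy| over repeated random
    subsets of size n, drawn without replacement."""
    full_acc = sum(items) / len(items)
    total_err = 0.0
    for _ in range(trials):
        sample = rng.sample(items, n)
        total_err += abs(sum(sample) / n - full_acc)
    return total_err / trials


rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic per-item correctness for one model (assumption: ~60% accuracy).
items = [1 if rng.random() < 0.6 else 0 for _ in range(10_000)]

# 0.5% and 2% of the items, matching the range mentioned in the review.
mae_small = random_subset_mae(items, 50, 2000, rng)
mae_large = random_subset_mae(items, 200, 2000, rng)
ratio = mae_small / mae_large

print(f"MAE at 0.5% of items: {100 * mae_small:.2f} points")
print(f"MAE at 2.0% of items: {100 * mae_large:.2f} points")
print(f"ratio: {ratio:.2f}")
```

Going from 0.5% to 2% of the items multiplies n by 4, so under these assumptions the error should shrink by a factor of about sqrt(4) = 2. A random baseline whose MAE stays flat over that range would therefore be unexpected, which is the reviewer's concern.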
