Cost-Efficient Estimation of General Abilities Across Benchmarks

Michael Krumdick; Adam Wiemerslage; Seth Ebner; Charles Lovering; Chris Tanner

arXiv:2604.01418 · cs.CL · April 3, 2026

TL;DR

This paper introduces a cost-efficient benchmarking framework for large language models. It pairs WILD, a dataset of item-level responses from 65 models on 109,564 items, with adaptive item selection, reducing evaluation cost by 85% while maintaining prediction accuracy.

Contribution

The paper combines multidimensional item response theory (MIRT) with adaptive item selection to predict a model's performance across diverse benchmarks from a small number of evaluated items.
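As a rough illustration of the idea, the sketch below uses a generic multidimensional two-parameter logistic model and picks the next item by the trace of its Fisher information at the current ability estimate. This is a textbook-style simplification, not the paper's implementation: the selection criterion, the function names (`p_correct`, `item_information`, `select_next_item`), and the toy item parameters are all assumptions made for the example.

```python
import math

def p_correct(theta, a, d):
    """Multidimensional 2PL: P(correct) = sigmoid(a . theta + d)."""
    z = sum(ai * ti for ai, ti in zip(a, theta)) + d
    return 1.0 / (1.0 + math.exp(-z))

def item_information(theta, a, d):
    """Trace of the item's Fisher information matrix p(1-p) * a a^T."""
    p = p_correct(theta, a, d)
    return p * (1.0 - p) * sum(ai * ai for ai in a)

def select_next_item(theta, items, asked):
    """Pick the unasked item with the highest information at theta."""
    candidates = [i for i in range(len(items)) if i not in asked]
    return max(candidates, key=lambda i: item_information(theta, *items[i]))

# Hypothetical items: (discrimination vector a, intercept d).
items = [
    ((1.0, 0.0), 0.0),  # moderately discriminating, well matched to theta
    ((0.2, 0.1), 0.0),  # weakly discriminating
    ((2.0, 0.0), 4.0),  # highly discriminating but far too easy at theta
]
theta = (0.0, 0.0)  # current ability estimate (assumed)

first = select_next_item(theta, items, asked=set())
second = select_next_item(theta, items, asked={first})
print(first, second)
```

In a full adaptive loop, the ability estimate would be updated after each observed response before the next item is selected; that step is omitted here.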

Findings

01. Predicts performance on unseen tasks with less than 7% MAE after only 16 items.

02. Incorporating cost-aware factors reduces evaluation tokens from 141,000 to 22,000.

03. Demonstrates an 85% reduction in evaluation cost while maintaining accuracy.
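The token counts in finding 02 can be checked against the reduction quoted in finding 03 with a line of arithmetic. The sketch below also shows one simple way a cost-aware criterion could work, ranking items by information per token. That ranking rule, the helper names, and the candidate values are hypothetical and are not taken from the paper.

```python
def relative_reduction(before, after):
    """Fraction of cost saved when going from `before` to `after`."""
    return 1.0 - after / before

def pick_cost_aware(candidates):
    """candidates: (name, information, token_cost); pick max information per token."""
    return max(candidates, key=lambda c: c[1] / c[2])[0]

# Token counts reported in finding 02.
reduction = relative_reduction(141_000, 22_000)
print(f"{reduction:.1%}")  # 84.4%, in line with the roughly 85% reported

# Hypothetical candidates: a long, slightly more informative item vs. a short one.
choice = pick_cost_aware([("long_item", 0.25, 2000), ("short_item", 0.20, 200)])
print(choice)  # short_item
```

Under this assumed rule, a short item with slightly less information can be preferred over a long one, which is the intuition behind trading a little information for a large saving in tokens.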

Abstract

Thousands of diverse benchmarks have been developed to measure the quality of large language models (LLMs). Yet prior work has demonstrated that LLM performance is often sufficiently explained by a small set of latent factors, or abilities. This suggests the potential for more efficient and principled benchmarking, but it remains difficult to compare the quality of different methods. Motivated by predictive validity, we argue that the quality of a benchmarking framework should be grounded in how efficiently it enables the prediction of model performance on unseen tasks. To analyze this objective, we collect the "Wide-scale Item Level Dataset" (WILD), a dataset of item-model response pairs, comprising evaluations of 65 models on 109,564 unique items spanning 163 tasks drawn from 27 datasets. This dataset enables the first analysis of how different techniques can predict a model's…
