Cost-Efficient Estimation of General Abilities Across Benchmarks
Michael Krumdick, Adam Wiemerslage, Seth Ebner, Charles Lovering, Chris Tanner

TL;DR
This paper introduces a cost-efficient benchmarking framework for large language models (LLMs) that pairs a large item-level response dataset (WILD) with adaptive item selection, cutting evaluation cost by about 85% while keeping prediction error on unseen tasks below 7% MAE.
Contribution
The paper combines multidimensional item response theory (IRT) with adaptive item selection to predict model performance across diverse benchmarks from only a small number of evaluated items.
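As a rough illustration of how multidimensional IRT and adaptive selection can fit together, the sketch below uses a multidimensional two-parameter logistic (M2PL) model with a greedy D-optimal selection rule on synthetic data. It is not the paper's implementation: the model form, the Gaussian prior, the selection criterion, and every name (`fit_theta`, `adaptive_test`, `ridge`) are assumptions made for illustration, and item parameters are treated as already known.

```python
import numpy as np

def sigmoid(z):
    # Clip logits to avoid overflow in exp.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def fit_theta(A, b, y, ridge=1.0, steps=25):
    """MAP ability estimate under an assumed M2PL model, P(correct) = sigmoid(a . theta - b),
    with a Gaussian prior of precision `ridge`, fitted by Newton steps."""
    dim = A.shape[1]
    theta = np.zeros(dim)
    for _ in range(steps):
        p = sigmoid(A @ theta - b)
        grad = A.T @ (y - p) - ridge * theta
        hess = (A.T * (p * (1.0 - p))) @ A + ridge * np.eye(dim)
        theta = theta + np.linalg.solve(hess, grad)
    return theta

def adaptive_test(A, b, respond, n_select=16, ridge=1.0):
    """Greedy D-optimal selection: at each step, ask the unasked item that most
    increases the determinant of the Fisher information at the current estimate."""
    n_items, dim = A.shape
    asked = np.zeros(n_items, dtype=bool)
    order, answers = [], []
    theta = np.zeros(dim)
    for _ in range(n_select):
        p = sigmoid(A @ theta - b)
        w = p * (1.0 - p)  # per-item information weight
        info = (A[asked].T * w[asked]) @ A[asked] + ridge * np.eye(dim)
        # Matrix determinant lemma: det(info + w a a^T) = det(info) * (1 + w a^T info^-1 a)
        gain = w * np.einsum("ij,jk,ik->i", A, np.linalg.inv(info), A)
        gain[asked] = -np.inf
        i = int(np.argmax(gain))
        asked[i] = True
        order.append(i)
        answers.append(respond(i))
        theta = fit_theta(A[order], b[order], np.array(answers, dtype=float), ridge)
    return theta, order

# Synthetic demo: hypothetical item parameters and a simulated model under test.
rng = np.random.default_rng(0)
n_items, dim = 200, 3
A = np.abs(rng.normal(size=(n_items, dim)))  # item discriminations
b = rng.normal(size=n_items)                 # item difficulties
theta_true = rng.normal(size=dim)
p_true = sigmoid(A @ theta_true - b)

theta_hat, order = adaptive_test(A, b, lambda i: float(rng.random() < p_true[i]))
held_out = np.setdiff1d(np.arange(n_items), order)
pred_acc = sigmoid(A[held_out] @ theta_hat - b[held_out]).mean()
true_acc = p_true[held_out].mean()
print(f"predicted accuracy {pred_acc:.3f} vs expected {true_acc:.3f}")
```

The printed gap between predicted and expected accuracy on the unasked items is the kind of quantity the reported MAE summarizes, here for one simulated model rather than real benchmark tasks.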
Findings
Predicts performance on unseen tasks with a mean absolute error (MAE) below 7% after observing only 16 items.
Incorporating cost-aware factors into item selection reduces the evaluation budget from 141,000 tokens to 22,000.
This amounts to an approximately 85% reduction in evaluation cost while maintaining accuracy.
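The reduction figure can be checked directly from the two token counts. The short calculation below uses only the numbers quoted above; the variable names are illustrative.

```python
tokens_before = 141_000  # tokens needed without cost-aware selection
tokens_after = 22_000    # tokens needed with cost-aware selection

reduction = 1 - tokens_after / tokens_before
speedup = tokens_before / tokens_after

print(f"reduction: {reduction:.1%}")
print(f"cost ratio: {speedup:.1f}x")
```

With the rounded counts the reduction comes out at about 84.4%, which is consistent with the reported 85% if the underlying token counts were rounded to the nearest thousand.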
Abstract
Thousands of diverse benchmarks have been developed to measure the quality of large language models (LLMs). Yet prior work has demonstrated that LLM performance is often sufficiently explained by a small set of latent factors, or abilities. This suggests the potential for more efficient and principled benchmarking, but it remains difficult to compare the quality of different methods. Motivated by predictive validity, we argue that the quality of a benchmarking framework should be grounded in how efficiently it enables the prediction of model performance on unseen tasks. To analyze this objective, we collect the "Wide-scale Item Level Dataset" (WILD), a dataset of item-model response pairs, comprising evaluations of 65 models on 109,564 unique items spanning 163 tasks drawn from 27 datasets. This dataset enables the first analysis of how different techniques can predict a model's…
