metabench -- A Sparse Benchmark of Reasoning and Knowledge in Large Language Models
Alex Kipnis, Konstantinos Voudouris, Luca M. Schulze Buschoff, Eric Schulz

TL;DR
This paper introduces metabench, a highly compressed, sparse benchmark derived from six large LLM benchmarks, which efficiently estimates underlying abilities and scores with minimal data, revealing a strong common factor.
Contribution
It presents a novel method to distill large benchmarks into a sparse set of informative items that accurately estimate underlying abilities and scores.
Findings
The sparse benchmark is less than 3% of the combined size of the six original benchmarks
Estimators reconstruct original scores with less than 1.5% RMSE
A single underlying factor strongly correlates with total scores
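To make the size and RMSE figures concrete, the following is a minimal sketch of how a reconstruction error of this kind can be measured. It is not the authors' pipeline: the response matrix is simulated from a single latent ability (a Rasch-style assumption), items are kept at random rather than chosen for informativeness, and a plain linear fit stands in for the paper's estimators. The function names `simulate_responses` and `rmse_of_reconstruction` are hypothetical.

```python
import numpy as np


def simulate_responses(n_models, n_items, seed=0):
    """Simulate a binary (model x item) correctness matrix.

    Assumption: one latent ability per model and one difficulty per item,
    combined through a logistic link (a Rasch-style toy model).
    """
    rng = np.random.default_rng(seed)
    ability = rng.normal(size=(n_models, 1))
    difficulty = rng.normal(size=(1, n_items))
    p_correct = 1.0 / (1.0 + np.exp(-(ability - difficulty)))
    return (rng.random((n_models, n_items)) < p_correct).astype(float)


def rmse_of_reconstruction(responses, n_keep, seed=0):
    """RMSE (in percentage points) when predicting the full score
    from the score on a random subset of n_keep items.

    Assumption: random item selection and a linear fit, used here only
    as simple stand-ins for informed item selection and richer estimators.
    """
    rng = np.random.default_rng(seed)
    n_models, n_items = responses.shape
    keep = rng.choice(n_items, size=n_keep, replace=False)

    full = 100.0 * responses.mean(axis=1)             # full benchmark score in %
    sparse = 100.0 * responses[:, keep].mean(axis=1)  # sparse benchmark score in %

    # Fit on the first half of the models, evaluate on the held-out second half.
    half = n_models // 2
    slope, intercept = np.polyfit(sparse[:half], full[:half], deg=1)
    predicted = slope * sparse[half:] + intercept
    return float(np.sqrt(np.mean((predicted - full[half:]) ** 2)))


if __name__ == "__main__":
    responses = simulate_responses(n_models=500, n_items=1000)
    # Keep 3% of the items, mirroring the compression ratio reported above.
    print(rmse_of_reconstruction(responses, n_keep=30))
```

Because the kept items are random here, the printed error says nothing about metabench itself; the sketch only shows what "RMSE of reconstructed scores" means operationally for a benchmark reduced to a small fraction of its items.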
Abstract
Large Language Models (LLMs) vary in their abilities on a range of tasks. Initiatives such as the Open LLM Leaderboard aim to quantify these differences with several large benchmarks (sets of test items to which an LLM can respond either correctly or incorrectly). However, high correlations within and between benchmark scores suggest that (1) there exists a small set of common underlying abilities that these benchmarks measure, and (2) items tap into redundant information and the benchmarks may thus be considerably compressed. We use data from n > 5000 LLMs to identify the most informative items of six benchmarks, ARC, GSM8K, HellaSwag, MMLU, TruthfulQA and WinoGrande (with d = 28,632 items in total). From them we distill a sparse benchmark, metabench, that has less than 3% of the original size of all six benchmarks combined. This new sparse benchmark goes beyond point scores by…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
Methods: Sparse Evolutionary Training
