metabench -- A Sparse Benchmark of Reasoning and Knowledge in Large Language Models

Alex Kipnis; Konstantinos Voudouris; Luca M. Schulze Buschoff; Eric Schulz

arXiv:2407.12844·cs.CL·February 21, 2025·1 cites

TL;DR

This paper introduces metabench, a sparse benchmark distilled from six large LLM benchmarks. It estimates underlying abilities and the original benchmark scores from less than 3% of the items, and reveals a strong common factor across benchmarks.

Contribution

It presents a novel method to distill large benchmarks into a sparse set of informative items that accurately estimate underlying abilities and scores.

Findings

01 The sparse benchmark is less than 3% of the combined size of the six original benchmarks.

02 Estimators reconstruct the original scores with less than 1.5% RMSE.

03 A single underlying factor correlates strongly with total scores.

Abstract

Large Language Models (LLMs) vary in their abilities on a range of tasks. Initiatives such as the Open LLM Leaderboard aim to quantify these differences with several large benchmarks (sets of test items to which an LLM can respond either correctly or incorrectly). However, high correlations within and between benchmark scores suggest that (1) there exists a small set of common underlying abilities that these benchmarks measure, and (2) items tap into redundant information and the benchmarks may thus be considerably compressed. We use data from n > 5000 LLMs to identify the most informative items of six benchmarks, ARC, GSM8K, HellaSwag, MMLU, TruthfulQA and WinoGrande (with d = 28,632 items in total). From them we distill a sparse benchmark, metabench, that has less than 3% of the original size of all six benchmarks combined. This new sparse benchmark goes beyond point scores by…
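To make the two headline quantities concrete, here is a minimal Python sketch of the arithmetic behind them. It is not the authors' code: the function names `max_items_below_percent` and `rmse` are hypothetical, and the toy scores are made up for illustration. Only the item count d = 28,632 and the 3% bound come from the text above.

```python
import math


def max_items_below_percent(total_items: int, percent: int) -> int:
    """Largest item count that is strictly below `percent`% of `total_items`."""
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return (total_items * percent - 1) // 100


def rmse(original, reconstructed):
    """Root-mean-square error between two equal-length score sequences."""
    if len(original) != len(reconstructed):
        raise ValueError("score sequences must have the same length")
    squared = [(o - r) ** 2 for o, r in zip(original, reconstructed)]
    return math.sqrt(sum(squared) / len(squared))


# d = 28,632 is the combined item count stated in the abstract.
ITEM_BUDGET = max_items_below_percent(28_632, 3)

# Made-up scores (in percentage points) for three hypothetical models;
# these are NOT results from the paper.
toy_original = [62.0, 48.5, 71.0]
toy_reconstructed = [63.0, 47.5, 72.0]
TOY_RMSE = rmse(toy_original, toy_reconstructed)
```

Under these assumptions, "less than 3%" of 28,632 items means at most 858 items, and an RMSE below 1.5% means the reconstructed scores deviate from the originals by less than 1.5 percentage points in the root-mean-square sense.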

Code & Models

Repositories

adkipnis/metabench
Official

Datasets

HCAI/metabench
dataset · 938 downloads

Videos

metabench - A Sparse Benchmark of Reasoning and Knowledge in Large Language Models · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification

Methods: Sparse Evolutionary Training