ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities

Adhiraj Ghosh; Sebastian Dziadzio; Ameya Prabhu; Vishaal Udandarao; Samuel Albanie; Matthias Bethge

arXiv:2412.06745·cs.LG·June 18, 2025

TL;DR

This paper introduces ONEBench, a flexible, sample-level benchmarking framework that enables open-ended, customizable evaluation of foundation models across diverse capabilities, reducing bias and evaluation costs.

Contribution

The paper proposes a benchmarking paradigm that aggregates diverse, incomplete sample-level measurements into reliable model scores, supporting continuous, open-ended evaluation of models.

Findings

01 Aggregation algorithm ensures reliable model ranking

02 Robust to 95% missing measurements

03 Reduces evaluation cost by up to 20x

Abstract

Traditional fixed test sets fall short in evaluating open-ended capabilities of foundation models. To address this, we propose ONEBench (OpeN-Ended Benchmarking), a new testing paradigm that consolidates individual evaluation datasets into a unified, ever-expanding sample pool. ONEBench allows users to generate custom, open-ended evaluation benchmarks from this pool, corresponding to specific capabilities of interest. By aggregating samples across test sets, ONEBench enables the assessment of diverse capabilities beyond those covered by the original test sets, while mitigating overfitting and dataset bias. Most importantly, it frames model evaluation as a collective process of selecting and aggregating sample-level tests. The shift from task-specific benchmarks to ONEBench introduces two challenges: (1) heterogeneity and (2) incompleteness. Heterogeneity refers to the aggregation over…
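To make the idea of selecting and aggregating sample-level tests concrete, here is a minimal Python sketch. It is not the paper's aggregation algorithm: it uses a plain mean over observed measurements as a stand-in, and the pool contents, capability tags, model names, and the helpers `select_samples`, `score`, and `rank_models` are all hypothetical, introduced only for illustration.

```python
from typing import Dict, List, Optional, Set

# Hypothetical sample pool: each sample-level test carries capability tags.
POOL: Dict[str, Set[str]] = {
    "s1": {"counting"},
    "s2": {"counting", "ocr"},
    "s3": {"ocr"},
    "s4": {"counting"},
}

# Hypothetical sparse measurements: 1 = correct, 0 = incorrect.
# A missing key means the model was never evaluated on that sample.
RESULTS: Dict[str, Dict[str, int]] = {
    "model_a": {"s1": 1, "s2": 1, "s3": 0},
    "model_b": {"s1": 0, "s4": 1},
    "model_c": {"s3": 1},
}


def select_samples(pool: Dict[str, Set[str]], capability: str) -> List[str]:
    """Build a custom benchmark: every pooled sample tagged with the capability."""
    return sorted(sid for sid, tags in pool.items() if capability in tags)


def score(results: Dict[str, int], sample_ids: List[str]) -> Optional[float]:
    """Mean accuracy over the samples the model was actually measured on.

    Returns None when the model has no measurement on the selected samples.
    This simple mean is an assumption for illustration, not the paper's method.
    """
    observed = [results[sid] for sid in sample_ids if sid in results]
    return sum(observed) / len(observed) if observed else None


def rank_models(all_results: Dict[str, Dict[str, int]],
                sample_ids: List[str]) -> List[str]:
    """Rank models by score (descending), skipping models with no measurements."""
    scored = {m: score(r, sample_ids) for m, r in all_results.items()}
    scored = {m: s for m, s in scored.items() if s is not None}
    return sorted(scored, key=lambda m: (-scored[m], m))


counting = select_samples(POOL, "counting")
ranking = rank_models(RESULTS, counting)
print(counting, ranking)
```

A mean over observed entries is easy to read but treats every sample as equally informative and ignores which samples each model happened to be measured on, so it should be taken only as a baseline for thinking about the incompleteness challenge described above.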

Taxonomy

Topics: Bayesian Modeling and Causal Inference