Unsteady Metrics and Benchmarking Cultures of AI Model Builders

Stefan Baack, Christo Buschek, and Maty Bohacek

arXiv:2605.14164 · cs.AI · May 15, 2026

TL;DR

This paper examines how AI model builders select and highlight benchmarks in public communications, revealing a fragmented evaluation landscape and proposing a taxonomy to understand their narratives.

Contribution

It introduces the Benchmarking-Cultures-25 dataset, analyzes how model builders select benchmarks, and develops a taxonomy for interpreting the evaluation narratives those selections convey.

Findings

01 63.2% of highlighted benchmarks are used by only one builder

02 38.5% of highlighted benchmarks appear in just one release

03 Many benchmarks emphasize progress toward AGI over scientific validity

Abstract

The primary way to establish and compare competencies in foundation and generative AI models has shifted from peer-reviewed literature to press releases and company blog posts, where model builders highlight results on selected benchmarks. These artifacts now largely define the state of the art for researchers and the public. Despite their prominence, which benchmarks model builders choose to highlight, and what they communicate through this selection, is underexamined. To investigate, we introduce and open-source Benchmarking-Cultures-25, a dataset of 231 benchmarks highlighted across 139 model releases in 2025 from 11 major AI builders, alongside an interactive tool to explore the data. Our analysis reveals a fragmented evaluation landscape with limited cross-model comparability: 63.2% of highlighted benchmarks are used by a single builder, and 38.5% appear in just one release. Few…
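The two fragmentation figures in the abstract are shares over the set of distinct highlighted benchmarks. The following is a minimal sketch of how such shares could be computed, assuming the data can be flattened into (benchmark, builder, release) records. The function name `fragmentation_stats`, the record layout, and the toy values are all hypothetical and are not taken from the Benchmarking-Cultures-25 schema or the authors' code.

```python
from collections import defaultdict


def fragmentation_stats(records):
    """Return (single-builder share, single-release share).

    `records` is an iterable of (benchmark, builder, release) tuples.
    This layout is an assumption made for illustration only.
    """
    builders = defaultdict(set)  # benchmark -> builders that highlighted it
    releases = defaultdict(set)  # benchmark -> releases that highlighted it
    for benchmark, builder, release in records:
        builders[benchmark].add(builder)
        releases[benchmark].add(release)

    total = len(builders)
    if total == 0:
        raise ValueError("no records")

    single_builder = sum(1 for b in builders.values() if len(b) == 1)
    single_release = sum(1 for r in releases.values() if len(r) == 1)
    return single_builder / total, single_release / total


# Toy data (hypothetical, not drawn from the dataset).
toy_records = [
    ("bench_a", "builder_1", "release_1"),
    ("bench_a", "builder_2", "release_2"),
    ("bench_b", "builder_1", "release_1"),
    ("bench_b", "builder_1", "release_3"),
    ("bench_c", "builder_2", "release_2"),
    ("bench_d", "builder_3", "release_4"),
]

single_builder_share, single_release_share = fragmentation_stats(toy_records)
print(f"{single_builder_share:.1%} single-builder, "
      f"{single_release_share:.1%} single-release")
```

On the toy data, three of four benchmarks are tied to one builder and two of four to one release. Note that a benchmark appearing in a single release necessarily belongs to a single builder, so the second share can never exceed the first, which is consistent with the reported 38.5% and 63.2%.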

Code & Models

Repositories

https://hf.co/datasets/matybohacek/benchmarking-cultures-25

Datasets

matybohacek/benchmarking-cultures-25
dataset · 381 downloads
