Unsteady Metrics and Benchmarking Cultures of AI Model Builders
Stefan Baack, Christo Buschek, and Maty Bohacek

TL;DR
This paper examines how AI model builders select and highlight benchmarks in public communications, revealing a fragmented evaluation landscape and proposing a taxonomy to understand their narratives.
Contribution
It introduces the Benchmarking-Cultures-25 dataset, analyzes how builders select benchmarks, and develops a taxonomy for interpreting the diverse evaluation narratives in AI.
Findings
63.2% of highlighted benchmarks are used by only one builder
38.5% of benchmarks appear in just one release
Many benchmarks emphasize progress toward AGI over scientific validity
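The two percentages above are shares of benchmarks that occur under only one builder or in only one release. A minimal sketch of how such shares could be computed follows. It assumes the data can be flattened into (builder, release, benchmark) records; the record layout, the function name, and the toy entries are all hypothetical and are not taken from Benchmarking-Cultures-25.

```python
from collections import defaultdict

# Hypothetical toy records: (builder, release, benchmark).
# These are illustrative only and do not come from the actual dataset.
records = [
    ("BuilderA", "A-1", "bench_x"),
    ("BuilderA", "A-2", "bench_x"),
    ("BuilderB", "B-1", "bench_y"),
    ("BuilderA", "A-1", "bench_z"),
    ("BuilderB", "B-1", "bench_z"),
    ("BuilderB", "B-2", "bench_w"),
]


def fragmentation_shares(records):
    """Return (share used by a single builder, share seen in a single release)."""
    builders = defaultdict(set)   # benchmark -> builders that highlighted it
    releases = defaultdict(set)   # benchmark -> (builder, release) pairs
    for builder, release, benchmark in records:
        builders[benchmark].add(builder)
        releases[benchmark].add((builder, release))

    total = len(builders)  # number of distinct benchmarks
    single_builder = sum(1 for b in builders.values() if len(b) == 1)
    single_release = sum(1 for r in releases.values() if len(r) == 1)
    return single_builder / total, single_release / total


builder_share, release_share = fragmentation_shares(records)
print(f"single builder: {builder_share:.1%}, single release: {release_share:.1%}")
```

In the toy data, three of the four benchmarks are highlighted by one builder only and two appear in one release only, so the function returns 0.75 and 0.5. A benchmark that appears in a single release necessarily has a single builder, which is why the second share can never exceed the first.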
Abstract
The primary way to establish and compare competencies in foundation and generative AI models has shifted from peer-reviewed literature to press releases and company blog posts, where model builders highlight results on selected benchmarks. These artifacts now largely define the state of the art for researchers and the public. Despite their prominence, which benchmarks model builders choose to highlight, and what they communicate through this selection, is underexamined. To investigate, we introduce and open-source Benchmarking-Cultures-25, a dataset of 231 benchmarks highlighted across 139 model releases in 2025 from 11 major AI builders, alongside an interactive tool to explore the data. Our analysis reveals a fragmented evaluation landscape with limited cross-model comparability: 63.2% of highlighted benchmarks are used by a single builder, and 38.5% appear in just one release. Few…
