When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation

Mubashara Akhtar; Anka Reuel; Prajna Soni; Sanchit Ahuja; Pawan Sasanka Ammanamanchi; Ruchit Rawal; Vilém Zouhar; Srishti Yadav; Chenxi Whitehouse; Dayeon Ki; Jennifer Mickel; Leshem Choshen; Marek Šuppa; Jan Batzner; Jenny Chim; Jeba Sania; Yanan Long; Hossein A. Rahmani; Christina Knight; Yiyang Nan; Jyoutir Raj; Yu Fan; Shubham Singh; Subramanyam Sahoo; Eliya Habba; Usman Gohar; Siddhesh Pawar; Robert Scholz; Arjun Subramonian; Jingwei Ni; Mykel Kochenderfer; Sanmi Koyejo; Mrinmaya Sachan; Stella Biderman; Zeerak Talat; Avijit Ghosh; Irene Solaiman

arXiv:2602.16763 · cs.AI · February 20, 2026

Open Access

TL;DR

This paper systematically studies the saturation of AI benchmarks, analyzing 60 LLM benchmarks to identify factors that influence how quickly they become less effective at differentiating model performance.

Contribution

The paper provides a comprehensive analysis of benchmark saturation, identifying the benchmark properties that affect longevity and offering guidance for designing more durable AI evaluation benchmarks.

Findings

01 Nearly half of the benchmarks show saturation, and saturation rates increase with benchmark age.

02 Expert-curated benchmarks resist saturation better than crowdsourced ones.

03 Hiding test data does not prevent benchmark saturation.

Abstract

Artificial Intelligence (AI) benchmarks play a central role in measuring progress in model development and guiding deployment decisions. However, many benchmarks quickly become saturated, meaning that they can no longer differentiate between the best-performing models, diminishing their long-term value. In this study, we analyze benchmark saturation across 60 Large Language Model (LLM) benchmarks selected from technical reports by major model developers. To identify factors driving saturation, we characterize benchmarks along 14 properties spanning task design, data construction, and evaluation format. We test five hypotheses examining how each property contributes to saturation rates. Our analysis reveals that nearly half of the benchmarks exhibit saturation, with rates increasing as benchmarks age. Notably, hiding test data (i.e., public vs. private) shows no protective effect, while…
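To make the notion of saturation concrete, the following is a minimal sketch of one way to flag a benchmark whose top models can no longer be told apart. It is not the paper's measure, which this excerpt does not specify. The function names `top_k_spread` and `is_saturated`, the choice of `k`, the `min_spread` threshold, and the example scores are all hypothetical and chosen only for illustration.

```python
def top_k_spread(scores, k=5):
    """Gap between the best and the k-th best score on a benchmark.

    Hypothetical helper: a small gap means the leading models are
    nearly indistinguishable on this benchmark.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    ranked = sorted(scores, reverse=True)[:k]
    if len(ranked) < 2:
        raise ValueError("need at least two scores")
    return ranked[0] - ranked[-1]


def is_saturated(scores, k=5, min_spread=1.0):
    """Flag a benchmark as saturated when the top-k spread is below
    min_spread (an assumed, illustrative threshold in score points)."""
    return top_k_spread(scores, k) < min_spread


# Made-up leaderboard scores, for illustration only.
crowded_leaderboard = [89.1, 88.9, 88.7, 88.6, 88.5, 71.0]
spread_out_leaderboard = [92.0, 85.5, 80.1, 74.3, 60.2]

print(is_saturated(crowded_leaderboard))     # top five within 0.6 points
print(is_saturated(spread_out_leaderboard))  # top five span over 30 points
```

A spread-based check like this ignores measurement noise and the distance to the maximum achievable score, so a more careful analysis would likely account for both.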

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)