State-of-the-Art Claims Require State-of-the-Art Evidence
YongKyung Oh

TL;DR
This paper highlights the gap between SOTA claims and the evidence behind them in AI benchmarking: a top model's apparent superiority often rests on fragile, outlier-driven aggregate scores rather than on consistent, meaningful improvements.
Contribution
It critically examines the validity of SOTA claims in AI benchmarks, emphasizing the need for more honest and precise reporting of evidence beyond mean score improvements.
Findings
In more than half of top-model comparisons, at least one property such as effect size or robustness is missing.
Aggregate gains are often driven by outlier datasets.
Fragility of claims persists even in benchmarks with many tasks.
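The findings above can be illustrated with a small sketch of how a single outlier task can carry an aggregate gain. This is not the paper's procedure: the function `compare`, the leave-one-task-out check, and all scores below are hypothetical, chosen only to show how a mean-score lead can coexist with a low per-task win rate.

```python
from statistics import mean


def compare(scores_a, scores_b):
    """Compare two models from per-task scores listed in the same task order."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    # Aggregate view: the mean gain that a leaderboard would report.
    mean_gain = mean(diffs)
    # Per-task view: fraction of tasks where model A actually scores higher.
    win_rate = sum(d > 0 for d in diffs) / len(diffs)
    # Leave-one-task-out: recompute the mean gain with each task removed in turn.
    loo_gains = [mean(diffs[:i] + diffs[i + 1:]) for i in range(len(diffs))]
    return {
        "mean_gain": mean_gain,
        "win_rate": win_rate,
        "min_loo_gain": min(loo_gains),
        # True only if A keeps a positive mean gain no matter which task is dropped.
        "robust_to_single_task": all(g > 0 for g in loo_gains),
    }


# Hypothetical scores: model A loses four tasks narrowly and wins one by a wide margin.
model_a = [70.0, 71.0, 69.0, 72.0, 95.0]
model_b = [71.0, 72.0, 70.0, 73.0, 70.0]
result = compare(model_a, model_b)
print(result)
```

In this made-up example, model A leads on the mean while winning only one of five tasks, and the lead reverses once that one task is dropped. A check of this kind is one simple way to see whether an aggregate gain depends on a single dataset.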
Abstract
State-of-the-Art (SOTA) claims pervade Artificial Intelligence (AI) and Machine Learning (ML) research. These claims rest on benchmark evaluations, where models are ranked by aggregate scores across tasks. Public benchmarks or leaderboards are the most visible instance, but the same structure appears in paper tables throughout the literature. However, such minimal evidence often cannot support these strong claims. We identify a widespread claim-evidence gap in AI benchmarking. Claiming SOTA carries implicit assumptions beyond mean score superiority, suggesting that a model meaningfully outperforms alternatives across most tasks. However, a marginal improvement in the mean score merely indicates a top average rank rather than true superiority. Analyzing ten cross-domain benchmarks from public leaderboards, we found that in more than half of top-model comparisons, at least one commonly…
