BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices
Anka Reuel, Amelia Hardy, Chandler Smith, Max Lamparth, Malcolm Hardy, Mykel J. Kochenderfer

TL;DR
This paper introduces a comprehensive framework for assessing AI benchmarks, revealing significant quality issues in existing benchmarks and providing tools and best practices to improve their reliability and usability.
Contribution
It develops a detailed assessment framework for AI benchmarks, evaluates 24 benchmarks against it, and offers a checklist and repository to enhance benchmark quality and comparability.
Findings
Large quality differences among benchmarks
Most benchmarks lack statistical significance reporting
Many benchmarks are not easily replicable
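One common way to attach uncertainty to a benchmark score, which relates to the finding on statistical significance above, is a percentile bootstrap over per-item outcomes. The following is a minimal sketch and not a procedure taken from the paper: the function name `bootstrap_accuracy_ci`, its parameters, and the example data are all hypothetical and chosen only for illustration.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Return (accuracy, lower, upper) using a percentile bootstrap.

    `correct` is a list of 0/1 outcomes, one per benchmark item, for a
    single model. This is an illustrative helper, not the paper's method.
    """
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(correct)
    resampled_means = []
    for _ in range(n_resamples):
        # Resample the items with replacement and recompute accuracy.
        sample = rng.choices(correct, k=n)
        resampled_means.append(sum(sample) / n)
    resampled_means.sort()
    # Take the alpha/2 and 1 - alpha/2 percentiles of the resampled scores.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return sum(correct) / n, resampled_means[lo_idx], resampled_means[hi_idx]


# Hypothetical data: a model answers 70 of 100 items correctly.
outcomes = [1] * 70 + [0] * 30
accuracy, lower, upper = bootstrap_accuracy_ci(outcomes)
print(f"accuracy={accuracy:.2f}, 95% CI=[{lower:.2f}, {upper:.2f}]")
```

Reporting an interval like this alongside the point score makes it easier to judge whether a gap between two models is larger than the resampling noise. Fixing the seed keeps the reported interval replicable.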
Abstract
AI models are increasingly prevalent in high-stakes environments, necessitating thorough assessment of their capabilities and risks. Benchmarks are popular for measuring these attributes and for comparing model performance, tracking progress, and identifying weaknesses in foundation and non-foundation models. They can inform model selection for downstream tasks and influence policy initiatives. However, not all benchmarks are the same: their quality depends on their design and usability. In this paper, we develop an assessment framework considering 46 best practices across an AI benchmark's lifecycle and evaluate 24 AI benchmarks against it. We find that there exist large quality differences and that commonly used benchmarks suffer from significant issues. We further find that most benchmarks do not report statistical significance of their results nor allow for their results to be…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI
