BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices

Anka Reuel; Amelia Hardy; Chandler Smith; Max Lamparth; Malcolm Hardy; Mykel J. Kochenderfer

arXiv:2411.12990 · cs.AI · November 21, 2024 · 2 cites

TL;DR

This paper introduces a comprehensive framework for assessing AI benchmarks, revealing significant quality issues in existing benchmarks and providing tools and best practices to improve their reliability and usability.

Contribution

It develops a detailed assessment framework for AI benchmarks, evaluates 24 benchmarks against it, and offers a checklist and repository to enhance benchmark quality and comparability.

Findings

01 Large quality differences among benchmarks

02 Most benchmarks lack statistical significance reporting

03 Many benchmarks are not easily replicable

Abstract

AI models are increasingly prevalent in high-stakes environments, necessitating thorough assessment of their capabilities and risks. Benchmarks are popular for measuring these attributes and for comparing model performance, tracking progress, and identifying weaknesses in foundation and non-foundation models. They can inform model selection for downstream tasks and influence policy initiatives. However, not all benchmarks are the same: their quality depends on their design and usability. In this paper, we develop an assessment framework considering 46 best practices across an AI benchmark's lifecycle and evaluate 24 AI benchmarks against it. We find that there exist large quality differences and that commonly used benchmarks suffer from significant issues. We further find that most benchmarks do not report statistical significance of their results nor allow for their results to be…

Videos

BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices · SlidesLive

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI