More than Marketing? On the Information Value of AI Benchmarks for Practitioners
Amelia Hardy, Anka Reuel, Kiana Jafari Meimandi, Lisa Soder, Allie Griffith, Dylan M. Asmar, Sanmi Koyejo, Michael S. Bernstein, Mykel J. Kochenderfer

TL;DR
This study explores how AI benchmarks influence decision-making among practitioners, revealing their limitations in real-world applications and proposing criteria for more meaningful and robust benchmark design.
Contribution
It provides empirical insights into benchmark usage across domains and offers practical recommendations for developing more effective AI evaluation frameworks.
Findings
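Benchmarks serve as a signal of relative performance differences between models, but how decisive that signal is varies by setting.
In academic settings, public benchmarks are generally viewed as suitable measures of research progress.
In product and policy domains, benchmarks are often found inadequate for informing decisions.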
Abstract
Public AI benchmark results are widely broadcast by model developers as indicators of model quality within a growing and competitive market. However, these advertised scores do not necessarily reflect the traits of interest to those who will ultimately apply AI models. In this paper, we seek to understand if and how AI benchmarks are used to inform decision-making. Based on the analyses of interviews with 19 individuals who have used, or decided against using, benchmarks in their day-to-day work, we find that across these settings, participants use benchmarks as a signal of relative performance difference between models. However, whether this signal was considered a definitive sign of model superiority, sufficient for downstream decisions, varied. In academia, public benchmarks were generally viewed as suitable measures for capturing research progress. By contrast, in both product and…
Taxonomy
Topics: Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)
