More than Marketing? On the Information Value of AI Benchmarks for Practitioners

Amelia Hardy; Anka Reuel; Kiana Jafari Meimandi; Lisa Soder; Allie Griffith; Dylan M. Asmar; Sanmi Koyejo; Michael S. Bernstein; Mykel J. Kochenderfer

arXiv:2412.05520·cs.AI·December 10, 2024

TL;DR

This study explores how AI benchmarks influence decision-making among practitioners, revealing their limitations in real-world applications and proposing criteria for more meaningful and robust benchmark design.

Contribution

The paper provides empirical insights into how benchmarks are used across academic, product, and policy settings, and offers practical recommendations for developing more effective AI evaluation frameworks.

Findings

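01 Participants use benchmarks as a signal of relative performance between models, but whether that signal is treated as decisive varies.

02 In academia, public benchmarks are generally viewed as suitable measures of research progress.

03 In product and policy settings, benchmarks are often found inadequate for informing decisions.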

Abstract

Public AI benchmark results are widely broadcast by model developers as indicators of model quality within a growing and competitive market. However, these advertised scores do not necessarily reflect the traits of interest to those who will ultimately apply AI models. In this paper, we seek to understand if and how AI benchmarks are used to inform decision-making. Based on the analyses of interviews with 19 individuals who have used, or decided against using, benchmarks in their day-to-day work, we find that across these settings, participants use benchmarks as a signal of relative performance difference between models. However, whether this signal was considered a definitive sign of model superiority, sufficient for downstream decisions, varied. In academia, public benchmarks were generally viewed as suitable measures for capturing research progress. By contrast, in both product and…

Taxonomy

Topics: Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)