Measurement to Meaning: A Validity-Centered Framework for AI Evaluation

Olawale Salaudeen; Anka Reuel; Ahmed Ahmed; Suhana Bedi; Zachary Robertson; Sudharsan Sundar; Ben Domingue; Angelina Wang; Sanmi Koyejo

arXiv:2505.10573 · cs.CY · June 27, 2025 · 2 cites

Open Access

TL;DR

This paper introduces a structured, validity-centered framework for evaluating AI systems, aiming to improve the clarity and reliability of claims based on diverse measurement methods.

Contribution

It offers a novel framework based on psychometric validity principles to better interpret and construct AI evaluations, enhancing their relevance and rigor.

Findings

01 The framework clarifies the relationship between evaluation results and the claims they are used to support.

02 Case studies demonstrate improved evaluation design for vision and language models.

03 The framework enhances decision-making by aligning evidence with specific claims.

Abstract

While the capabilities and utility of AI systems have advanced, rigorous norms for evaluating these systems have lagged. Grand claims, such as models achieving general reasoning capabilities, are supported with model performance on narrow benchmarks, like performance on graduate-level exam questions, which provide a limited and potentially misleading assessment. We provide a structured approach for reasoning about the types of evaluative claims that can be made given the available evidence. For instance, our framework helps determine whether performance on a mathematical benchmark is an indication of the ability to solve problems on math tests or instead indicates a broader ability to reason. Our framework is well-suited for the contemporary paradigm in machine learning, where various stakeholders provide measurements and evaluations that downstream users use to validate their claims…

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI · Topic Modeling