Measurement to Meaning: A Validity-Centered Framework for AI Evaluation
Olawale Salaudeen, Anka Reuel, Ahmed Ahmed, Suhana Bedi, Zachary Robertson, Sudharsan Sundar, Ben Domingue, Angelina Wang, Sanmi Koyejo

TL;DR
This paper introduces a structured, validity-centered framework for evaluating AI systems, aiming to improve the clarity and reliability of claims based on diverse measurement methods.
Contribution
It offers a novel framework based on psychometric validity principles to better interpret and construct AI evaluations, enhancing their relevance and rigor.
Findings
The framework clarifies the relationship between evaluation results and claims.
Case studies demonstrate improved evaluation design for vision and language models.
The framework enhances decision-making by aligning evidence with specific claims.
Abstract
While the capabilities and utility of AI systems have advanced, rigorous norms for evaluating these systems have lagged. Grand claims, such as models achieving general reasoning capabilities, are supported with model performance on narrow benchmarks, like performance on graduate-level exam questions, which provide a limited and potentially misleading assessment. We provide a structured approach for reasoning about the types of evaluative claims that can be made given the available evidence. For instance, our framework helps determine whether performance on a mathematical benchmark is an indication of the ability to solve problems on math tests or instead indicates a broader ability to reason. Our framework is well-suited for the contemporary paradigm in machine learning, where various stakeholders provide measurements and evaluations that downstream users use to validate their claims…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI · Topic Modeling
