Improving the Validity and Practical Usefulness of AI/ML Evaluations Using an Estimands Framework
Olivier Binette, Jerome P. Reiter

TL;DR
This paper introduces an estimands framework adapted from clinical trials to enhance the validity and practical relevance of AI/ML model evaluations, addressing issues of construct validity and misleading rankings.
Contribution
It proposes a systematic estimands framework for AI/ML evaluation, improving inference clarity and interpretability over traditional benchmark practices.
Findings
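Identifies rank reversals (incorrect rankings of competing models) in commonly used evaluation methods involving cross-validation, clustering evaluation, and LLM benchmarking; these can occur with high probability even when performance differences are large.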
Demonstrates how the framework uncovers evaluation biases.
Shows improved interpretability of evaluation results.
Abstract
Commonly, AI or machine learning (ML) models are evaluated on benchmark datasets. This practice supports innovative methodological research, but benchmark performance can be poorly correlated with performance in real-world applications, a construct validity issue. To improve the validity and practical usefulness of evaluations, we propose using an estimands framework adapted from international clinical trials guidelines. This framework provides a systematic structure for inference and reporting in evaluations, emphasizing the importance of a well-defined estimation target. We illustrate our proposal on examples of commonly used evaluation methodologies (involving cross-validation, clustering evaluation, and LLM benchmarking) that can lead to incorrect rankings of competing models (rank reversals) with high probability, even when performance differences are large. We demonstrate how…
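The abstract's point about a well-defined estimation target can be illustrated with a small sketch. This is not one of the paper's examples (those involve cross-validation, clustering evaluation, and LLM benchmarking). It is a simplified, hypothetical case in which the benchmark and the deployment setting are different target populations. The model names, subgroup labels, accuracies, and population mixes below are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    acc_by_group: dict  # hypothetical accuracy within each subgroup


def expected_accuracy(model, mix):
    """Population-level accuracy: subgroup accuracies weighted by the mix."""
    return sum(mix[g] * model.acc_by_group[g] for g in mix)


def rank(models, mix):
    """Model names ordered from best to worst under a given population mix."""
    ordered = sorted(models, key=lambda m: expected_accuracy(m, mix), reverse=True)
    return [m.name for m in ordered]


# Hypothetical models: A excels on easy cases, B is more robust on hard ones.
model_a = Model("A", {"easy": 0.95, "hard": 0.50})
model_b = Model("B", {"easy": 0.85, "hard": 0.80})

# Two different estimands: accuracy over the benchmark population versus
# accuracy over the (assumed) deployment population.
benchmark_mix = {"easy": 0.9, "hard": 0.1}
deployment_mix = {"easy": 0.3, "hard": 0.7}

benchmark_rank = rank([model_a, model_b], benchmark_mix)
deployment_rank = rank([model_a, model_b], deployment_mix)

print("benchmark ranking: ", benchmark_rank)
print("deployment ranking:", deployment_rank)
```

Under these assumed numbers, model A scores higher on the benchmark mix while model B scores higher on the deployment mix, so a ranking reported without stating its target population would be misleading for the deployment use case.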
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI) · Impact of AI and Big Data on Business and Society
