The Benchmarking Epistemology: Construct Validity for Evaluating Machine Learning Models
Timo Freiesleben, Sebastian Zezulka

TL;DR
This paper develops a construct-validity framework for the epistemology of benchmarking in machine learning, making explicit the assumptions required to draw scientific inferences from benchmark scores.
Contribution
It introduces conditions of construct validity for evaluating machine learning models and applies them through case studies to clarify when benchmark scores support scientific claims.
Findings
Benchmark scores alone are insufficient for scientific inference without explicit assumptions.
The framework clarifies the conditions needed for benchmarks to support diverse scientific claims.
Three case studies, each exemplifying a typical intended inference, demonstrate how the validity framework applies in practice.
Abstract
Predictive benchmarking, the evaluation of machine learning models based on predictive performance and competitive ranking, is a central epistemic practice in machine learning research and an increasingly prominent method for scientific inquiry. Yet, benchmark scores alone provide at best measurements of model performance relative to an evaluation dataset and a concrete learning problem. Drawing substantial scientific inferences from the results, say about theoretical tasks like image classification, requires additional assumptions about the theoretical structure of the learning problems, evaluation functions, and data distributions. We make these assumptions explicit by developing conditions of construct validity inspired by psychological measurement theory. We examine these assumptions in practice through three case studies, each exemplifying a typical intended inference: measuring…
