Evaluation Gaps in Machine Learning Practice
Ben Hutchinson, Negar Rostamzadeh, Christina Greer, Katherine Heller, Vinodkumar Prabhakaran

TL;DR
This paper investigates the limited scope of current ML evaluation practices, highlighting the neglect of important contextual and normative factors, and advocates for more comprehensive, context-aware evaluation methods to ensure responsible ML deployment.
Contribution
It empirically analyzes evaluation practices in papers from recent high-profile Computer Vision and Natural Language Processing conferences, revealing implicit normative assumptions and arguing for more contextualized evaluation methodologies.
Findings
Focus on narrow evaluation metrics in CV and NLP
Neglect of contextual and normative properties in evaluations
Implicit commitments such as consequentialism and quantifiability shape evaluation choices
Abstract
Forming a reliable judgement of a machine learning (ML) model's appropriateness for an application ecosystem is critical for its responsible use, and requires considering a broad range of factors including harms, benefits, and responsibilities. In practice, however, evaluations of ML models frequently focus on only a narrow range of decontextualized predictive behaviours. We examine the evaluation gaps between the idealized breadth of evaluation concerns and the observed narrow focus of actual evaluations. Through an empirical study of papers from recent high-profile conferences in the Computer Vision and Natural Language Processing communities, we demonstrate a general focus on a handful of evaluation methods. By considering the metrics and test data distributions used in these methods, we draw attention to which properties of models are centered in the field, revealing the properties…
Taxonomy
Topics: Ethics and Social Impacts of AI · Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI)
