Bringing the People Back In: Contesting Benchmark Machine Learning Datasets
Remi Denton, Alex Hanna, Razvan Amironesei, Andrew Smart, Hilary Nicole, Morgan Klaus Scheuerman

TL;DR
This paper proposes a genealogy approach to critically analyze the histories, values, and norms embedded in machine learning benchmark datasets, aiming to uncover biases and foster contestation.
Contribution
It introduces a research program for investigating the socio-historical context of dataset creation and emphasizes understanding the labor and values involved.
Findings
Benchmark datasets operate as infrastructure influencing model development.
Four key research questions guide the critical analysis of datasets.
Understanding dataset creation reveals biases and contestation opportunities.
Abstract
In response to algorithmic unfairness embedded in sociotechnical systems, significant attention has been focused on the contents of machine learning datasets which have revealed biases towards white, cisgender, male, and Western data subjects. In contrast, comparatively less attention has been paid to the histories, values, and norms embedded in such datasets. In this work, we outline a research program - a genealogy of machine learning data - for investigating how and why these datasets have been created, what and whose values influence the choices of data to collect, the contextual and contingent conditions of their creation. We describe the ways in which benchmark datasets in machine learning operate as infrastructure and pose four research questions for these datasets. This interrogation forces us to "bring the people back in" by aiding us in understanding the labor embedded in…
Taxonomy
Topics: Ethics and Social Impacts of AI · Innovative Human-Technology Interaction · Mobile Crowdsensing and Crowdsourcing
