Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning
Eun Seo Jo, Timnit Gebru

TL;DR
This paper emphasizes the importance of systematic data collection in machine learning, especially for sociocultural data, by drawing lessons from archival practices to improve fairness, transparency, and ethics.
Contribution
It advocates establishing a dedicated ML specialization focused on methodologies for data collection and annotation, drawing on archival and library practices.
Findings
Highlights parallels between archival practices and sociocultural data collection
Identifies key challenges: consent, power, inclusivity, transparency, ethics & privacy
Proposes interdisciplinary approaches to improve data collection in ML
Abstract
A growing body of work shows that many problems in fairness, accountability, transparency, and ethics in machine learning systems are rooted in decisions surrounding the data collection and annotation process. In spite of its fundamental nature however, data collection remains an overlooked part of the machine learning (ML) pipeline. In this paper, we argue that a new specialization should be formed within ML that is focused on methodologies for data collection and annotation: efforts that require institutional frameworks and procedures. Specifically for sociocultural data, parallels can be drawn from archives and libraries. Archives are the longest standing communal effort to gather human information and archive scholars have already developed the language and procedures to address and discuss many challenges pertaining to data collection such as consent, power, inclusivity,…
Taxonomy
Topics: Ethics and Social Impacts of AI · Data Quality and Management · Privacy-Preserving Technologies in Data
