Data and its (dis)contents: A survey of dataset development and use in machine learning research

Amandalynne Paullada; Inioluwa Deborah Raji; Emily M. Bender; Emily Denton; Alex Hanna

arXiv:2012.05345·cs.LG·November 16, 2021

TL;DR

This survey reviews the role of datasets in machine learning, highlighting issues in data collection and use, and advocates for more careful practices to address ethical and practical challenges.

Contribution

It provides a comprehensive overview of dataset development, use, and associated concerns, emphasizing the need for improved data practices in machine learning research.

Findings

01 Identifies limitations in current dataset collection practices

02 Highlights ethical concerns in data sharing and use

03 Recommends more cautious and thorough data understanding

Abstract

Datasets have played a foundational role in the advancement of machine learning research. They form the basis for the models we design and deploy, as well as our primary medium for benchmarking and evaluation. Furthermore, the ways in which we collect, construct and share these datasets inform the kinds of problems the field pursues and the methods explored in algorithm development. However, recent work from a breadth of perspectives has revealed the limitations of predominant practices in dataset collection and use. In this paper, we survey the many concerns raised about the way we collect and use data in machine learning and advocate that a more cautious and thorough understanding of data is necessary to address several of the practical and ethical issues of the field.
