Machine Learning Data Practices through a Data Curation Lens: An Evaluation Framework

Eshta Bhardwaj; Harshit Gujral; Siyi Wu; Ciara Zogheib; Tegan Maharaj; Christoph Becker

arXiv:2405.02703·cs.CY·May 7, 2024

TL;DR

This paper introduces an evaluation framework that applies data curation principles to machine learning datasets, highlighting challenges and proposing solutions to improve data practices in ML development.

Contribution

The paper develops a novel rubric for assessing ML datasets against data curation concepts and analyzes the feasibility and challenges of applying it in practice.

Findings

01. Researchers struggle to apply standard data curation principles to ML datasets.

02. Difficulties arise from shared terms with different meanings across fields.

03. Challenges include interpretative flexibility and scope of documentation responsibilities.

Abstract

Studies of dataset development in machine learning call for greater attention to the data practices that make model development possible and shape its outcomes. Many argue that the adoption of theory and practices from archives and data curation fields can support greater fairness, accountability, transparency, and more ethical machine learning. In response, this paper examines data practices in machine learning dataset development through the lens of data curation. We evaluate data practices in machine learning as data curation practices. To do so, we develop a framework for evaluating machine learning datasets using data curation concepts and principles through a rubric. Through a mixed-methods analysis of evaluation results for 25 ML datasets, we study the feasibility of data curation principles to be adopted for machine learning data work in practice and explore how data curation is…
