A Framework for Deprecating Datasets: Standardizing Documentation, Identification, and Communication

Alexandra Sasha Luccioni; Frances Corry; Hamsini Sridharan; Mike Ananny; Jason Schultz; Kate Crawford

arXiv:2111.04424·cs.CY·May 11, 2022

Open Access · 1 dataset

TL;DR

This paper introduces a comprehensive framework for dataset deprecation in machine learning, emphasizing standardization, documentation, and communication to improve data stewardship and prevent continued circulation of deprecated datasets.

Contribution

The paper proposes a Dataset Deprecation Framework and advocates for a centralized repository to improve dataset life cycle management in ML.

Findings

01

Identified issues with continued circulation of deprecated datasets

02

Developed a detailed framework for dataset deprecation processes

03

Suggested a centralized repository for dataset management

Abstract

Datasets are central to training machine learning (ML) models. The ML community has recently made significant improvements to data stewardship and documentation practices across the model development life cycle. However, the act of deprecating, or deleting, datasets has been largely overlooked, and there are currently no standardized approaches for structuring this stage of the dataset life cycle. In this paper, we study the practice of dataset deprecation in ML, identify several cases of datasets that continued to circulate despite having been deprecated, and describe the different technical, legal, ethical, and organizational issues raised by such continuations. We then propose a Dataset Deprecation Framework that includes considerations of risk, mitigation of impact, appeal mechanisms, timeline, post-deprecation protocols, and publication checks that can be adapted and implemented by…
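As a rough illustration only, the six considerations listed in the abstract could be captured as a structured record. The sketch below is an assumption, not the authors' implementation: the class name `DeprecationReport`, its field names, and the `missing_sections` helper are hypothetical and simply mirror the wording of the abstract.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class DeprecationReport:
    """Hypothetical record with one free-text field per consideration
    named in the abstract. Not an official schema from the paper."""

    dataset_name: str
    risk: str = ""
    mitigation_of_impact: str = ""
    appeal_mechanism: str = ""
    timeline: str = ""
    post_deprecation_protocol: str = ""
    publication_check: str = ""

    def missing_sections(self) -> List[str]:
        """Return the names of considerations that are still empty."""
        return [
            f.name
            for f in fields(self)
            if f.name != "dataset_name" and not getattr(self, f.name).strip()
        ]


# Usage: a partially completed report for a placeholder dataset.
report = DeprecationReport(
    dataset_name="example-dataset",
    risk="Placeholder description of the risks of continued use.",
    timeline="Placeholder description of the deprecation schedule.",
)
print(report.missing_sections())
```

A flat record like this keeps each consideration visible, so an incomplete report is easy to spot before it is published.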

Code & Models

Datasets

society-ethics/papers
dataset · 44 dl

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Privacy-Preserving Technologies in Data · Machine Learning and Data Classification