A Framework for Deprecating Datasets: Standardizing Documentation, Identification, and Communication
Alexandra Sasha Luccioni, Frances Corry, Hamsini Sridharan, Mike Ananny, Jason Schultz, Kate Crawford

TL;DR
This paper introduces a comprehensive framework for dataset deprecation in machine learning, emphasizing standardization, documentation, and communication to improve data stewardship and prevent continued circulation of deprecated datasets.
Contribution
It proposes a novel Dataset Deprecation Framework and advocates for a centralized repository to enhance dataset lifecycle management in ML.
Findings
Identified issues with continued circulation of deprecated datasets
Developed a detailed framework for dataset deprecation processes
Suggested a centralized repository for dataset management
Abstract
Datasets are central to training machine learning (ML) models. The ML community has recently made significant improvements to data stewardship and documentation practices across the model development life cycle. However, the act of deprecating, or deleting, datasets has been largely overlooked, and there are currently no standardized approaches for structuring this stage of the dataset life cycle. In this paper, we study the practice of dataset deprecation in ML, identify several cases of datasets that continued to circulate despite having been deprecated, and describe the different technical, legal, ethical, and organizational issues raised by such continuations. We then propose a Dataset Deprecation Framework that includes considerations of risk, mitigation of impact, appeal mechanisms, timeline, post-deprecation protocols, and publication checks that can be adapted and implemented by…
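As a rough illustration of how the six considerations listed in the abstract could be tracked for a single dataset, here is a minimal Python sketch. It is not the paper's implementation: the class name `DeprecationRecord`, its fields, the free-text `notes` mapping, and the completeness check are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from datetime import date

# The six considerations named in the abstract. The identifier spellings
# are our own; the paper does not prescribe field names.
CONSIDERATIONS = (
    "risk",
    "mitigation_of_impact",
    "appeal_mechanism",
    "timeline",
    "post_deprecation_protocol",
    "publication_check",
)


@dataclass
class DeprecationRecord:
    """Hypothetical record a repository might keep for one deprecated dataset."""

    dataset_name: str
    deprecated_on: date
    # Maps a consideration name to a free-text note (an assumed format).
    notes: dict = field(default_factory=dict)

    def missing_considerations(self) -> list:
        """Return the considerations that have no non-empty note yet."""
        return [c for c in CONSIDERATIONS if not self.notes.get(c, "").strip()]

    def is_complete(self) -> bool:
        """True when every consideration has a non-empty note."""
        return not self.missing_considerations()


if __name__ == "__main__":
    # Placeholder dataset name, date, and notes, for illustration only.
    record = DeprecationRecord(
        dataset_name="example-dataset",
        deprecated_on=date(2022, 1, 1),
        notes={"risk": "placeholder note", "timeline": "placeholder note"},
    )
    print(record.missing_considerations())
```

A centralized repository of the kind the paper advocates could, under these assumptions, refuse to list a deprecation notice until `is_complete()` returns true.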
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Privacy-Preserving Technologies in Data · Machine Learning and Data Classification
