The State of Data Curation at NeurIPS: An Assessment of Dataset Development Practices in the Datasets and Benchmarks Track
Eshta Bhardwaj, Harshit Gujral, Siyi Wu, Ciara Zogheib, Tegan Maharaj, Christoph Becker

TL;DR
This paper evaluates dataset development practices at NeurIPS, highlighting gaps in documentation related to environmental impact and ethics, and proposes strategies to enhance data curation for better ML reproducibility and responsibility.
Contribution
It introduces a novel evaluation framework for dataset documentation and applies it to assess current practices, offering targeted recommendations for improvement.
Findings
Need for better documentation on environmental footprint
Insufficient ethical considerations in dataset documentation
Recommendations to improve data management practices
Abstract
Data curation is a field with origins in librarianship and archives, whose scholarship and thinking on data issues go back centuries, if not millennia. The field of machine learning is increasingly observing the importance of data curation to the advancement of both applications and fundamental understanding of machine learning models - evidenced not least by the creation of the Datasets and Benchmarks track itself. This work provides an analysis of dataset development practices at NeurIPS through the lens of data curation. We present an evaluation framework for dataset documentation, consisting of a rubric and toolkit developed through a literature review of data curation principles. We use the framework to assess the strengths and weaknesses in current dataset development practices of 60 datasets published in the NeurIPS Datasets and Benchmarks track from 2021-2023. We summarize key…
Taxonomy
Topics: Research Data Management Practices · Scientific Computing and Data Management · Biomedical Text Mining and Ontologies
