The State of Data Curation at NeurIPS: An Assessment of Dataset Development Practices in the Datasets and Benchmarks Track

Eshta Bhardwaj; Harshit Gujral; Siyi Wu; Ciara Zogheib; Tegan Maharaj; Christoph Becker

arXiv:2410.22473 · cs.CY · January 6, 2025

TL;DR

This paper evaluates dataset development practices at NeurIPS, highlighting gaps in documentation related to environmental impact and ethics, and proposes strategies to enhance data curation for better ML reproducibility and responsibility.

Contribution

It introduces a novel evaluation framework for dataset documentation and applies it to assess current practices, offering targeted recommendations for improvement.

Findings

01. Need for better documentation on environmental footprint

02. Insufficient ethical considerations in dataset documentation

03. Recommendations to improve data management practices

Abstract

Data curation is a field with origins in librarianship and archives, whose scholarship and thinking on data issues go back centuries, if not millennia. The field of machine learning is increasingly observing the importance of data curation to the advancement of both applications and fundamental understanding of machine learning models - evidenced not least by the creation of the Datasets and Benchmarks track itself. This work provides an analysis of dataset development practices at NeurIPS through the lens of data curation. We present an evaluation framework for dataset documentation, consisting of a rubric and toolkit developed through a literature review of data curation principles. We use the framework to assess the strengths and weaknesses in current dataset development practices of 60 datasets published in the NeurIPS Datasets and Benchmarks track from 2021-2023. We summarize key…

Videos

The State of Data Curation at NeurIPS: An Assessment of Dataset Development Practices in the Datasets and Benchmarks Track · SlidesLive

Taxonomy

Topics: Research Data Management Practices · Scientific Computing and Data Management · Biomedical Text Mining and Ontologies