A Systematic Review of NeurIPS Dataset Management Practices

Yiwei Wu; Leah Ajmani; Shayne Longpre; Hanlin Li

arXiv:2411.00266 · cs.LG · November 4, 2024

TL;DR

This paper systematically reviews dataset management practices at NeurIPS, highlighting inconsistencies in provenance, hosting, and metadata, and emphasizes the need for standardized data infrastructures.

Contribution

The paper provides a comprehensive overview of current dataset management practices at NeurIPS and identifies key issues and gaps in provenance, metadata, and version control.

Findings

01. Dataset provenance is often unclear due to ambiguous filtering and curation processes.

02. Dataset hosting sites vary widely in metadata support.

03. There is an urgent need for standardized data management infrastructures.

Abstract

As new machine learning methods demand larger training datasets, researchers and developers face significant challenges in dataset management. Although ethics reviews, documentation, and checklists have been established, it remains uncertain whether consistent dataset management practices exist across the community. This lack of a comprehensive overview hinders our ability to diagnose and address fundamental tensions and ethical issues related to managing large datasets. We present a systematic review of datasets published at the NeurIPS Datasets and Benchmarks track, focusing on four key aspects: provenance, distribution, ethical disclosure, and licensing. Our findings reveal that dataset provenance is often unclear due to ambiguous filtering and curation processes. Additionally, a variety of sites are used for dataset hosting, but only a few offer structured metadata and version…

Videos

A Systematic Review of NeurIPS Dataset Management Practices · SlidesLive

Taxonomy

Topics: Machine Learning in Healthcare · Brain Tumor Detection and Classification · Artificial Intelligence in Healthcare