Navigating Dataset Documentations in AI: A Large-Scale Analysis of Dataset Cards on Hugging Face

Xinyu Yang; Weixin Liang; James Zou

arXiv:2401.13822 · cs.LG · January 26, 2024 · 6 cites

TL;DR

This study analyzes 7,433 dataset cards on Hugging Face to understand documentation practices, revealing variability in completeness, focus areas, and themes, and emphasizing the importance of thorough documentation for dataset quality and usability.

Contribution

It provides the first large-scale empirical analysis of dataset documentation practices on Hugging Face, highlighting gaps and areas for improvement.

Findings

01

Dataset card completion rates vary widely, and this variation correlates with dataset popularity.

02

Practitioners focus more on Dataset Description and Structure, less on Usage considerations.

03

Themes discussed in dataset documentation include technical aspects, social impacts, and limitations.

Abstract

Advances in machine learning are closely tied to the creation of datasets. While data documentation is widely recognized as essential to the reliability, reproducibility, and transparency of ML, we lack a systematic empirical understanding of current dataset documentation practices. To shed light on this question, here we take Hugging Face -- one of the largest platforms for sharing and collaborating on ML models and datasets -- as a prominent case study. By analyzing all 7,433 dataset cards on Hugging Face, our investigation provides an overview of the Hugging Face dataset ecosystem and insights into dataset documentation practices, yielding 5 main findings: (1) The dataset card completion rate shows marked heterogeneity correlated with dataset popularity. (2) A granular examination of each section within the dataset card reveals that the practitioners seem to prioritize…

Code & Models

Repositories

youngxinyu1802/huggingface-dataset-card-analysis
Official

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI · Scientific Computing and Data Management