Navigating Dataset Documentations in AI: A Large-Scale Analysis of Dataset Cards on Hugging Face
Xinyu Yang, Weixin Liang, James Zou

TL;DR
This study analyzes 7,433 dataset cards on Hugging Face to understand documentation practices, revealing variability in completeness, focus areas, and themes, and emphasizing the importance of thorough documentation for dataset quality and usability.
Contribution
The paper provides the first large-scale empirical analysis of dataset documentation practices on Hugging Face, highlighting gaps and areas for improvement.
Findings
Dataset card completion rates vary widely, and this variation correlates with dataset popularity.
Practitioners devote more attention to the Dataset Description and Dataset Structure sections than to usage considerations.
Themes discussed in dataset documentation include technical details, social impacts, and limitations.
Abstract
Advances in machine learning are closely tied to the creation of datasets. While data documentation is widely recognized as essential to the reliability, reproducibility, and transparency of ML, we lack a systematic empirical understanding of current dataset documentation practices. To shed light on this question, here we take Hugging Face -- one of the largest platforms for sharing and collaborating on ML models and datasets -- as a prominent case study. By analyzing all 7,433 dataset cards on Hugging Face, our investigation provides an overview of the Hugging Face dataset ecosystem and insights into dataset documentation practices, yielding 5 main findings: (1) The dataset card completion rate shows marked heterogeneity correlated with dataset popularity. (2) A granular examination of each section within the dataset card reveals that the practitioners seem to prioritize…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI · Scientific Computing and Data Management
