The Dataset Nutrition Label: A Framework To Drive Higher Data Quality Standards
Sarah Holland, Ahmed Hosny, Sarah Newman, Joshua Joseph, Kasia Chmielinski

TL;DR
The paper introduces the Dataset Nutrition Label, a flexible framework designed to improve data quality and analysis practices in AI development by providing standardized, comprehensive dataset summaries.
Contribution
It presents a novel, adaptable framework for dataset analysis that can be applied across domains, with an open source prototype demonstrating its practical utility.
Findings
The Label enhances data analysis robustness.
It aids in dataset selection for AI training.
It promotes better data collection practices.
Abstract
Artificial intelligence (AI) systems built on incomplete or biased data will often exhibit problematic outcomes. Current methods of data analysis, particularly before model development, are costly and not standardized. The Dataset Nutrition Label (the Label) is a diagnostic framework that lowers the barrier to standardized data analysis by providing a distilled yet comprehensive overview of dataset "ingredients" before AI model development. Building a Label that can be applied across domains and data types requires that the framework itself be flexible and adaptable; as such, the Label is comprised of diverse qualitative and quantitative modules generated through multiple statistical and probabilistic modelling backends, but displayed in a standardized format. To demonstrate and advance this concept, we generated and published an open source prototype with seven sample modules on the…
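To make the idea of independent modules feeding one standardized display more concrete, here is a minimal sketch in Python. It is not the authors' prototype: the module names (metadata_module, missing_module, numeric_summary_module), the build_label helper, and the sample data are all hypothetical, and the paper's actual seven modules and statistical backends are not reproduced here. The sketch only assumes a small tabular dataset and shows how separate qualitative and quantitative summaries could be collected under one common structure.

```python
import csv
import io
import json
import statistics
from typing import Callable, Dict, List

Row = Dict[str, str]
Module = Callable[[List[Row]], dict]


def metadata_module(rows: List[Row]) -> dict:
    """Descriptive module (hypothetical): basic shape of the dataset."""
    columns = list(rows[0].keys()) if rows else []
    return {"n_rows": len(rows), "columns": columns}


def missing_module(rows: List[Row]) -> dict:
    """Quantitative module (hypothetical): fraction of empty cells per column."""
    if not rows:
        return {}
    return {
        col: sum(1 for r in rows if r[col].strip() == "") / len(rows)
        for col in rows[0]
    }


def numeric_summary_module(rows: List[Row]) -> dict:
    """Quantitative module (hypothetical): mean/min/max for columns whose
    non-empty values all parse as numbers; other columns are skipped."""
    summary = {}
    for col in (rows[0] if rows else []):
        values = [r[col].strip() for r in rows if r[col].strip()]
        try:
            nums = [float(v) for v in values]
        except ValueError:
            continue  # not a numeric column
        if nums:
            summary[col] = {
                "mean": statistics.fmean(nums),
                "min": min(nums),
                "max": max(nums),
            }
    return summary


def build_label(rows: List[Row], modules: Dict[str, Module]) -> dict:
    """Run every module and collect the results under one common structure."""
    return {name: module(rows) for name, module in modules.items()}


# Made-up sample data, for illustration only.
SAMPLE_CSV = """age,income,state
34,52000,NY
,61000,CA
45,,NY
29,48000,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
label = build_label(
    rows,
    {
        "metadata": metadata_module,
        "missing": missing_module,
        "numeric_summary": numeric_summary_module,
    },
)
print(json.dumps(label, indent=2))
```

Because each module is just a function from rows to a dictionary, adding or removing a module does not change how the label is assembled or displayed, which is one plausible way to read the flexibility the abstract describes.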
Taxonomy
Topics: Nutritional Studies and Diet · Nutrition, Genetics, and Disease · Biomedical Text Mining and Ontologies
