Structured dataset documentation: a datasheet for CheXpert
Christian Garbin, Pranav Rajpurkar, Jeremy Irvin, Matthew P. Lungren, Oge Marques

TL;DR
This paper presents a detailed datasheet for the CheXpert dataset and argues that structured documentation is essential for reliable medical image datasets and for machine learning applications in radiology.
Contribution
It introduces a comprehensive datasheet for CheXpert, illustrating how structured dataset documentation improves transparency and reliability in medical AI research.
Findings
Radiologist involvement ensures high-quality labels.
Structured documentation clarifies dataset composition and usage.
The datasheet serves as a model for dataset transparency.
Abstract
Billions of X-ray images are taken worldwide each year. Machine learning, and deep learning in particular, has shown potential to help radiologists triage and diagnose images. However, deep learning requires large datasets with reliable labels. The CheXpert dataset was created with the participation of board-certified radiologists, resulting in the strong ground truth needed to train deep learning networks. Following the structured format of Datasheets for Datasets, this paper expands on the original CheXpert paper and other sources to show the critical role played by radiologists in the creation of reliable labels and to describe the different aspects of the dataset composition in detail. Such structured documentation intends to increase the awareness in the machine learning and medical communities of the strengths, applications, and evolution of CheXpert, thereby advancing the field…
Taxonomy
Topics: COVID-19 diagnosis using AI · Machine Learning in Healthcare · AI in cancer detection
