CheXpert Plus: Augmenting a Large Chest X-ray Dataset with Text Radiology Reports, Patient Demographics and Additional Image Formats

Pierre Chambon; Jean-Benoit Delbrouck; Thomas Sounack; Shih-Cheng Huang; Zhihong Chen; Maya Varma; Steven QH Truong; Chu The Chuong; Curtis P. Langlotz

arXiv:2405.19538·cs.CL·June 5, 2024·3 cites

TL;DR

CheXpert Plus extends the CheXpert chest X-ray dataset with paired text radiology reports, patient demographics, and additional image formats, supporting research on more robust and fair radiology models and on cross-institutional training.

Contribution

The paper releases what the authors describe as the largest publicly available radiology text dataset, paired with CheXpert images, de-identified, and accompanied by patient metadata, which enables cross-institutional model training.

Findings

01 Largest publicly released radiology text dataset, with 36 million tokens (13 million of them impression tokens).

02 First large-scale English paired dataset enabling cross-institution training.

03 Large de-identification effort covering almost 1 million PHI spans in radiology reports.

Abstract

Since the release of the original CheXpert paper five years ago, CheXpert has become one of the most widely used and cited clinical AI datasets. The emergence of vision language models has sparked an increase in demands for sharing reports linked to CheXpert images, along with a growing interest among AI fairness researchers in obtaining demographic data. To address this, CheXpert Plus serves as a new collection of radiology data sources, made publicly available to enhance the scaling, performance, robustness, and fairness of models for all subsequent machine learning tasks in the field of radiology. CheXpert Plus is the largest text dataset publicly released in radiology, with a total of 36 million text tokens, including 13 million impression tokens. To the best of our knowledge, it represents the largest text de-identification effort in radiology, with almost 1 million PHI spans…

Code & Models

Repositories

stanford-aimi/chexpert-plus
PyTorch, official

Taxonomy

Topics: Radiomics and Machine Learning in Medical Imaging