A data- and compute-efficient chest X-ray foundation model beyond aggressive scaling
Chong Wang, Yabin Zhang, Yunhe Gao, Maya Varma, Clemence Mottez, Faidra Patsatzi, Jiaming Liu, Jin Long, Jean-Benoit Delbrouck, Sergios Gatidis, Akshay S. Chaudhari, Curtis P. Langlotz

TL;DR
CheXficient is an efficient chest X-ray foundation model pretrained on a small, curated subset of data, achieving high performance while reducing computational costs and improving generalization to rare conditions.
Contribution
We introduce CheXficient, a data-efficient pretraining approach that selectively prioritizes informative samples, reducing data and compute requirements without sacrificing performance.
Findings
Pretrained on only 22.7% of the available data, CheXficient matches or exceeds models pretrained on the full dataset.
Selective sampling improves generalization to rare and under-represented conditions.
CheXficient achieves comparable performance across 20 benchmarks while using less compute.
Abstract
Foundation models for medical imaging are typically pretrained on increasingly large datasets, following a "scale-at-all-costs" paradigm. However, this strategy faces two critical challenges: large-scale medical datasets often contain substantial redundancy and severe class imbalance that bias representation learning toward over-represented patterns, and indiscriminate training regardless of heterogeneity in data quality incurs considerable computational inefficiency. Here we demonstrate that active, principled data curation during pretraining can serve as a viable, cost-effective alternative to brute-force dataset enlargement. We introduce CheXficient, a chest X-ray (CXR) foundation model that selectively prioritizes informative training samples. CheXficient is pretrained on only 22.7% of 1,235,004 paired CXR images and reports while consuming under 27.3% of the total compute budget,…
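To put the abstract's percentages in absolute terms, the short calculation below converts them into approximate counts and reduction factors. It is only a back-of-the-envelope sketch: the three constants come from the abstract, while the function and variable names are illustrative. Because 22.7% is a rounded figure, the resulting subset size is an estimate, not a number reported by the authors.

```python
# Figures quoted in the abstract.
TOTAL_PAIRS = 1_235_004          # paired CXR images and reports in the full pool
DATA_FRACTION = 0.227            # share of pairs used for pretraining (rounded)
COMPUTE_FRACTION_UPPER = 0.273   # "under 27.3%" of the total compute budget


def approx_subset_size(total: int, fraction: float) -> int:
    """Estimate how many pairs a given fraction of the pool corresponds to."""
    return round(total * fraction)


subset_pairs = approx_subset_size(TOTAL_PAIRS, DATA_FRACTION)
excluded_pairs = TOTAL_PAIRS - subset_pairs

# Reduction factors relative to training on the full pool.
data_reduction = 1 / DATA_FRACTION
# Compute use is stated as an upper bound, so this is a lower bound on the saving.
min_compute_reduction = 1 / COMPUTE_FRACTION_UPPER

print(f"Approximate pretraining subset: {subset_pairs:,} pairs")
print(f"Pairs not used for pretraining: {excluded_pairs:,}")
print(f"Data reduction: about {data_reduction:.1f}x")
print(f"Compute reduction: at least about {min_compute_reduction:.1f}x")
```

Under these figures the curated subset is roughly 280 thousand image-report pairs, with the remaining 950 thousand or so left out of pretraining.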
Taxonomy
Topics: COVID-19 diagnosis using AI · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
