Benchmarking CXR Foundation Models With Publicly Available MIMIC-CXR and NIH-CXR14 Datasets
Jiho Shin, Dominic Marshall, Matthieu Komorowski

TL;DR
This paper benchmarks two large-scale chest X-ray foundation models on public datasets, comparing their performance, stability, and disease-specific embedding structures to establish reproducible evaluation standards.
Contribution
It provides a standardized benchmarking framework for CXR foundation models using public datasets, highlighting differences in performance and stability.
Findings
MedImageInsight achieved slightly higher performance across most tasks
CXR-Foundation showed strong cross-dataset stability
Embeddings revealed disease-specific structures
Abstract
Recent foundation models have demonstrated strong performance in medical image representation learning, yet their comparative behaviour across datasets remains underexplored. This work benchmarks two large-scale chest X-ray (CXR) embedding models, CXR-Foundation (ELIXR v2.0) and MedImageInsight, on the public MIMIC-CXR and NIH ChestX-ray14 datasets. Each model was evaluated using a unified preprocessing pipeline and fixed downstream classifiers to ensure reproducible comparison. We extracted embeddings directly from pre-trained encoders, trained lightweight LightGBM classifiers on multiple disease labels, and reported mean AUROC and F1-score with 95% confidence intervals. MedImageInsight achieved slightly higher performance across most tasks, while CXR-Foundation exhibited strong cross-dataset stability. Unsupervised clustering of MedImageInsight embeddings further revealed a coherent…
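The abstract reports AUROC with 95% confidence intervals but, in the portion available here, does not say how the intervals were computed. The sketch below shows one common way to do it, a percentile bootstrap over the test set, and is an assumption rather than the authors' procedure. The function names `auroc` and `bootstrap_ci`, the number of resamples, and the seed are all illustrative. It uses only the Python standard library and takes binary labels plus classifier scores (for example, the predicted probabilities from a per-disease classifier).

```python
import random


def auroc(labels, scores):
    """AUROC from the Mann-Whitney U statistic, using average ranks for ties."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the block of tied scores starting at i.
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied block
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")
    rank_sum_pos = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUROC (assumed method, not the paper's)."""
    n = len(labels)
    if not 0 < sum(labels) < n:
        raise ValueError("bootstrap needs both classes in the data")
    rng = random.Random(seed)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        # Skip resamples that contain only one class; AUROC is undefined there.
        if 0 < sum(y) < n:
            stats.append(auroc(y, [scores[i] for i in idx]))
    stats.sort()
    lo_idx = int(n_boot * alpha / 2)
    hi_idx = n_boot - 1 - lo_idx
    return stats[lo_idx], stats[hi_idx]


# Toy usage with made-up labels and scores (not data from the paper).
toy_labels = [0, 0, 1, 1, 0, 1, 0, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]
point = auroc(toy_labels, toy_scores)
low, high = bootstrap_ci(toy_labels, toy_scores, n_boot=500)
print(f"AUROC = {point:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

In a multi-label setting such as the one described, this would be run once per disease label and the per-label AUROCs averaged to obtain the mean; how the authors aggregated across labels is not stated in the text shown here.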
Taxonomy
Topics: COVID-19 diagnosis using AI · AI in cancer detection · Radiomics and Machine Learning in Medical Imaging
