Confounding variables can degrade generalization performance of radiological deep learning models
John R. Zech, Marcus A. Badgeley, Manway Liu, Anthony B. Costa, Joseph J. Titano, Eric K. Oermann

TL;DR
This study shows that CNNs trained to screen chest x-rays for pneumonia performed significantly worse on x-rays from outside hospitals in 3 of 5 comparisons, an effect attributed to confounding variables, which highlights the challenge of clinical generalization.
Contribution
It demonstrates the impact of confounding variables on CNN generalization across hospital systems and emphasizes the need for careful evaluation of model robustness.
Findings
In 3 of 5 natural comparisons, CNNs performed significantly worse on x-rays from external hospitals than on held-out x-rays from the original hospital systems.
CNNs can accurately identify hospital system and department from x-rays.
Performance estimates may overstate real-world effectiveness due to confounding factors.
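The last finding can be illustrated with a toy calculation. The sketch below is not the authors' code or data: the two sites, their pneumonia counts, and the "site-only" scoring rule are all hypothetical, chosen only to show how a model that recognizes the hospital, and nothing about the disease, can still earn an above-chance AUC on pooled data when prevalence differs between sites.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: site A has 4 pneumonia cases in 10 x-rays,
# site B has 1 case in 10 x-rays (different prevalence by site).
site_a_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
site_b_labels = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# A hypothetical "model" that has learned only the site, not the disease:
# every site A x-ray scores 1.0 and every site B x-ray scores 0.0.
site_a_scores = [1.0] * len(site_a_labels)
site_b_scores = [0.0] * len(site_b_labels)

pooled_auc = auc(site_a_labels + site_b_labels, site_a_scores + site_b_scores)
site_a_auc = auc(site_a_labels, site_a_scores)
site_b_auc = auc(site_b_labels, site_b_scores)

print(f"pooled AUC: {pooled_auc:.2f}")  # above chance, driven only by site
print(f"site A AUC: {site_a_auc:.2f}")  # chance level within a single site
print(f"site B AUC: {site_b_auc:.2f}")  # chance level within a single site
```

In this toy setup the pooled AUC is 0.70 while the AUC within each site is 0.50, so a pooled test set alone would overstate how well the model detects disease. Evaluating per site, or on a hospital not seen in training, is one way to expose the gap.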
Abstract
Early results in using convolutional neural networks (CNNs) on x-rays to diagnose disease have been promising, but it has not yet been shown that models trained on x-rays from one hospital or one group of hospitals will work equally well at different hospitals. Before these tools are used for computer-aided diagnosis in real-world clinical settings, we must verify their ability to generalize across a variety of hospital systems. A cross-sectional design was used to train and evaluate pneumonia screening CNNs on 158,323 chest x-rays from NIH (n=112,120 from 30,805 patients), Mount Sinai (n=42,396 from 12,904 patients), and Indiana (n=3,807 from 3,683 patients). In 3 of 5 natural comparisons, performance on chest x-rays from outside hospitals was significantly lower than on held-out x-rays from the original hospital systems. CNNs were able to detect where an x-ray was acquired (hospital…
