Demonstrating The Risk of Imbalanced Datasets in Chest X-ray Image-based Diagnostics by Prototypical Relevance Propagation
Srishti Gautam, Marina M.-C. Höhne, Stine Hansen, Robert Jenssen, and Michael Kampffmeyer

TL;DR
This paper investigates how imbalanced datasets across sources can cause models to learn spurious correlations in chest X-ray diagnostics, emphasizing the need for balanced data and transparent models to improve reliability.
Contribution
It provides a thorough analysis of label imbalance effects in multi-source chest X-ray datasets and demonstrates how balancing sources reduces spurious correlations.
Findings
Imbalanced source domains lead models to exploit source-specific cues.
Balanced datasets improve model transparency and reduce spurious learning.
Using self-explaining models enhances detection of learned biases.
Abstract
The recent trend of integrating multi-source Chest X-Ray datasets to improve automated diagnostics raises concerns that models learn to exploit source-specific correlations to improve performance by recognizing the source domain of an image rather than the medical pathology. We hypothesize that this effect is enforced by and leverages label-imbalance across the source domains, i.e., prevalence of a disease corresponding to a source. Therefore, in this work, we perform a thorough study of the effect of label-imbalance in multi-source training for the task of pneumonia detection on the widely used ChestX-ray14 and CheXpert datasets. The results highlight and stress the importance of using more faithful and transparent self-explaining models for automated diagnosis, thus enabling the inherent detection of spurious learning. They further illustrate that this undesirable effect of learning…
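To make the label-imbalance hypothesis concrete, here is a minimal sketch in plain Python. It is not the paper's method or data: the source names, the counts, and the helper functions `prevalence_by_source` and `source_shortcut_accuracy` are hypothetical and exist only for illustration. It shows how a predictor that sees nothing but the source of an image can score well when disease prevalence differs by source, and falls to chance when the sources are balanced.

```python
from collections import Counter


def prevalence_by_source(records):
    """Return {source: fraction of positive labels} for (source, label) pairs."""
    totals, positives = Counter(), Counter()
    for source, label in records:
        totals[source] += 1
        positives[source] += label
    return {s: positives[s] / totals[s] for s in totals}


def source_shortcut_accuracy(records):
    """Accuracy of a predictor that ignores the image entirely and
    outputs the majority label of the image's source."""
    prev = prevalence_by_source(records)
    correct = sum(1 for s, y in records if (prev[s] >= 0.5) == bool(y))
    return correct / len(records)


# Hypothetical counts (NOT from the paper): label 1 = pneumonia, 0 = healthy.
imbalanced = ([("source_A", 1)] * 90 + [("source_A", 0)] * 10
              + [("source_B", 1)] * 10 + [("source_B", 0)] * 90)
balanced = ([("source_A", 1)] * 50 + [("source_A", 0)] * 50
            + [("source_B", 1)] * 50 + [("source_B", 0)] * 50)

print(prevalence_by_source(imbalanced))      # per-source disease prevalence
print(source_shortcut_accuracy(imbalanced))  # 0.9: the source alone is predictive
print(source_shortcut_accuracy(balanced))    # 0.5: the shortcut carries no signal
```

In this toy setting, the gap between the two accuracies is the incentive a model has to recognize the source domain instead of the pathology. Checking per-source prevalence before merging datasets is one simple way to spot that incentive.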
Taxonomy
Topics: COVID-19 diagnosis using AI · Radiomics and Machine Learning in Medical Imaging · Lung Cancer Diagnosis and Treatment
