Dataset Diversity Metrics and Impact on Classification Models
Théo Sourget, Niclas Claßen, Jack Junchi Xu, Rob van der Goot, Veronika Cheplygina

TL;DR
This paper investigates how various dataset diversity metrics relate to model performance and real-world clinical variability, revealing limited correlation with some metrics and highlighting scanner diversity as a key factor.
Contribution
It systematically evaluates multiple diversity metrics on MorphoMNIST, a toy dataset with controlled perturbations, and PadChest, a chest X-ray dataset, analyzing how they correlate with model performance and with the intuition of a clinical expert.
Findings
Limited correlation between AUC and reference-free diversity metrics
Higher correlation of model performance with FID and semantic diversity metrics
Adding scanners can cause shortcut learning in models
Abstract
The diversity of training datasets is usually perceived as an important factor in obtaining a robust model. However, diversity is often left undefined or is defined differently across papers, and while some metrics exist, the quantification of this diversity is often overlooked when developing new algorithms. In this work, we study the behaviour of multiple dataset diversity metrics for image, text and metadata using MorphoMNIST, a toy dataset with controlled perturbations, and PadChest, a publicly available chest X-ray dataset. We evaluate whether these metrics correlate with each other and with the intuition of a clinical expert. We also assess whether they correlate with downstream-task performance and how they affect the training dynamics of the models. We find limited correlations between the AUC and image or metadata reference-free diversity metrics, but higher correlations with…
Taxonomy
Topics: COVID-19 diagnosis using AI · Artificial Intelligence in Healthcare and Education · Radiomics and Machine Learning in Medical Imaging
