Beyond Size and Class Balance: Alpha as a New Dataset Quality Metric for Deep Learning
Josiah Couch, Rima Arnaout, Ramy Arnaout

TL;DR
This paper introduces a new dataset quality metric called alpha, based on diversity measures from ecology, which better predicts deep learning performance than size or class balance, especially in medical imaging.
Contribution
The paper proposes alpha, a set of diversity measures, as a novel dataset quality metric that correlates more strongly with model performance than traditional size and class balance metrics.
Findings
Alpha measures explain 67% of performance variance.
Maximizing alpha improves model accuracy by up to 16%.
Size and class balance are less predictive of performance.
Abstract
In deep learning, achieving high performance on image classification tasks requires diverse training sets. However, the current best practice (maximizing dataset size and class balance) does not guarantee dataset diversity. We hypothesized that, for a given model architecture, model performance can be improved by maximizing diversity more directly. To test this hypothesis, we introduce a comprehensive framework of diversity measures from ecology that generalizes familiar quantities like Shannon entropy by accounting for similarities among images. (Size and class balance emerge as special cases.) Analyzing thousands of subsets from seven medical datasets showed that the best correlates of performance were not size or class balance but "big alpha," a set of generalized entropy measures interpreted as the effective number of…
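To make the idea of a similarity-aware "effective number" concrete, here is a minimal sketch of the similarity-sensitive diversity of order q known from the ecology literature (the Leinster-Cobbold form). It is an illustration under assumptions, not the paper's implementation: the function name `similarity_sensitive_diversity`, the arguments `p`, `Z`, and `q`, and the toy similarity matrix are all hypothetical, and the paper's exact definition of "big alpha" may differ in detail.

```python
import numpy as np

def similarity_sensitive_diversity(p, Z=None, q=1.0):
    """Effective number of distinct items, given abundances and similarities.

    p : relative abundances (normalized here to sum to 1)
    Z : similarity matrix with entries in [0, 1] and ones on the diagonal;
        defaults to the identity (all items treated as fully distinct)
    q : order of the diversity; larger q puts more weight on common items
    """
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    Z = np.eye(len(p)) if Z is None else np.asarray(Z, dtype=float)

    # "Ordinariness" of each item: how much of the dataset resembles it.
    zp = Z @ p

    # Only items that are actually present contribute.
    present = p > 0
    p_s, zp_s = p[present], zp[present]

    if np.isclose(q, 1.0):
        # Limit q -> 1: exponential of a similarity-aware Shannon entropy.
        return float(np.exp(-np.sum(p_s * np.log(zp_s))))
    if np.isinf(q):
        # Limit q -> infinity: governed by the most ordinary item.
        return float(1.0 / zp_s.max())
    return float(np.sum(p_s * zp_s ** (q - 1.0)) ** (1.0 / (1.0 - q)))


# Hypothetical toy example: two equally common images that are 50% similar.
p = [0.5, 0.5]
Z = [[1.0, 0.5],
     [0.5, 1.0]]
effective_number = similarity_sensitive_diversity(p, Z, q=1.0)
```

With the identity matrix for `Z`, this reduces to the familiar special cases: q = 0 counts the items present (size), and q = 1 gives the exponential of the Shannon entropy. Nonzero off-diagonal similarities lower the effective number, so near-duplicate images count for less than truly distinct ones.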
Taxonomy
Topics: AI in cancer detection · Radiomics and Machine Learning in Medical Imaging · Artificial Intelligence in Healthcare and Education
Methods: Sparse Evolutionary Training
