Are All Training Examples Created Equal? An Empirical Study
Kailas Vodrahalli, Ke Li, Jitendra Malik

TL;DR
This paper measures how much individual training examples matter in large computer vision datasets and finds that, for some datasets, a small, carefully selected subset is sufficient for training, with implications for active learning.
Contribution
It introduces a gradient-based importance measure and uses it to empirically analyze the relative importance of training images in four datasets of varying complexity, offering insights into dataset diversity and training efficiency.
Findings
Small subsamples can be sufficient for training in some datasets
Relative importance of examples varies across datasets
The analysis method aids understanding of dataset diversity
Abstract
Modern computer vision algorithms often rely on very large training datasets. However, it is conceivable that a carefully selected subsample of the dataset is sufficient for training. In this paper, we propose a gradient-based importance measure that we use to empirically analyze relative importance of training images in four datasets of varying complexity. We find that in some cases, a small subsample is indeed sufficient for training. For other datasets, however, the relative differences in importance are negligible. These results have important implications for active learning on deep networks. Additionally, our analysis method can be used as a general tool to better understand diversity of training examples in datasets.
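The abstract does not spell out the importance measure, so the following is only a minimal sketch of one common way to score examples by their gradients, not the authors' definition. It assumes a binary logistic-regression model in place of a deep network, and it scores each example by the norm of its own loss gradient with respect to the parameters. The function names `gradient_norm_importance` and `select_top_k` and the synthetic data are hypothetical and chosen for illustration.

```python
import numpy as np


def gradient_norm_importance(X, y, w, b):
    """Score each example by the norm of its logistic-loss gradient w.r.t. (w, b).

    Assumption: this gradient-norm score is an illustrative stand-in,
    not the measure defined in the paper.
    """
    logits = X @ w + b
    p = 1.0 / (1.0 + np.exp(-logits))  # predicted probability of class 1
    residual = p - y                   # d(loss_i) / d(logit_i)
    # The gradient for example i is residual_i * [x_i, 1], so its norm is
    # |residual_i| * sqrt(||x_i||^2 + 1).
    return np.abs(residual) * np.sqrt((X ** 2).sum(axis=1) + 1.0)


def select_top_k(scores, k):
    """Return the indices of the k highest-scoring examples."""
    return np.argsort(scores)[::-1][:k]


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = rng.normal(size=5)
y = (X @ true_w > 0).astype(float)

# Score the examples under an untrained model (all-zero parameters).
w = np.zeros(5)
b = 0.0
scores = gradient_norm_importance(X, y, w, b)
subset = select_top_k(scores, 20)
```

In a study like the one described, such scores would be computed for every training image, a model would be retrained on only the highest-ranked subset, and its accuracy would be compared against training on the full dataset.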
Taxonomy
Topics: Machine Learning and Algorithms · Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification
