On Pretraining Data Diversity for Self-Supervised Learning
Hasan Abed Al Kader Hammoud, Tuhin Das, Fabio Pizzati, Philip Torr, Adel Bibi, Bernard Ghanem

TL;DR
This paper studies how pretraining data diversity, measured as the number of unique samples, affects self-supervised learning (SSL) performance under a fixed computational budget. Diversity helps only when the pretraining distribution is close to the downstream data.
Contribution
It provides empirical evidence that greater pretraining data diversity improves SSL performance when the distribution distance to the downstream data is small, and shows that distribution shift remains a limitation even for very diverse web-crawled or diffusion-generated data.
Findings
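Increasing pretraining data diversity improves SSL performance when the pretraining and downstream distributions are close.
Distribution shift remains a challenge even when diversity is made very large through web crawling or diffusion-generated data.
Experiments with seven SSL methods on large-scale datasets such as ImageNet and YFCC100M, amounting to over 200 GPU days, support these findings.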
Abstract
We explore the impact of training with more diverse datasets, characterized by the number of unique samples, on the performance of self-supervised learning (SSL) under a fixed computational budget. Our findings consistently demonstrate that increasing pretraining data diversity enhances SSL performance, albeit only when the distribution distance to the downstream data is minimal. Notably, even with an exceptionally large pretraining data diversity achieved through methods like web crawling or diffusion-generated data, among other ways, the distribution shift remains a challenge. Our experiments are comprehensive with seven SSL methods using large-scale datasets such as ImageNet and YFCC100M amounting to over 200 GPU days. Code and trained models are available at https://github.com/hammoudhasan/DiversitySSL
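The fixed-budget setup in the abstract can be illustrated with a small sketch. It assumes the computational budget is counted as the total number of samples seen during pretraining; the abstract does not state the unit, so this is an assumption. The function names `passes_over_data` and `diversity_sweep` are hypothetical and are not taken from the authors' code.

```python
def passes_over_data(budget_samples: int, unique_samples: int) -> float:
    """Number of passes over the dataset that a fixed budget allows.

    Assumption: the budget is the total number of samples seen in pretraining.
    """
    if budget_samples <= 0 or unique_samples <= 0:
        raise ValueError("budget_samples and unique_samples must be positive")
    return budget_samples / unique_samples


def diversity_sweep(budget_samples: int, unique_counts: list[int]) -> dict[int, float]:
    """Map each dataset size (diversity level) to its number of passes."""
    return {n: passes_over_data(budget_samples, n) for n in unique_counts}


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the paper.
    for n, passes in diversity_sweep(1_000_000, [10_000, 100_000, 1_000_000]).items():
        print(f"{n} unique samples: {passes:g} passes")
```

Under this assumption, more unique samples means fewer repetitions of each sample at the same total cost, which is the trade-off the abstract refers to as diversity under a fixed budget.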
Taxonomy
Topics: Machine Learning and Data Classification · Face and Expression Recognition
