Divide and Contrast: Self-supervised Learning from Uncurated Data
Yonglong Tian, Olivier J. Henaff, Aaron van den Oord

TL;DR
This paper investigates contrastive self-supervised learning on large, uncurated datasets and introduces Divide and Contrast (DnC), a method that improves representation quality by combining contrastive learning with clustering-based hard negative mining.
Contribution
The paper presents DnC, an approach that improves self-supervised learning on uncurated data by alternating between contrastive learning and clustering-based hard negative mining. It targets the shift toward a more diverse, heavy-tailed class distribution found in less curated datasets.
Findings
When pretrained on less curated datasets, DnC greatly improves downstream task performance.
DnC remains competitive with state-of-the-art methods on curated datasets.
Without DnC, the quality of representations learned by contrastive methods drops as the pre-training data becomes less curated.
Abstract
Self-supervised learning holds promise in leveraging large amounts of unlabeled data, however much of its progress has thus far been limited to highly curated pre-training data such as ImageNet. We explore the effects of contrastive learning from larger, less-curated image datasets such as YFCC, and find there is indeed a large difference in the resulting representation quality. We hypothesize that this curation gap is due to a shift in the distribution of image classes -- which is more diverse and heavy-tailed -- resulting in less relevant negative samples to learn from. We test this hypothesis with a new approach, Divide and Contrast (DnC), which alternates between contrastive learning and clustering-based hard negative mining. When pretrained on less curated datasets, DnC greatly improves the performance of self-supervised learning on downstream tasks, while remaining competitive…
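The hard-negative idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract only states that DnC alternates between contrastive learning and clustering-based hard negative mining. The sketch assumes plain k-means for the clustering step and an InfoNCE-style loss for the contrastive step, and all function names (`kmeans`, `info_nce`, `negatives_by_cluster`) are hypothetical. It shows one plausible reading: negatives taken from the anchor's own cluster are more similar to the anchor, so they yield a larger loss than negatives from other clusters.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def kmeans(points, k, iters=10, seed=0):
    """Plain k-means (an assumed choice); returns a cluster index per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        assign = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each non-empty center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor: -log softmax of the positive logit."""
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


def negatives_by_cluster(index, embeddings, assign, same_cluster):
    """Select negatives inside (or outside) the anchor's cluster."""
    return [
        e for j, e in enumerate(embeddings)
        if j != index and (assign[j] == assign[index]) == same_cluster
    ]


# Toy unit-norm embeddings: two tight groups of directions on the unit circle.
angles = [0.00, 0.05, 0.10, 0.15, 1.50, 1.55, 1.60, 1.65]
embeddings = [[math.cos(a), math.sin(a)] for a in angles]

# "Divide": cluster the current embeddings.
assign = kmeans(embeddings, k=2)

# "Contrast": compare in-cluster (hard) and out-of-cluster (easy) negatives.
anchor_index = 0
anchor = embeddings[anchor_index]
positive = [math.cos(0.02), math.sin(0.02)]  # stands in for an augmented view

hard_loss = info_nce(
    anchor, positive,
    negatives_by_cluster(anchor_index, embeddings, assign, same_cluster=True))
easy_loss = info_nce(
    anchor, positive,
    negatives_by_cluster(anchor_index, embeddings, assign, same_cluster=False))

print("loss with in-cluster negatives:    ", hard_loss)
print("loss with out-of-cluster negatives:", easy_loss)
```

In this toy setup the out-of-cluster negatives contribute almost nothing to the loss, which matches the abstract's hypothesis that a diverse, heavy-tailed class distribution yields less relevant negatives. A real pipeline would recompute the clusters as the encoder changes, which is where the alternation comes in.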
Taxonomy
Topics: AI in cancer detection · Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI
Methods: Contrastive Learning
