Data-Efficient Contrastive Self-supervised Learning: Most Beneficial Examples for Supervised Learning Contribute the Least

Siddharth Joshi; Baharan Mirzasoleiman

arXiv:2302.09195 · cs.LG · March 14, 2024 · 1 citation

TL;DR

This paper identifies which examples contribute most to contrastive self-supervised learning: those whose augmentations are, in expectation, most similar to the augmentations of other examples. Training on such subsets reduces the amount of data required without hurting downstream performance.

Contribution

The paper provides the first theoretical analysis linking an example's augmentation similarity to its contribution to contrastive SSL, and demonstrates that selecting subsets on this basis makes SSL more data-efficient.

Findings

01

Excluding 20% of examples from CIFAR100 and 40% from STL10 and TinyImageNet does not harm downstream task performance.

02

Selected subsets outperform random subsets by over 3%.

03

Examples whose augmentations are, in expectation, most similar to those of other examples contribute the most to contrastive SSL.

Abstract

Self-supervised learning (SSL) learns high-quality representations from large pools of unlabeled training data. As datasets grow larger, it becomes crucial to identify the examples that contribute the most to learning such representations. This enables efficient SSL by reducing the volume of data required. Nevertheless, quantifying the value of examples for SSL has remained an open question. In this work, we address this problem for the first time, by proving that examples that contribute the most to contrastive SSL are those that have the most similar augmentations to other examples, in expectation. We provide rigorous guarantees for the generalization performance of contrastive learning on such subsets. Through extensive experiments, we show that we can safely exclude 20% of examples from CIFAR100 and 40% from STL10 and TinyImageNet, without affecting downstream task performance. In…
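The selection criterion stated in the abstract (keep the examples whose augmentations are most similar, in expectation, to those of other examples) can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not describe. It assumes that each example already has a few augmented views embedded as vectors, that cosine similarity between those vectors stands in for augmentation similarity, and that the expectation is approximated by a plain average. The function names `expected_augmentation_similarity` and `select_subset`, and the `keep_fraction` parameter, are hypothetical.

```python
import numpy as np


def expected_augmentation_similarity(views: np.ndarray) -> np.ndarray:
    """Score each example by how similar its views are to other examples' views.

    views: array of shape (n_examples, n_augs, dim) holding one embedding per
    augmented view. Using embeddings and cosine similarity is an assumption of
    this sketch, not something stated in the source.
    """
    n, m, d = views.shape
    # Normalize every view so that dot products are cosine similarities.
    norms = np.linalg.norm(views, axis=-1, keepdims=True)
    unit = views / np.maximum(norms, 1e-12)
    flat = unit.reshape(n * m, d)
    sim = flat @ flat.T  # (n*m, n*m) view-to-view similarities
    # Average over all pairs of views to get example-to-example similarity.
    pair = sim.reshape(n, m, n, m).mean(axis=(1, 3))  # (n, n)
    # Ignore each example's similarity to itself.
    np.fill_diagonal(pair, 0.0)
    return pair.sum(axis=1) / (n - 1)


def select_subset(views: np.ndarray, keep_fraction: float) -> np.ndarray:
    """Return indices of the highest-scoring examples (hypothetical helper)."""
    scores = expected_augmentation_similarity(views)
    k = max(1, int(round(keep_fraction * len(scores))))
    # Sort by descending score and keep the top k examples.
    return np.argsort(-scores)[:k]
```

Calling `select_subset(views, 0.8)` would keep the 80% of examples with the highest average similarity to the rest, mirroring the 20% exclusion reported for CIFAR100. The dense similarity matrix scales quadratically with the number of views, so this form is only practical for small illustrations.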

Code & Models

Videos

Data-Efficient Contrastive Self-supervised Learning: Most Beneficial Examples for Supervised Learning Contribute the Least · SlidesLive

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Cancer-related molecular mechanisms research

Methods: Contrastive Learning