Self-supervised Learning is More Robust to Dataset Imbalance

Hong Liu; Jeff Z. HaoChen; Adrien Gaidon; Tengyu Ma

arXiv:2110.05025 · cs.LG · May 24, 2022 · 63 cites

TL;DR

This paper shows that off-the-shelf self-supervised learning (SSL) representations are more robust to dataset imbalance than supervised representations, and introduces a re-weighted regularization technique that further improves SSL on imbalanced datasets.

Contribution

The paper systematically investigates SSL under dataset imbalance, shows that SSL representations are more robust to class imbalance than supervised ones, and proposes a re-weighted regularization technique to improve SSL representations learned from imbalanced data.

Findings

01 Off-the-shelf SSL representations are more robust to class imbalance than supervised representations.

02 The performance gap between balanced and imbalanced pre-training is significantly smaller with SSL than with supervised learning, across sample sizes and for both in-domain and out-of-domain evaluation.

03 A re-weighted regularization technique improves SSL performance on imbalanced datasets.

Abstract

Self-supervised learning (SSL) is a scalable way to learn general visual representations since it learns without labels. However, large-scale unlabeled datasets in the wild often have long-tailed label distributions, where we know little about the behavior of SSL. In this work, we systematically investigate self-supervised learning under dataset imbalance. First, we find out via extensive experiments that off-the-shelf self-supervised representations are already more robust to class imbalance than supervised representations. The performance gap between balanced and imbalanced pre-training with SSL is significantly smaller than the gap with supervised learning, across sample sizes, for both in-domain and, especially, out-of-domain evaluation. Second, towards understanding the robustness of SSL, we hypothesize that SSL learns richer features from frequent data: it may learn…
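The balanced-versus-imbalanced comparison described in the abstract can be made concrete with a small sketch. It assumes an exponentially decaying class-size profile, which is a common way to simulate long-tailed data but is not confirmed by the text above; the function names `long_tailed_counts` and `balance_gap` and the parameter `imbalance_ratio` are hypothetical and chosen only for illustration.

```python
def long_tailed_counts(n_max, num_classes, imbalance_ratio):
    """Per-class sample counts for a simulated long-tailed dataset.

    Assumption (not from the source): counts decay exponentially from
    n_max for the most frequent class down to n_max / imbalance_ratio
    for the rarest class.
    """
    if num_classes < 2:
        return [n_max]
    counts = []
    for i in range(num_classes):
        frac = i / (num_classes - 1)  # 0.0 for the head class, 1.0 for the tail class
        counts.append(max(1, round(n_max * imbalance_ratio ** (-frac))))
    return counts


def balance_gap(metric_balanced, metric_imbalanced):
    """Drop in a downstream metric when pre-training data is imbalanced.

    Both metrics are assumed to come from the same method (SSL or
    supervised); a smaller gap means the method is more robust to imbalance.
    """
    return metric_balanced - metric_imbalanced


# Hypothetical configuration: 10 classes, head class 100x larger than the tail.
counts = long_tailed_counts(n_max=5000, num_classes=10, imbalance_ratio=100)
print(counts)
```

Under this framing, the robustness claim amounts to `balance_gap` being smaller for SSL pre-training than for supervised pre-training when both are evaluated on the same downstream task.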

Code & Models

Repositories

Liuhong99/Imbalanced-SSL (PyTorch)

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Imbalanced Data Classification Techniques · Infrastructure Maintenance and Monitoring