Self-supervised Learning is More Robust to Dataset Imbalance
Hong Liu, Jeff Z. HaoChen, Adrien Gaidon, Tengyu Ma

TL;DR
This paper shows that off-the-shelf self-supervised learning (SSL) representations are more robust to dataset imbalance than supervised representations, and introduces a re-weighted regularization method that further improves SSL on imbalanced datasets.
Contribution
The paper provides a systematic investigation of SSL under dataset imbalance, shows that SSL is more robust to imbalance than supervised learning, and proposes a re-weighted regularization technique to improve SSL representations.
Findings
SSL representations are more robust to class imbalance than supervised ones.
The performance gap between balanced and imbalanced pre-training is significantly smaller for SSL than for supervised learning, across sample sizes and for both in-domain and out-of-domain evaluation.
A re-weighted regularization method improves SSL performance on imbalanced datasets.
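The comparison behind these findings can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the exponential class-count profile is a common convention in long-tailed benchmarks and is assumed here, and the function names `long_tailed_counts` and `relative_gap`, as well as the relative (rather than absolute) form of the gap, are hypothetical choices made for this example.

```python
def long_tailed_counts(n_max, num_classes, imbalance_ratio):
    """Per-class sample counts that decay exponentially.

    The most frequent class gets n_max samples and the rarest gets
    roughly n_max / imbalance_ratio. (Assumed profile, for illustration.)
    """
    if num_classes == 1:
        return [n_max]
    return [
        round(n_max * imbalance_ratio ** (-k / (num_classes - 1)))
        for k in range(num_classes)
    ]


def relative_gap(acc_balanced, acc_imbalanced):
    """Relative accuracy drop when pre-training on imbalanced data
    instead of balanced data of the same total size."""
    return (acc_balanced - acc_imbalanced) / acc_balanced


if __name__ == "__main__":
    counts = long_tailed_counts(n_max=5000, num_classes=10, imbalance_ratio=100)
    print("class counts:", counts)

    # Placeholder accuracies chosen only to exercise the function;
    # they are NOT results reported in the paper.
    print("gap:", relative_gap(acc_balanced=0.80, acc_imbalanced=0.76))
```

Under this framing, a method is "more robust to imbalance" when its gap is smaller, which means the same quantity would be computed once for SSL pre-training and once for supervised pre-training and then compared.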
Abstract
Self-supervised learning (SSL) is a scalable way to learn general visual representations since it learns without labels. However, large-scale unlabeled datasets in the wild often have long-tailed label distributions, where we know little about the behavior of SSL. In this work, we systematically investigate self-supervised learning under dataset imbalance. First, we find out via extensive experiments that off-the-shelf self-supervised representations are already more robust to class imbalance than supervised representations. The performance gap between balanced and imbalanced pre-training with SSL is significantly smaller than the gap with supervised learning, across sample sizes, for both in-domain and, especially, out-of-domain evaluation. Second, towards understanding the robustness of SSL, we hypothesize that SSL learns richer features from frequent data: it may learn…
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Imbalanced Data Classification Techniques · Infrastructure Maintenance and Monitoring
