The Mechanism of Prediction Head in Non-contrastive Self-supervised Learning
Zixin Wen, Yuanzhi Li

TL;DR
This paper investigates, both empirically and theoretically, how the prediction head in non-contrastive self-supervised learning methods such as BYOL lets neural networks learn comprehensive representations even though trivial collapsed solutions exist.
Contribution
It provides the first end-to-end theoretical framework explaining the role of the prediction head in non-contrastive learning, highlighting substitution and acceleration effects that prevent feature collapse.
Findings
A prediction head initialized as the identity matrix, with only its off-diagonal entries trainable, still lets the network learn competitive representations.
Substitution and acceleration effects facilitate comprehensive feature learning.
First theoretical guarantee for nonlinear neural networks with trainable prediction head.
Abstract
Recently the surprising discovery of the Bootstrap Your Own Latent (BYOL) method by Grill et al. shows the negative term in contrastive loss can be removed if we add the so-called prediction head to the network. This initiated the research of non-contrastive self-supervised learning. It is mysterious why even when there exist trivial collapsed global optimal solutions, neural networks trained by (stochastic) gradient descent can still learn competitive representations. This phenomenon is a typical example of implicit bias in deep learning and remains little understood. In this work, we present our empirical and theoretical discoveries on non-contrastive self-supervised learning. Empirically, we find that when the prediction head is initialized as an identity matrix with only its off-diagonal entries being trainable, the network can learn competitive representations even though the…
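The head parameterization described in the abstract (identity initialization, only off-diagonal entries trainable) can be illustrated with a small NumPy sketch. This is not the authors' code. The function names, the squared-error loss (used here in place of BYOL's normalized objective), the random toy embeddings, and the learning rate are all assumptions made for illustration; the target embedding is treated as a constant, mimicking a stop-gradient.

```python
import numpy as np

def init_prediction_head(dim):
    """Return an identity-initialized head and a mask of trainable (off-diagonal) entries."""
    W = np.eye(dim)
    trainable_mask = 1.0 - np.eye(dim)  # 0 on the diagonal, 1 elsewhere
    return W, trainable_mask

def loss_and_grad(W, z_online, z_target):
    """Squared error between W @ z_online and a fixed target (assumed loss, not BYOL's)."""
    diff = z_online @ W.T - z_target               # shape (batch, dim)
    loss = 0.5 * np.mean(np.sum(diff ** 2, axis=1))
    grad = diff.T @ z_online / z_online.shape[0]   # dL/dW, shape (dim, dim)
    return loss, grad

def masked_sgd_step(W, grad, trainable_mask, lr):
    """Update only the off-diagonal entries; the diagonal stays fixed at 1."""
    return W - lr * grad * trainable_mask

# Toy usage with random embeddings (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
dim = 4
W, mask = init_prediction_head(dim)
z_online = rng.normal(size=(8, dim))
z_target = rng.normal(size=(8, dim))
for _ in range(10):
    loss, grad = loss_and_grad(W, z_online, z_target)
    W = masked_sgd_step(W, grad, mask, lr=0.1)
```

Masking the gradient rather than reparameterizing the matrix keeps the sketch short: the diagonal never receives an update, so it remains exactly the identity's diagonal throughout training.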
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Neural Networks and Applications
