Self-Supervised Learning from Structural Invariance
Yipeng Zhang, Hafez Ghaemi, Jungyoon Lee, Shahab Bakhtiari, Eilif B. Muller, Laurent Charlin

TL;DR
This paper introduces AdaSSL, a flexible self-supervised learning method that models the one-to-many mapping problem using latent variables, improving representation learning in various visual tasks.
Contribution
It proposes a novel variational approach with a regularization term for SSL, addressing the challenge of conditional uncertainty in data pairings.
Findings
AdaSSL improves causal representation learning.
It enhances fine-grained image understanding.
The method benefits world modeling on videos.
Abstract
Joint-embedding self-supervised learning (SSL), the key paradigm for unsupervised representation learning from visual data, learns from invariances between semantically-related data pairs. We study the one-to-many mapping problem in SSL, where each datum may be mapped to multiple valid targets. This arises when data pairs come from naturally occurring generative processes, e.g., successive video frames. We show that existing methods struggle to flexibly capture this conditional uncertainty. As a remedy, we introduce a latent variable to account for this uncertainty and derive a variational lower bound on the mutual information between paired embeddings. Our derivation yields a simple regularization term for standard SSL objectives. The resulting method, which we call AdaSSL, applies to both contrastive and distillation-based SSL objectives, and we empirically show its versatility in…
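To make the idea in the abstract concrete, here is a minimal sketch of a contrastive objective with a latent variable and a regularization term. It is not the paper's AdaSSL objective, whose exact form is not given here. It assumes a Gaussian posterior over a latent r (the symbol one reviewer uses), a standard normal prior, linear maps for the posterior and predictor, and a KL penalty as the regularizer. The names latent_ssl_loss, W_q, W_p, and beta are hypothetical.

```python
import numpy as np

def info_nce(pred, target, temperature=0.1):
    """InfoNCE over a batch: row i of `pred` should match row i of `target`."""
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    target = target / np.linalg.norm(target, axis=1, keepdims=True)
    logits = pred @ target.T / temperature               # (B, B) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def gaussian_kl(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), averaged over the batch."""
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=1)
    return kl.mean()

def latent_ssl_loss(z, z_pos, W_q, W_p, beta=0.1, rng=None):
    """Hypothetical loss: contrastive term plus a KL regularizer on the latent r."""
    rng = np.random.default_rng(0) if rng is None else rng
    d_r = W_q.shape[1] // 2
    # Assumed posterior q(r | z, z+): a linear map to mean and log-variance.
    stats = np.concatenate([z, z_pos], axis=1) @ W_q
    mu, log_var = stats[:, :d_r], stats[:, d_r:]
    # Reparameterized sample of the latent variable r.
    r = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
    # Assumed predictor: a linear map from (z, r) to a prediction of z+.
    pred = np.concatenate([z, r], axis=1) @ W_p
    return info_nce(pred, z_pos) + beta * gaussian_kl(mu, log_var)

# Usage on random embeddings (illustrative shapes only).
rng = np.random.default_rng(0)
B, d, d_r = 8, 16, 4
z = rng.standard_normal((B, d))
z_pos = rng.standard_normal((B, d))
W_q = 0.1 * rng.standard_normal((2 * d, 2 * d_r))
W_p = 0.1 * rng.standard_normal((d + d_r, d))
loss = latent_ssl_loss(z, z_pos, W_q, W_p, rng=rng)
```

In this sketch, the latent r lets the predictor represent several valid targets for the same input, while the KL term limits how much information about the target r can carry. Setting beta to zero would let r simply copy the target, which is why some regularizer of this kind is needed.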
Peer Reviews
Decision: ICLR 2026 Poster
The paper is relatively clear and well explained (see weaknesses for exceptions). The motivation seems sound and the proposed solutions are rationally justified. Experimental results suggest the proposed models are effective.
A main weakness of the paper is its occasional lack of clarity. For the most part the paper is easy to follow, which makes the following very noticeable. The paper would improve greatly if these areas were addressed.
* 33 - this does not seem to be about "distribution shift", i.e. a change to the distribution, but rather that artificial augmentations do not span the full variation in the distribution of natural images
* Prop 2.1 - this is unclear, e.g. in line 156 we already know h maps to the u
- The paper tackles a key problem in self-supervised learning: modelling conditional uncertainty, which arises in many other problems related to causal prediction. The potential applications are therefore numerous: video prediction, video generation, world modelling and latent action prediction, efficient self-supervised learning, etc. The paper could actually do a better job at motivating these applications.
- The ideas presented in the paper are interesting, described in depth, well-motivated,
- The paper is complex to understand, with lots of formalism (the probabilistic framework, Proposition 2.1), complex vocabulary (heteroscedasticity, modular editing, DCI) that is not necessarily introduced, and lots of new vocabulary (CRL, DGP, SSL from structural invariance, Adaptive SSL, natural pairs), all for ideas that are actually fairly simple. I feel like this overcomplication hinders the reading flow and makes it harder to deliver the message it intends to deliver.
- Also,
SSL methods are widely used, and their theoretical study is welcome. The idea of modeling the dependency of the views on some latent variables is interesting.
Although the topic of the paper is interesting, the presentation is hard to follow, without well-identified objectives and with experiments mostly limited to toy examples. The presentation should be clarified. For example:
- Section 2 defines the data generation process in terms of latent variables z and z+, but in Section 3 these disappear, replaced by a latent variable r presumably there to parameterize the predictor, similar to predictive SSL methods.
- I did not understand the difference b
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
