Soft Equivariance Regularization for Invariant Self-Supervised Learning
Joohyung Lee, Changhun Kim, Hyunsu Kim, Kwanhyung Lee, Juho Lee

TL;DR
This paper introduces Soft Equivariance Regularization (SER), a novel method that decouples invariance and equivariance enforcement in self-supervised learning, leading to improved robustness and accuracy on various vision benchmarks.
Contribution
SER is a plug-in regularizer that applies equivariance constraints on intermediate features without altering the final embedding, addressing the trade-off in previous coupled approaches.
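As a rough illustration of what an equivariance constraint on intermediate features could look like, the following is a minimal sketch and not the authors' implementation. It assumes the reviewers' description further down the page (a known group action applied to feature maps, compared with an NT-Xent-style contrastive loss). The names `rot90`, `nt_xent`, and `soft_equivariance_loss` are hypothetical, and feature maps are reduced to single-channel H x W grids with a one-directional loss for brevity.

```python
import math

def rot90(fmap):
    """Rotate an H x W grid by 90 degrees counter-clockwise."""
    h, w = len(fmap), len(fmap[0])
    return [[fmap[j][w - 1 - i] for j in range(h)] for i in range(w)]

def flatten(fmap):
    return [v for row in fmap for v in row]

def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def nt_xent(anchors, targets, temperature=0.2):
    """One-directional NT-Xent-style loss: anchors[i] should match targets[i]."""
    a = [normalize(v) for v in anchors]
    t = [normalize(v) for v in targets]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)

def soft_equivariance_loss(feats_view1, feats_view2, group_action, temperature=0.2):
    """Apply the known relative transform to view-1 features, then contrast with view 2."""
    transported = [flatten(group_action(f)) for f in feats_view1]
    targets = [flatten(f) for f in feats_view2]
    return nt_xent(transported, targets, temperature)

# Toy intermediate features (assumption: view 2 is an exact 90-degree rotation of view 1).
view1 = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 1.0], [0.0, 0.0]]]
view2 = [rot90(f) for f in view1]

aligned = soft_equivariance_loss(view1, view2, rot90)
mismatched = soft_equivariance_loss(view1, view2, lambda f: f)
print(aligned < mismatched)
```

In this toy setup the loss is small when the correct group action is applied to the intermediate features and larger when it is ignored. The final embedding and the base invariance objective are not touched by this term, which is the decoupling the contribution statement describes.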
Findings
SER improves ImageNet-1k linear evaluation by +0.84 Top-1.
SER enhances robustness on ImageNet-C/P by +1.11/+1.22 Top-1.
Applying layer-decoupling to existing methods boosts their accuracy.
Abstract
Self-supervised learning (SSL) typically learns representations invariant to semantic-preserving augmentations. While effective for recognition, enforcing strong invariance can suppress transformation-dependent structure that is useful for robustness to geometric perturbations and spatially sensitive transfer. A growing body of work, therefore, augments invariance-based SSL with equivariance objectives, but these objectives are often imposed on the same final representation. We empirically observe a trade-off in this coupled setting: pushing equivariance regularization toward deeper layers improves equivariance scores but degrades ImageNet-1k linear evaluation, motivating a layer-decoupled design. Motivated by this trade-off, we propose Soft Equivariance Regularization (SER), a plug-in regularizer that decouples where invariance and equivariance are enforced: we keep the base SSL…
Peer Reviews
Decision·ICLR 2026 Poster
- Practical way to use transformers for invariance and equivariance objectives in SSL, considers spatial collapse and distinguishes between invertible and noninvertible augmentations.
- Experiments are done at ImageNet scale.
- The evaluation is pretty comprehensive: considers linear and nonlinear evals, robustness (ImageNet-P), and also detection transfer. They consider all the SOTA SSL baselines I'm aware of and also do extensive ablations on the intermediate layer position.
- Does not require
See questions.
- The proposed approach can be integrated with off-the-shelf SSL methods and architectures, making it widely usable.
- The goal of learning invariant representations while still enforcing equivariance on intermediate representations is an interesting take on equivariant SSL in general.
- The scale of the experiments is appreciated, making it closer to state-of-the-art setups.
- The authors demonstrate performance gains over baseline methods and other equivariant methods, with the caveat that
1) Line 94: “Our approach requires no explicit transformation labels”. If my understanding is correct, access to the exact transformation (generated from the labels) is required, which, while a different constraint, can be even harder to obtain in practice. Perhaps [1] (cited in the paper) should be discussed more, as the authors use a similar idea by applying the transformation to a few dimensions of the representations, as defined by the dataset/transformation.
2) Claims of being the first
- The paper is well written and easy to follow.
- SER does not require transformation labels or predictors; it leverages known group actions on feature maps with a simple NT-Xent loss, keeping the method lightweight and scalable.
- By excluding crop from the set of groups $G$ and keeping it in the invariance path, the method respects group theory while preserving the benefits of strong augmentations for representation quality.
- Improves linear/nonlinear evaluation on ImageNet-1k across MoCo-v3/DINO
- The paper applies rotations, flips, and anisotropic scaling at intermediate feature maps but does not detail how resampling/interpolation, padding, and patch-grid alignment are handled. On discrete token lattices, 90° rotations are exact but scales are not. The chosen interpolation kernel can materially affect equivariance error. A detailed description and sensitivity study are missing.
- SER avoids predicting labels, yet it relies on knowing $g_2g_1^{-1}$ from the augmentation pipeline and on hav
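
The reviewer's point about discrete token lattices can be checked with a small sketch. This is an illustration under stated assumptions, not the paper's resampling code: the helper names are hypothetical, and linear interpolation with aligned endpoints is assumed because the kernel used in the paper is not given in this excerpt.

```python
def rot90(grid):
    """Rotate an H x W grid by 90 degrees counter-clockwise (a pure index permutation)."""
    h, w = len(grid), len(grid[0])
    return [[grid[j][w - 1 - i] for j in range(h)] for i in range(w)]

def resize_linear_1d(values, new_len):
    """Resample a row of tokens to new_len samples with linear interpolation."""
    old_len = len(values)
    if new_len == 1 or old_len == 1:
        return [values[0]] * new_len
    out = []
    for i in range(new_len):
        pos = i * (old_len - 1) / (new_len - 1)  # endpoints map to endpoints
        lo = int(pos)
        hi = min(lo + 1, old_len - 1)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out

grid = [[0.0, 1.0, 0.0, 1.0],
        [1.0, 0.0, 1.0, 0.0]]

# Four quarter-turns reproduce the grid exactly: no values are interpolated.
rotated_back = rot90(rot90(rot90(rot90(grid))))

# Scaling one axis up and back down blends neighbouring tokens, so the round trip is lossy.
row = grid[0]
round_trip = resize_linear_1d(resize_linear_1d(row, 6), len(row))
scale_error = max(abs(a - b) for a, b in zip(row, round_trip))

print(rotated_back == grid, scale_error > 0.0)
```

The rotation only permutes grid positions, so it has an exact inverse on the lattice. The scale round trip leaves a nonzero residual that depends on the kernel, which is why the reviewer asks for a sensitivity study.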
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Face recognition and analysis
