Soft Equivariance Regularization for Invariant Self-Supervised Learning

Joohyung Lee; Changhun Kim; Hyunsu Kim; Kwanhyung Lee; Juho Lee

arXiv:2603.06693·cs.CV·March 10, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces Soft Equivariance Regularization (SER), a novel method that decouples invariance and equivariance enforcement in self-supervised learning, leading to improved robustness and accuracy on various vision benchmarks.

Contribution

SER is a plug-in regularizer that applies equivariance constraints on intermediate features without altering the final embedding, addressing the trade-off in previous coupled approaches.

Findings

01. SER improves ImageNet-1k linear evaluation by +0.84 Top-1.

02. SER enhances robustness on ImageNet-C/P by +1.11/+1.22 Top-1.

03. Applying layer-decoupling to existing methods boosts their accuracy.

Abstract

Self-supervised learning (SSL) typically learns representations invariant to semantic-preserving augmentations. While effective for recognition, enforcing strong invariance can suppress transformation-dependent structure that is useful for robustness to geometric perturbations and spatially sensitive transfer. A growing body of work therefore augments invariance-based SSL with equivariance objectives, but these objectives are often imposed on the same final representation. We empirically observe a trade-off in this coupled setting: pushing equivariance regularization toward deeper layers improves equivariance scores but degrades ImageNet-1k linear evaluation. Motivated by this trade-off, we propose Soft Equivariance Regularization (SER), a plug-in regularizer that decouples where invariance and equivariance are enforced: we keep the base SSL…
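As a rough illustration of what an equivariance check on intermediate features can look like, the sketch below measures how far a layer is from commuting with 90° rotations. It is not the authors' implementation: the two toy "encoders", the function names, and the mean-squared-error measure are all hypothetical stand-ins, and the only idea taken from the page is that the relative transformation between two views (written $g_2g_1^{-1}$ in the reviews) is known from the augmentation pipeline.

```python
def rot90(grid, k=1):
    """Rotate a square 2D grid counter-clockwise by k quarter turns."""
    for _ in range(k % 4):
        grid = [list(row) for row in zip(*grid)][::-1]
    return [list(row) for row in grid]


def pointwise_encoder(grid):
    # Hypothetical stand-in for an intermediate layer: it acts on each cell
    # independently, so it commutes with rotations of the grid.
    return [[v * v for v in row] for row in grid]


def position_biased_encoder(grid):
    # Hypothetical layer with a column-dependent bias, which breaks
    # rotation equivariance.
    return [[v * v + j for j, v in enumerate(row)] for row in grid]


def equivariance_error(encoder, x, k1, k2):
    """Mean squared gap between view 2's features and view 1's features
    after applying the relative rotation g2 * g1^{-1} to them."""
    f1 = encoder(rot90(x, k1))    # features of the first view, g1 = rot by k1
    f2 = encoder(rot90(x, k2))    # features of the second view, g2 = rot by k2
    aligned = rot90(f1, k2 - k1)  # relative transformation on the feature map
    diffs = [(a - b) ** 2
             for row_a, row_b in zip(aligned, f2)
             for a, b in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


# Toy 4x4 "image" with distinct integer values.
x = [[r * 4 + c for c in range(4)] for r in range(4)]
err_equivariant = equivariance_error(pointwise_encoder, x, k1=1, k2=2)
err_biased = equivariance_error(position_biased_encoder, x, k1=1, k2=2)
```

The pointwise layer yields zero error because it commutes with the rotation, while the position-biased layer does not. A regularizer in this spirit would penalize such a gap at an intermediate layer and leave the final embedding to the base invariance objective.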

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

- Practical way to use transformers for invariance and equivariance objectives in SSL, considers spatial collapse and distinguishes between invertible and noninvertible augmentations.
- Experiments are done at ImageNet scale.
- The evaluation is pretty comprehensive: considers linear and nonlinear evals, robustness (ImageNet-P), and also detection transfer. They consider all the SOTA SSL baselines I'm aware of and also do extensive ablations on the intermediate layer position.
- Does not require

Weaknesses

See questions.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- The proposed approach can be integrated with off-the-shelf SSL methods and architectures, making it widely usable.
- The goal of learning invariant representations while still enforcing intermediate representations to be equivariant is an interesting take on equivariant SSL in general.
- The scale of the experiments is appreciated, making it closer to state-of-the-art setups.
- The authors demonstrate performance gains over baseline methods and other equivariant methods, with the caveat that

Weaknesses

1) Line 94: “Our approach requires no explicit transformation labels”. If my understanding is correct, having access to the exact transformation (generated from the labels) is required, which, while a different constraint, can be even harder to obtain in practice. Perhaps [1] (cited in the paper) should be discussed more, as the authors use a similar idea by applying the transformation to a few dimensions of the representations, as defined by the dataset/transformation.
2) Claims of being the first

Reviewer 03 · Rating 2 · Confidence 4

Strengths

- The paper is well written and easy to follow.
- SER does not require transformation labels or predictors; it leverages known group actions on feature maps with a simple NT-Xent loss, keeping the method lightweight and scalable.
- By excluding crop from the set of groups $G$ and keeping it in the invariance path, the method respects group theory while preserving the benefits of strong augmentations for representation quality.
- Improves linear/nonlinear evaluation on ImageNet-1k across MoCo-v3/DINO

Weaknesses

- The paper applies rotations, flips, and anisotropic scaling at intermediate feature maps but does not detail how resampling/interpolation, padding, and patch-grid alignment are handled. On discrete token lattices, 90° rotations are exact but scales are not. The chosen interpolation kernel can materially affect equivariance error. A detailed description and sensitivity study are missing.
- SER avoids predicting labels, yet it relies on knowing $g_2g_1^{-1}$ from the augmentation pipeline and on hav
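The reviewer's point about discrete token lattices can be made concrete with a small sketch. It assumes a toy 4x4 grid and nearest-neighbour resampling; both the grid and the helper names are illustrative choices, not details taken from the paper. A 90° rotation only permutes cells, so it can be undone exactly, whereas shrinking one axis and stretching it back loses information.

```python
def rot90(grid):
    """Rotate a 2D grid 90 degrees counter-clockwise (a pure permutation of cells)."""
    return [list(row) for row in zip(*grid)][::-1]


def resize_nearest(grid, out_h, out_w):
    """Resample a 2D grid to (out_h, out_w) with nearest-neighbour lookup."""
    in_h, in_w = len(grid), len(grid[0])
    return [[grid[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
            for i in range(out_h)]


def count_mismatches(a, b):
    """Number of cells in which two equally sized grids differ."""
    return sum(va != vb for ra, rb in zip(a, b) for va, vb in zip(ra, rb))


# Toy 4x4 token grid with distinct values.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]

# Four quarter turns compose to the identity, with no resampling error.
rotated_back = rot90(rot90(rot90(rot90(grid))))

# Anisotropic scaling: halve the width, then stretch it back to 4 columns.
scaled_back = resize_nearest(resize_nearest(grid, 4, 2), 4, 4)

rotation_mismatches = count_mismatches(grid, rotated_back)
scaling_mismatches = count_mismatches(grid, scaled_back)
```

Here `rotation_mismatches` is zero while `scaling_mismatches` is not. A different kernel (for example bilinear) would change the size of the scaling error, which is the sensitivity the reviewer asks the authors to study.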

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Face recognition and analysis