TL;DR
This paper analyzes the stop gradient (SG) and exponential moving average (EMA) procedures used in non-contrastive self-supervised learning. Viewing them from the dual perspectives of optimization and dynamical systems, it explains how they avoid representation collapse and supports the analysis with empirical evidence.
Contribution
It provides a theoretical account of how these procedures prevent collapse: they do not optimize the original objective or any other smooth function, yet in linear models the equilibria of the associated dynamical systems are asymptotically stable.
Findings
Stop gradient and EMA avoid collapse without optimizing the original objective.
In linear models, minimizing the original objective leads to collapse.
Equilibria of the associated dynamical systems are explicitly characterized and are asymptotically stable.
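The first two findings can be illustrated on a toy problem. The sketch below is not the paper's model or proof: the objective, the scalar predictor `p`, the isotropic augmentation noise, and all function names are assumptions chosen to keep the algebra short. It shows (a) plain gradient descent on a linear objective of the form 0.5 * E||W x1 - W x2||^2 driving W to zero, and (b) a stop-gradient update field whose Jacobian is not symmetric, so it cannot be the gradient of any smooth function, since such a gradient would have a symmetric Hessian as its Jacobian.

```python
import numpy as np

def full_gradient_descent(W, cov_diff, lr, steps):
    """Gradient descent on the toy objective L(W) = 0.5 * tr(W C W^T),
    i.e. 0.5 * E||W x1 - W x2||^2 with C = E[(x1 - x2)(x1 - x2)^T].
    No stop gradient or EMA is used here."""
    for _ in range(steps):
        grad = W @ cov_diff      # dL/dW = W C, valid because C is symmetric
        W = W - lr * grad
    return W

def sg_field(theta, x1, x2):
    """Stop-gradient update direction for scalar parameters theta = (w, p)
    under the toy loss 0.5 * (p*w*x1 - sg(w*x2))^2. The target w*x2 is
    treated as a constant when differentiating."""
    w, p = theta
    residual = p * w * x1 - w * x2
    return np.array([residual * p * x1,   # derivative w.r.t. w, target held fixed
                     residual * w * x1])  # derivative w.r.t. p

def jacobian(field, theta, eps=1e-6):
    """Central finite-difference Jacobian of a vector field at theta."""
    n = theta.size
    J = np.zeros((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = eps
        J[:, j] = (field(theta + step) - field(theta - step)) / (2 * eps)
    return J

# (a) Collapse: with isotropic augmentation noise of variance sigma2 per view,
# C = 2 * sigma2 * I, so each step rescales W by (1 - 2 * lr * sigma2).
rng = np.random.default_rng(0)
W0 = rng.normal(size=(3, 4))
sigma2 = 0.5
C = 2 * sigma2 * np.eye(4)
W_final = full_gradient_descent(W0, C, lr=0.1, steps=500)
collapsed_norm = np.linalg.norm(W_final)

# (b) Non-gradient field: compare the two off-diagonal Jacobian entries.
w, p, x1, x2 = 0.7, 1.3, 1.0, 0.8
J = jacobian(lambda th: sg_field(th, x1, x2), np.array([w, p]))
asymmetry = J[0, 1] - J[1, 0]   # analytically equal to w * x1 * x2

print("norm of W after full gradient descent:", collapsed_norm)
print("Jacobian asymmetry of the SG field:", asymmetry)
```

In this toy case the asymmetry equals w * x1 * x2, so it vanishes only at degenerate points. The sketch says nothing about where the stop-gradient dynamics converge; that is the subject of the paper's equilibrium and stability analysis.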
Abstract
The {\em stop gradient} and {\em exponential moving average} iterative procedures are commonly used in non-contrastive approaches to self-supervised learning to avoid representation collapse, with excellent performance in downstream applications in practice. This presentation investigates these procedures from the dual viewpoints of optimization and dynamical systems. We show that, in general, although they {\em do not} optimize the original objective, or {\em any} other smooth function, they {\em do} avoid collapse. Following~\citet{Tian21}, but without any of the extra assumptions used in their proofs, we then show using a dynamical system perspective that, in the linear case, minimizing the original objective function without the use of a stop gradient or exponential moving average {\em always} leads to collapse. Conversely, we characterize explicitly the equilibria of the dynamical…
Peer Reviews
Decision·ICLR 2026 Poster
* This work clearly shows the influence of SG and EMA from the theoretical perspective and conducts experiments to support their results.
* It would be better to provide a proof sketch for Proposition 2.2, which makes it easier for readers to understand why with SG and EMA, the $\bar{P}$ and $\bar{Q}$ are not the gradient fields of any smooth function.
* We know that the analysis for the nonlinear setting is hard. However, it would be better to discuss how to extend to a nonlinear NN (even a 2-layer Softmax NN).
The topic of the paper is good. It tries to explain non-contrastive self-supervised methods like BYOL/SimSiam from both optimization theory and dynamical systems theory, which "learn good representations without seeming to have an objective function." This combination is innovative in the research of SSL theory. In particular, the perspective of dynamical systems (continuous-time analysis, equilibrium points, and stability proofs) provides new insights into the dynamics of self-supervised traini
1. The paper introduces several symbols on the same page, and these symbols represent different meanings in different contexts. This high density of symbolic definitions makes readers need to frequently backtrack and compare with the previous texts during reading, which increases the burden of understanding and is not conducive to quickly grasping the core deduction logic.
2. I understand the authors’ intention to introduce examples of SG and EMA directly in the introduction. However, such a st
1. Provides a clear theoretical foundation explaining why stop-gradient and EMA prevent representation collapse.
2. Offers rigorous analysis from both optimization and dynamical systems perspectives.
3. Empirical experiments effectively support the theoretical findings on real and synthetic data.
1. The statement in Lines 156–157 (“The SG and EMA training procedures have been designed to avoid collapse in self-supervised learning”) is not entirely accurate. Prior work, such as SimSiam, has already shown that Stop-Gradient alone can prevent collapse without EMA. Moreover, BYOL’s EMA inherently includes a Stop-Gradient operation, and the true collapse-prevention factor lies in the asymmetric *predictor* component — without it, SG or EMA alone would fail.
2. While the paper provides some t
