The Geometry of Grokking: Norm Minimization on the Zero-Loss Manifold
Tiberiu Musat

TL;DR
This paper explains the grokking phenomenon in neural networks as a process of norm minimization on the zero-loss manifold, providing theoretical proofs and a simplified model that reproduces delayed generalization.
Contribution
It introduces a formal framework linking grokking to constrained optimization and derives a closed-form expression for post-memorization dynamics, supported by experimental validation.
Findings
Gradient descent minimizes weight norm on the zero-loss manifold.
The derived model reproduces delayed generalization and representation learning.
Theoretical proofs are provided in the limit of small learning rates and weight decay.
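The first finding can be illustrated with a minimal sketch. This is a toy two-parameter problem chosen for illustration, not the paper's model: the loss 0.5 * (w1 * w2 - 1)^2 has the hyperbola w1 * w2 = 1 as its zero-loss set, and the function name `train`, the starting point, the learning rate, and the weight decay value are all assumptions. Under these assumptions, gradient descent should first drive the loss close to zero and then drift slowly along the hyperbola toward its minimum-norm point (1, 1), where the squared norm is 2.

```python
# Toy illustration (NOT the paper's model): gradient descent with weight decay
# on the loss 0.5 * (w1 * w2 - 1)^2, whose zero-loss set is the hyperbola w1 * w2 = 1.
# All hyperparameters below are assumed values picked for the illustration.

def train(w1, w2, lr=0.01, weight_decay=0.01, steps=100_000, log_at=(2_000,)):
    """Run plain gradient descent; return snapshots (step, w1, w2, loss, sq_norm)."""
    snapshots = []
    for step in range(1, steps + 1):
        residual = w1 * w2 - 1.0
        # Gradient of the data-fitting term plus the weight decay term.
        g1 = residual * w2 + weight_decay * w1
        g2 = residual * w1 + weight_decay * w2
        w1, w2 = w1 - lr * g1, w2 - lr * g2
        if step in log_at or step == steps:
            loss = 0.5 * (w1 * w2 - 1.0) ** 2
            snapshots.append((step, w1, w2, loss, w1 ** 2 + w2 ** 2))
    return snapshots

snapshots = train(w1=4.0, w2=0.1)
early, late = snapshots[0], snapshots[-1]
for step, w1, w2, loss, sq_norm in snapshots:
    print(f"step {step:>6}: w=({w1:.3f}, {w2:.3f}) loss={loss:.2e} |w|^2={sq_norm:.3f}")
```

In this sketch the early snapshot plays the role of the memorized solution (near-zero loss, large norm), and the late snapshot plays the role of the solution reached after the slow, weight-decay-driven phase. Because the weight decay is finite, the final point sits slightly off the exact zero-loss set.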
Abstract
Grokking is a puzzling phenomenon in neural networks where full generalization occurs only after a substantial delay following the complete memorization of the training data. Previous research has linked this delayed generalization to representation learning driven by weight decay, but the precise underlying dynamics remain elusive. In this paper, we argue that post-memorization learning can be understood through the lens of constrained optimization: gradient descent effectively minimizes the weight norm on the zero-loss manifold. We formally prove this in the limit of infinitesimally small learning rates and weight decay coefficients. To further dissect this regime, we introduce an approximation that decouples the learning dynamics of a subset of parameters from the rest of the network. Applying this framework, we derive a closed-form expression for the post-memorization dynamics of…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The paper is overall well written and easy to follow.
- The topic is definitely timely and relevant and the analysis is interesting. In particular, the approach taken to study isolated dynamics could be relevant in other settings.
- Some very relevant papers are missing from the literature review. In particular, Q1 investigated in this paper has already been quite thoroughly addressed in [1]. [2] also shows that this behaviour extends to other forms of regularization, beyond weight decay. While the technical tools used in this submission to support the hypothesis that grokking is an artifact of regularization (memorization induced by the data-fitting term followed by slow convergence to a generalizing solution driven by regulariz
**Crisp, elegant result:** The overall takeaway is quite elegant and easy to understand.
**Very well-written:** This was an exceptionally clear paper. I was really impressed with how easy this was to read. Similarly, the figures were wonderfully designed.
**Interesting methods:** The separation of dynamical systems into slow variables and fast variables that can be integrated out is a pillar of dynamical systems analysis. Applying it in this setting to understand learning on the zero-loss set
**1. Missing empirical validation.** After solving for the closed-form expression of the post-memorization dynamics of (part of) a grokking model, the authors show that simulating this training process reproduces two of the behaviors associated with grokking: delayed generalization and circular representation learning. But, if I understand correctly, this does not establish the central theoretical claim (lines 113-118):
> While the previous example illustrates our theoretical framework, it does
The general idea, that in the presence of weight decay the dynamics on the zero-loss manifold are determined by the weight decay term, is (trivially) correct.
The actual theorems seem trivial to me, and the presentation is overcomplicated. If one assumes that the dynamics has reached the zero-loss manifold, then by definition further dynamics are not controlled by loss gradients (which vanish) but only by the regularization, if present. However, this does not answer two things which are crucial for the proposed mechanism to work: 1. why should the dynamics reach the zero-loss manifold at all, in the presence of regularization? 2. Assuming it does, w
Obtaining a better theoretical understanding of grokking is an interesting question that has attracted interest in recent years. Explaining grokking by analyzing the dynamics after memorization is a natural approach, and, as the paper argues, weight decay seems to play a key role in the post-memorization dynamics.
The bottom line is that I don’t think the contributions of the paper are significant enough. In Section 4, the main contributions are Theorems 4.9 and 4.13. Theorem 4.9 provides a nice observation, namely, that with a small enough weight decay, once gradient flow reaches the zero-loss manifold, it will stay close to it. Then, Theorem 4.13 establishes that, under some assumptions, around the zero-loss manifold, the gradient of the unregularized loss will be roughly orthogonal to the manifold. H
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Reservoir Computing · Advanced Memory and Neural Computing
