The Geometry of Grokking: Norm Minimization on the Zero-Loss Manifold

Tiberiu Musat

arXiv:2511.01938·cs.LG·January 12, 2026

Open Access · 4 Reviews

TL;DR

This paper explains the grokking phenomenon in neural networks as a process of norm minimization on the zero-loss manifold, providing theoretical proofs and a simplified model that reproduces delayed generalization.

Contribution

It introduces a formal framework linking grokking to constrained optimization and derives a closed-form expression for post-memorization dynamics, supported by experimental validation.

Findings

01

Gradient descent minimizes weight norm on the zero-loss manifold.

02

The derived model reproduces delayed generalization and representation learning.

03

Theoretical proofs are provided in the limit of small learning rates and weight decay.
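The constrained-optimization reading of these findings can be sketched as follows. The notation is assumed for illustration and is not taken from the paper: $\theta$ denotes the weights, $L$ the unregularized training loss, $\lambda$ the weight decay coefficient, and $P_{T_\theta \mathcal{M}}$ the orthogonal projector onto the tangent space of the zero-loss set, which is assumed to be a smooth manifold. The paper's precise statements and assumptions may differ.

```latex
% Notation here is assumed for illustration, not taken from the paper.
\mathcal{M} = \{\theta : L(\theta) = 0\}, \qquad
\dot{\theta} = -\nabla L(\theta) - \lambda\,\theta .

% On \mathcal{M} the loss gradient vanishes (L \ge 0 is minimized there),
% so for small \lambda the slow drift along the manifold is, heuristically,
\dot{\theta} \approx -\lambda\, P_{T_\theta \mathcal{M}}\, \theta
  = -\lambda\, \operatorname{grad}_{\mathcal{M}} \tfrac{1}{2}\lVert\theta\rVert^2 ,

% which is gradient flow for the constrained problem
\min_{\theta \in \mathcal{M}} \tfrac{1}{2}\lVert\theta\rVert^2 .
```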

Abstract

Grokking is a puzzling phenomenon in neural networks where full generalization occurs only after a substantial delay following the complete memorization of the training data. Previous research has linked this delayed generalization to representation learning driven by weight decay, but the precise underlying dynamics remain elusive. In this paper, we argue that post-memorization learning can be understood through the lens of constrained optimization: gradient descent effectively minimizes the weight norm on the zero-loss manifold. We formally prove this in the limit of infinitesimally small learning rates and weight decay coefficients. To further dissect this regime, we introduce an approximation that decouples the learning dynamics of a subset of parameters from the rest of the network. Applying this framework, we derive a closed-form expression for the post-memorization dynamics of…
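The two-phase picture described in the abstract can be illustrated with a toy problem. This is a minimal sketch under stated assumptions, not the paper's model or experiments: a hypothetical two-parameter "network" `f(a, b) = a * b` is fit to a single target of 1, so the zero-loss set is the hyperbola `a * b = 1` and its minimum-norm point on the positive branch is `(1, 1)`. The function name `train`, the initial point, and all hyperparameters are illustrative choices.

```python
def train(a=4.0, b=0.1, lr=0.01, weight_decay=0.01, steps=100_000, fit_tol=1e-3):
    """Gradient descent with L2 weight decay on the toy loss 0.5 * (a*b - 1)**2.

    Returns the final parameters, the first step at which the unregularized
    loss drops below fit_tol, and the squared weight norm at that step.
    """
    fit_step, sq_norm_at_fit = None, None
    for step in range(steps):
        residual = a * b - 1.0
        # Record when the data is first (approximately) fit.
        if fit_step is None and 0.5 * residual**2 < fit_tol:
            fit_step, sq_norm_at_fit = step, a * a + b * b
        # Gradient of the loss plus the weight decay term.
        grad_a = residual * b + weight_decay * a
        grad_b = residual * a + weight_decay * b
        a, b = a - lr * grad_a, b - lr * grad_b
    return a, b, fit_step, sq_norm_at_fit


a, b, fit_step, sq_norm_at_fit = train()
final_sq_norm = a * a + b * b
print(f"fit at step {fit_step}, squared norm then: {sq_norm_at_fit:.2f}")
print(f"final (a, b) = ({a:.4f}, {b:.4f}), squared norm: {final_sq_norm:.2f}")
```

In this sketch the loss is driven close to zero within a few dozen steps while the weight norm is still large. The remaining steps slide the parameters along the hyperbola toward the balanced, near-minimum-norm point. With a nonzero weight decay coefficient the endpoint sits slightly off the exact zero-loss set, which is consistent with the abstract's restriction to the limit of small learning rates and weight decay coefficients.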

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 3

Strengths

- The paper is overall well written and easy to follow.
- The topic is definitely timely and relevant and the analysis is interesting. In particular, the approach taken to study isolated dynamics could be relevant in other settings.

Weaknesses

- Some very relevant papers are missing from the literature review. In particular, Q1 investigated in this paper has already been quite thoroughly addressed in [1]. [2] also shows that this behaviour extends to other forms of regularization, beyond weight decay. While the technical tools used in this submission to support the hypothesis that grokking is an artifact of regularization (memorization induced by the data-fitting term followed by slow convergence to a generalizing solution driven by regulariz

Reviewer 02 · Rating 4 · Confidence 4

Strengths

**Crisp, elegant result:** The overall takeaway is quite elegant and easy to understand.

**Very well-written:** This was an exceptionally clear paper. I was really impressed with how easy this was to read. Similarly, the figures were wonderfully designed.

**Interesting methods:** The separation of dynamical systems into slow variables and fast variables that can be integrated out is a pillar of dynamical systems analysis. Applying it in this setting to understand learning on the zero-loss set

Weaknesses

**1. Missing empirical validation.** After solving for the closed-form expression of the post-memorization dynamics of (part of) a grokking model, the authors show that simulating this training process reproduces two of the behaviors associated with grokking: delayed generalization and circular representation learning. But, if I understand correctly, this does not establish the central theoretical claim (lines 113-118):

> While the previous example illustrates our theoretical framework, it does

Reviewer 03 · Rating 2 · Confidence 4

Strengths

The general idea, that in the presence of weight decay the dynamics on the zero-loss manifold are determined by the weight decay term, is (trivially) correct.

Weaknesses

The actual theorems seem trivial to me, and the presentation is overcomplicated. If one assumes that the dynamics have reached the zero-loss manifold, then by definition further dynamics are not controlled by loss gradients (which vanish) but only by the regularization, if present. However, this does not answer two things which are crucial for the proposed mechanism to work:

1. Why should the dynamics reach the zero-loss manifold at all, in the presence of regularization?
2. Assuming it does, w

Reviewer 04 · Rating 2 · Confidence 3

Strengths

Obtaining a better theoretical understanding of grokking is an interesting question that has attracted interest in recent years. Explaining grokking by analyzing the dynamics after memorization is a natural approach, and, as the paper argues, weight decay seems to play a key role in the post-memorization dynamics.

Weaknesses

The bottom line is that I don’t think the contributions of the paper are significant enough. In Section 4, the main contributions are Theorems 4.9 and 4.13. Theorem 4.9 provides a nice observation, namely, that with a small enough weight decay, once gradient flow reaches the zero-loss manifold, it will stay close to it. Then, Theorem 4.13 establishes that, under some assumptions, around the zero-loss manifold, the gradient of the unregularized loss will be roughly orthogonal to the manifold. H

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Reservoir Computing · Advanced Memory and Neural Computing