A Theoretical Framework for Grokking: Interpolation followed by Riemannian Norm Minimisation

Etienne Boursier; Scott Pesme; Radu-Alexandru Dragomir

arXiv:2505.20172·cs.LG·November 6, 2025

TL;DR

This paper gives a theoretical explanation for the grokking phenomenon in deep learning. With small weight decay, gradient flow training proceeds in two phases: a fast convergence to a manifold of critical points of the training loss, followed by a slow minimisation of the parameter norm along that manifold, which can account for the delayed improvement in test loss.

Contribution

The paper introduces a mathematical framework explaining grokking as a two-phase process driven by weight decay, connecting gradient flow dynamics to generalization behavior.

Findings

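01. In the initial fast phase, the trajectory follows the unregularised gradient flow and converges to a manifold of critical points of the training loss.

02. In the slow drift phase, at times of order $1/\lambda$, the trajectory follows a Riemannian gradient flow that minimises the $\ell_{2}$-norm of the parameters.

03. Empirical validation on synthetic tasks supports the theoretical model.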
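
The two phases can be written out as a short sketch. The notation $w^{\lambda}$, $\mathcal{M}$, $\tau$ and $P_{T_{w}\mathcal{M}}$ is introduced here for illustration and is not taken from the listing; the scaling of the penalty as $\frac{\lambda}{2}\|w\|_{2}^{2}$ is likewise an assumption, and the precise statement and hypotheses are in the paper.

```latex
% Gradient flow with weight decay \lambda (assumed penalty \tfrac{\lambda}{2}\|w\|_2^2):
\dot{w}^{\lambda}(t) = -\nabla F\big(w^{\lambda}(t)\big) - \lambda\, w^{\lambda}(t)

% Fast phase (t = O(1)): w^{\lambda} stays close to the unregularised flow,
% which is assumed to converge to the manifold of critical points
\dot{w}^{0}(t) = -\nabla F\big(w^{0}(t)\big), \qquad
w^{0}(t) \to w^{0}(\infty) \in \mathcal{M} = \{\, w : \nabla F(w) = 0 \,\}

% Slow phase (t = \tau/\lambda): drift along \mathcal{M} following the
% Riemannian gradient flow of the squared norm restricted to \mathcal{M}
\frac{d\tilde{w}}{d\tau}(\tau)
  = -P_{T_{\tilde{w}(\tau)}\mathcal{M}}\big(\tilde{w}(\tau)\big)
  = -\operatorname{grad}_{\mathcal{M}} \Big(\tfrac{1}{2}\|\cdot\|_{2}^{2}\Big)\big(\tilde{w}(\tau)\big)
```

Here $P_{T_{w}\mathcal{M}}$ denotes the orthogonal projection onto the tangent space of $\mathcal{M}$ at $w$: the loss gradient vanishes on $\mathcal{M}$, so only the tangential part of the weight-decay term $-\lambda w$ moves the iterate.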

Abstract

We study the dynamics of gradient flow with small weight decay on general training losses $F : \mathbb{R}^{d} \to \mathbb{R}$. Under mild regularity assumptions and assuming convergence of the unregularised gradient flow, we show that the trajectory with weight decay $\lambda$ exhibits a two-phase behaviour as $\lambda \to 0$. During the initial fast phase, the trajectory follows the unregularised gradient flow and converges to a manifold of critical points of $F$. Then, at time of order $1/\lambda$, the trajectory enters a slow drift phase and follows a Riemannian gradient flow minimising the $\ell_{2}$-norm of the parameters. This purely optimisation-based phenomenon offers a natural explanation for the grokking effect observed in deep learning, where the training loss rapidly reaches zero while the test loss plateaus for an extended period before suddenly improving. We argue that…
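
The two-phase behaviour described in the abstract can be illustrated numerically. The following is a minimal sketch, not the authors' experiment: it assumes a hypothetical toy loss $F(w_1, w_2) = \frac{1}{2}(w_1 w_2 - 1)^2$, whose zero-loss set is the hyperbola $w_1 w_2 = 1$, and approximates the gradient flow with explicit Euler steps. The function names, initial point, step size and value of $\lambda$ are all illustrative choices.

```python
def loss(w1, w2):
    """Toy loss (assumption, not from the paper): F(w) = 0.5 * (w1*w2 - 1)^2."""
    return 0.5 * (w1 * w2 - 1.0) ** 2


def simulate(lam, checkpoints, w0=(2.0, 0.1), dt=0.01):
    """Explicit-Euler approximation of gradient flow with weight decay `lam`.

    Integrates dw/dt = -grad F(w) - lam * w and returns one tuple
    (time, loss, squared_norm) per requested checkpoint time.
    """
    w1, w2 = w0
    targets = sorted(round(t / dt) for t in checkpoints)
    snapshots = []
    k = 0
    for target in targets:
        while k < target:
            r = w1 * w2 - 1.0            # residual of the toy loss
            g1, g2 = r * w2, r * w1      # gradient of F at (w1, w2)
            w1, w2 = (w1 - dt * (g1 + lam * w1),
                      w2 - dt * (g2 + lam * w2))
            k += 1
        snapshots.append((k * dt, loss(w1, w2), w1 * w1 + w2 * w2))
    return snapshots


if __name__ == "__main__":
    lam = 0.01
    # One checkpoint after the fast phase (t = O(1)) and one deep in the
    # slow phase (t = 10 / lam).
    for t, f, sq_norm in simulate(lam, [10.0, 10.0 / lam]):
        print(f"t={t:8.1f}  loss={f:.2e}  squared norm={sq_norm:.3f}")
```

In this toy problem the loss is already close to zero at the first checkpoint while the squared norm remains well above 2, the minimum of $\|w\|_{2}^{2}$ on the hyperbola (attained at $w = (1, 1)$). Only at times of order $1/\lambda$ does the iterate slide along the hyperbola towards that minimum-norm point. The sketch has no held-out data, so it illustrates the norm-minimising drift only, not a test-loss curve.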

Videos

A Theoretical Framework for Grokking: Interpolation followed by Riemannian Norm Minimisation (SlidesLive)

Taxonomy

Topics: Advanced Numerical Analysis Techniques

Methods: Weight Decay