A Theoretical Framework for Grokking: Interpolation followed by Riemannian Norm Minimisation
Etienne Boursier, Scott Pesme, Radu-Alexandru Dragomir

TL;DR
This paper provides a theoretical explanation for the grokking phenomenon in deep learning, showing how weight decay induces a two-phase training dynamic involving rapid convergence followed by slow norm minimization, leading to generalization improvements.
Contribution
The paper introduces a mathematical framework explaining grokking as a two-phase process driven by weight decay, connecting gradient flow dynamics to generalization behavior.
Findings
The initial fast phase follows the unregularised gradient flow to a manifold of critical points of the training loss.
The slow drift phase then minimises the parameter norm along that manifold via a Riemannian gradient flow.
Empirical validation on synthetic tasks supports the theoretical model.
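The two phases listed above can be written out as a short worked sketch. The symbols used here ($F$ for the training loss, $\lambda$ for the weight decay strength, $\mathcal{M}$ for the manifold of critical points, $P_{T_w\mathcal{M}}$ for the orthogonal projection onto its tangent space) and the $\tfrac{\lambda}{2}\|w\|_2^2$ normalisation of the penalty are assumptions made for illustration, not notation confirmed by this summary.

```latex
% Gradient flow with weight decay (assumed penalty: (\lambda/2) \|w\|_2^2)
\dot{w}^{\lambda}(t) = -\nabla F\big(w^{\lambda}(t)\big) - \lambda\, w^{\lambda}(t)

% Fast phase: for fixed t and \lambda \to 0, the trajectory tracks the
% unregularised flow, which is assumed to converge to a point of \mathcal{M}
\dot{w}(t) = -\nabla F\big(w(t)\big), \qquad
w(t) \xrightarrow[t \to \infty]{} w_{\infty} \in \mathcal{M}

% Slow phase: in rescaled time s = \lambda t, the limit stays on \mathcal{M}
% and follows the Riemannian gradient flow of the squared norm
\frac{d\tilde{w}}{ds}(s)
  = -\operatorname{grad}_{\mathcal{M}} \Big( \tfrac{1}{2} \|\tilde{w}(s)\|_2^2 \Big)
  = -P_{T_{\tilde{w}(s)}\mathcal{M}}\, \tilde{w}(s),
\qquad \tilde{w}(0) = w_{\infty}
```

Because the slow dynamics only appear after rescaling time by $\lambda$, the norm reduction becomes visible at times of order $1/\lambda$, long after the training loss has stopped changing appreciably.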
Abstract
We study the dynamics of gradient flow with small weight decay on general training losses $F$. Under mild regularity assumptions and assuming convergence of the unregularised gradient flow, we show that the trajectory with weight decay $\lambda$ exhibits a two-phase behaviour as $\lambda \to 0$. During the initial fast phase, the trajectory follows the unregularised gradient flow and converges to a manifold of critical points of $F$. Then, at time of order $1/\lambda$, the trajectory enters a slow drift phase and follows a Riemannian gradient flow minimising the $\ell_2$-norm of the parameters. This purely optimisation-based phenomenon offers a natural explanation for the \textit{grokking} effect observed in deep learning, where the training loss rapidly reaches zero while the test loss plateaus for an extended period before suddenly improving. We argue that…
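The two-phase behaviour can be illustrated numerically on a toy problem. The sketch below is not taken from the paper: the loss `0.5 * (w1 * w2 - 1)**2`, the initial point, the weight decay value `lam` and the explicit Euler discretisation of the gradient flow are all assumptions chosen for illustration. For this loss the global minimisers form the hyperbola `w1 * w2 = 1`, whose minimum-norm points are `(1, 1)` and `(-1, -1)` with squared norm 2.

```python
def loss(w1, w2):
    """Toy training loss (hypothetical): zero exactly on the hyperbola w1 * w2 = 1."""
    return 0.5 * (w1 * w2 - 1.0) ** 2


def simulate(lam=1e-2, dt=2e-2, t_end=1000.0, w0=(3.0, 0.2), record_every=500):
    """Explicit Euler discretisation of  dw/dt = -grad F(w) - lam * w.

    Returns a list of (time, loss, squared_norm) tuples, recorded at t = 0
    and then every `record_every` steps.
    """
    w1, w2 = w0
    n_steps = int(round(t_end / dt))
    history = [(0.0, loss(w1, w2), w1 * w1 + w2 * w2)]
    for k in range(1, n_steps + 1):
        r = w1 * w2 - 1.0          # residual of the toy loss
        g1 = r * w2 + lam * w1     # d/dw1 of F(w) + (lam / 2) * ||w||^2
        g2 = r * w1 + lam * w2     # d/dw2 of F(w) + (lam / 2) * ||w||^2
        w1, w2 = w1 - dt * g1, w2 - dt * g2
        if k % record_every == 0:
            history.append((k * dt, loss(w1, w2), w1 * w1 + w2 * w2))
    return history


if __name__ == "__main__":
    for t, f, sq in simulate()[::10]:
        print(f"t = {t:7.1f}   loss = {f:.2e}   ||w||^2 = {sq:.3f}")
```

With these settings the loss is already close to zero by `t = 10`, while the squared norm is still far above 2. The norm then decays over a time span of order `1 / lam`, with the iterate sliding along the hyperbola towards the balanced point, which mirrors the interpolation phase followed by the slow norm-minimising drift described in the abstract.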
Taxonomy
Topics: Advanced Numerical Analysis Techniques
Methods: Weight Decay
