A Systematic Empirical Study of Grokking: Depth, Architecture, Activation, and Regularization
Shalima Binta Manir, Anamika Paul Rupa

TL;DR
This paper systematically investigates the grokking phenomenon in neural networks, revealing that optimization stability and regularization, rather than architecture alone, primarily govern delayed generalization, with implications for model design.
Contribution
It provides a controlled, systematic empirical study disentangling architecture, optimization, and regularization effects on grokking, challenging prior architecture-centric views.
Findings
Depth has a non-monotonic effect on grokking.
Transformer and MLP differences largely vanish under matched hyperparameters.
Weight decay critically controls the grokking regime.
Abstract
Grokking, the delayed transition from memorization to generalization in neural networks, remains poorly understood, in part because prior empirical studies confound the roles of architecture, optimization, and regularization. We present a controlled study that systematically disentangles these factors on modular addition (mod 97), with matched and carefully tuned training regimes across models. Our central finding is that grokking dynamics are not primarily determined by architecture, but by interactions between optimization stability and regularization. Specifically, we show: (1) depth has a non-monotonic effect, with depth-4 MLPs consistently failing to grok while depth-8 residual networks recover generalization, demonstrating that depth requires architectural stabilization; (2) the apparent gap between Transformers and MLPs largely disappears (1.11 delay)…
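For readers unfamiliar with the task, the following is a minimal sketch of how a modular addition (mod 97) dataset might be built. Only the modulus comes from the abstract; the function name, the 50% train fraction, the seed, and the random split are assumptions made for illustration and are not taken from the paper.

```python
import random

P = 97  # modulus stated in the abstract


def make_modular_addition_dataset(p=P, train_fraction=0.5, seed=0):
    """Enumerate every pair (a, b) with label (a + b) mod p, then split.

    train_fraction and seed are illustrative assumptions, not values
    reported in the paper.
    """
    examples = [((a, b), (a + b) % p) for a in range(p) for b in range(p)]
    rng = random.Random(seed)
    rng.shuffle(examples)  # random split over the full p * p table
    n_train = int(train_fraction * len(examples))
    return examples[:n_train], examples[n_train:]


train, test = make_modular_addition_dataset()
print(len(train), len(test))
```

In this setup, grokking would show up as training accuracy saturating early while accuracy on the held-out pairs stays near chance for many more steps before rising.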
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Memory and Neural Computing · Neural Networks and Reservoir Computing
