A Systematic Empirical Study of Grokking: Depth, Architecture, Activation, and Regularization

Shalima Binta Manir; Anamika Paul Rupa

arXiv:2603.25009 · cs.LG · March 27, 2026

TL;DR

A controlled study of grokking on modular addition (mod 97) finds that delayed generalization is governed primarily by the interaction of optimization stability and regularization, rather than by architecture alone.

Contribution

It provides a controlled, systematic empirical study disentangling architecture, optimization, and regularization effects on grokking, challenging prior architecture-centric views.

Findings

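01 Depth has a non-monotonic effect on grokking: depth-4 MLPs consistently fail to grok, while depth-8 residual networks recover generalization.

02 The apparent gap between Transformers and MLPs largely disappears under matched hyperparameters (1.11× delay).

03 Weight decay critically controls the grokking regime.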
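Comparing delayed generalization across models requires some operational definition of the delay. The sketch below shows one common convention, not necessarily the one used in the paper: the number of logged steps between training accuracy and test accuracy first reaching a threshold. The function names, the 0.99 threshold, and the toy accuracy curves are all assumptions made for illustration.

```python
def first_step_reaching(accuracies, threshold):
    """Return the index of the first logged step with accuracy >= threshold, or None."""
    for step, acc in enumerate(accuracies):
        if acc >= threshold:
            return step
    return None


def grokking_delay(train_acc, test_acc, threshold=0.99):
    """Steps between fitting the training set and generalizing to the test set.

    Hypothetical metric: returns None if either curve never reaches the threshold
    (for example, a run that memorizes but never groks).
    """
    t_train = first_step_reaching(train_acc, threshold)
    t_test = first_step_reaching(test_acc, threshold)
    if t_train is None or t_test is None:
        return None
    return t_test - t_train


# Toy curves (invented for illustration, not results from the paper).
train_curve = [0.50, 1.00, 1.00, 1.00, 1.00]
test_curve = [0.10, 0.10, 0.20, 0.50, 0.995]
delay = grokking_delay(train_curve, test_curve)
```

Under this convention, a ratio of delays between two models trained with matched settings gives a single number for how much later one generalizes than the other.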

Abstract

Grokking, the delayed transition from memorization to generalization in neural networks, remains poorly understood, in part because prior empirical studies confound the roles of architecture, optimization, and regularization. We present a controlled study that systematically disentangles these factors on modular addition (mod 97), with matched and carefully tuned training regimes across models. Our central finding is that grokking dynamics are not primarily determined by architecture, but by interactions between optimization stability and regularization. Specifically, we show: (1) depth has a non-monotonic effect, with depth-4 MLPs consistently failing to grok while depth-8 residual networks recover generalization, demonstrating that depth requires architectural stabilization; (2) the apparent gap between Transformers and MLPs largely disappears (1.11× delay)…
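For readers unfamiliar with the task, the modular addition (mod 97) setup can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the 50% training fraction, and the seed are assumptions, since the abstract does not state the split.

```python
import random


def make_modular_addition_dataset(p=97, train_fraction=0.5, seed=0):
    """Enumerate every pair (a, b) with label (a + b) mod p, then split at random.

    train_fraction and seed are illustrative assumptions, not values from the paper.
    """
    examples = [((a, b), (a + b) % p) for a in range(p) for b in range(p)]
    rng = random.Random(seed)  # local RNG so the split is reproducible
    rng.shuffle(examples)
    n_train = int(train_fraction * len(examples))
    return examples[:n_train], examples[n_train:]


train_set, test_set = make_modular_addition_dataset()
```

The full table has 97 × 97 = 9409 input pairs, so the held-out pairs test whether a model has learned the addition rule rather than memorized the training entries.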

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Memory and Neural Computing · Neural Networks and Reservoir Computing