The Geometric Inductive Bias of Grokking: Bypassing Phase Transitions via Architectural Topology
Alper Yıldırım

TL;DR
This paper demonstrates that architectural modifications to Transformers, such as normalization and uniform attention, can shorten or eliminate the grokking delay by aligning the architecture with task symmetries.
Contribution
It introduces architectural interventions that eliminate the grokking delay, highlighting the importance of architectural priors in training dynamics.
Findings
Normalization reduces grokking onset time by over 20x.
Uniform attention achieves 100% generalization without delay.
Spherical constraints on S5 do not accelerate generalization.
Abstract
Mechanistic interpretability typically relies on post-hoc analysis of trained networks. We instead adopt an interventional approach: testing hypotheses a priori by modifying architectural topology and observing the resulting training dynamics. We study grokking, the delayed generalization of Transformers trained on cyclic modular addition (Zp), and investigate whether specific architectural degrees of freedom prolong the memorization phase. We identify two independent structural factors in standard Transformers: unbounded representational magnitude and data-dependent attention routing. First, we introduce a fully bounded spherical topology that enforces L2 normalization throughout the residual stream and uses an unembedding matrix with a fixed temperature scale. This removes magnitude-based degrees of freedom, reducing grokking onset time by over 20x without weight decay. Second, a Uniform Attention Ablation overrides…
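The first intervention described in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's implementation. The function names, the choice to renormalize after every residual addition, the cosine-similarity form of the unembedding, and the temperature value `tau=0.1` are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Project vectors onto the unit sphere along `axis`."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def spherical_residual_update(h, delta):
    """Add a sublayer output to the residual stream, then renormalize.

    Assumption: normalization is applied after each residual addition,
    so the stream never leaves the unit sphere.
    """
    return l2_normalize(h + delta)

def fixed_temperature_logits(h, W_U, tau=0.1):
    """Cosine-similarity logits divided by a fixed, non-learned temperature.

    h:   (batch, d_model) residual-stream vectors
    W_U: (d_model, p) unembedding matrix, one column per output class
    tau: hypothetical temperature value, chosen here only for illustration
    """
    h_unit = l2_normalize(h, axis=-1)
    W_unit = l2_normalize(W_U, axis=0)  # normalize each class column
    return (h_unit @ W_unit) / tau
```

Because both factors are unit vectors, every logit lies in the interval [-1/tau, 1/tau], so in this sketch the network cannot sharpen its predictions by growing the norm of its activations or unembedding weights. That is one way to read "removing magnitude-based degrees of freedom".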
