The Geometric Inductive Bias of Grokking: Bypassing Phase Transitions via Architectural Topology

Alper Yıldırım

arXiv:2603.05228·cs.LG·May 5, 2026


TL;DR

This paper demonstrates that architectural modifications to Transformers, such as normalization and uniform attention, can sharply reduce or eliminate the grokking delay by aligning the architecture with the symmetries of the task.

Contribution

The paper introduces two architectural interventions, a fully bounded spherical topology and a Uniform Attention Ablation, that reduce or eliminate the grokking delay, highlighting the role of architectural priors in training dynamics.

Findings

01 Normalization reduces grokking onset time by over 20x.

02 Uniform attention achieves 100% generalization without delay.

03 Spherical constraints on S5 do not accelerate generalization.

Abstract

Mechanistic interpretability typically relies on post-hoc analysis of trained networks. We instead adopt an interventional approach: testing hypotheses a priori by modifying architectural topology and observing the resulting training dynamics. We study grokking (delayed generalization) in Transformers trained on cyclic modular addition (Z_p), investigating whether specific architectural degrees of freedom prolong the memorization phase. We identify two independent structural factors in standard Transformers: unbounded representational magnitude and data-dependent attention routing. First, we introduce a fully bounded spherical topology that enforces L2 normalization throughout the residual stream and uses an unembedding matrix with a fixed temperature scale. This removes magnitude-based degrees of freedom, reducing grokking onset time by over 20x without weight decay. Second, a Uniform Attention Ablation overrides…
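The two interventions named in the abstract can be illustrated with a minimal NumPy sketch. This is not the author's implementation: the function names, the tensor sizes, the temperature value of 10.0, and the reading of "uniform attention" as a plain average over all positions are assumptions made here for illustration, since the abstract is truncated before it describes the ablation in detail.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Project each vector along the last axis onto the unit sphere."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def uniform_attention(values):
    """Assumed form of a uniform attention ablation: every position
    receives the same unweighted mean of all value vectors, so the
    routing no longer depends on the data."""
    seq_len = values.shape[0]
    weights = np.full((seq_len, seq_len), 1.0 / seq_len)
    return weights @ values

def bounded_logits(h, W_U, temperature=10.0):
    """Cosine-similarity logits scaled by a fixed temperature, so the
    logit magnitude cannot grow beyond `temperature`."""
    return temperature * (l2_normalize(h) @ l2_normalize(W_U).T)

# Hypothetical sizes: modulus p, model width, and two input tokens (a, b).
p, d_model, seq_len = 7, 16, 2
rng = np.random.default_rng(0)

x = l2_normalize(rng.normal(size=(seq_len, d_model)))  # normalized embeddings
mixed = uniform_attention(x)                           # data-independent mixing
h = l2_normalize(x + mixed)                            # renormalize after the residual update
W_U = rng.normal(size=(p, d_model))                    # one unembedding row per residue class
logits = bounded_logits(h, W_U, temperature=10.0)
```

In this sketch every residual-stream vector has unit norm and every logit lies in [-10, 10], which is one way to picture what removing magnitude-based degrees of freedom means; the actual placement of the normalization layers in the paper may differ.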
