Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking
Yongzhong Xu

TL;DR
This paper investigates the geometric properties of optimization dynamics in transformers during grokking, revealing low-dimensional confinement and transverse curvature growth as key factors in the transition to generalization.
Contribution
It provides a geometric analysis showing that grokking involves low-dimensional dynamics and curvature growth orthogonal to the main trajectory, with causal experiments confirming their roles.
Findings
Training evolves mainly within a low-dimensional subspace.
Curvature grows sharply in directions orthogonal to this subspace before grokking.
Orthogonal gradient flow is necessary for grokking, but increasing curvature alone is insufficient.
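One way to make the first finding concrete is to run PCA on flattened weight checkpoints and read off the share of variance captured by the leading component. The sketch below is illustrative only: the function name `pca_explained_variance` and the synthetic trajectory are assumptions made for this example and do not come from the paper's code or data.

```python
import numpy as np

def pca_explained_variance(checkpoints):
    """Fraction of trajectory variance captured by each principal component.

    checkpoints: array of shape (T, D), one flattened weight snapshot per row.
    """
    X = np.asarray(checkpoints, dtype=float)
    X = X - X.mean(axis=0, keepdims=True)  # center over training time
    # Squared singular values of the centered data are proportional
    # to the variance along each principal component.
    s = np.linalg.svd(X, compute_uv=False)
    var = s ** 2
    return var / var.sum()

# Synthetic trajectory (an assumption, not data from the paper):
# steady drift along one fixed direction plus small isotropic noise.
rng = np.random.default_rng(0)
T, D = 200, 50
direction = rng.normal(size=D)
direction /= np.linalg.norm(direction)
t = np.linspace(0.0, 1.0, T)[:, None]
trajectory = 5.0 * t * direction + 0.05 * rng.normal(size=(T, D))

ratios = pca_explained_variance(trajectory)
print(f"PC1 explained variance: {ratios[0]:.3f}")
```

In a real run, each row would be a snapshot of the attention weights saved during training; a large first ratio would indicate that the trajectory is dominated by a single direction.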
Abstract
Grokking -- the delayed transition from memorization to generalization in small algorithmic tasks -- remains poorly understood. We present a geometric analysis of optimization dynamics in transformers trained on modular arithmetic. PCA of attention weight trajectories reveals that training evolves predominantly within a low-dimensional execution subspace, with a single principal component capturing 68-83% of trajectory variance. To probe loss-landscape geometry, we measure commutator defects -- the non-commutativity of successive gradient steps -- and project them onto this learned subspace. We find that curvature grows sharply in directions orthogonal to the execution subspace while the trajectory remains largely confined to it. Importantly, curvature growth consistently precedes generalization across learning rates and hyperparameter regimes, with the lead time obeying a power law in…
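The commutator-defect idea can be sketched on a toy problem. The code below assumes one plausible reading of "non-commutativity of successive gradient steps": take a step on loss A then loss B, take the same two steps in the opposite order, and subtract the endpoints. The paper's exact definition and normalization may differ. The quadratic losses, the random two-dimensional subspace, and the helper names `commutator_defect` and `split_by_subspace` are all hypothetical.

```python
import numpy as np

def commutator_defect(theta, grad_a, grad_b, lr):
    """Endpoint difference between stepping A-then-B and B-then-A."""
    ab = theta - lr * grad_a(theta)
    ab = ab - lr * grad_b(ab)
    ba = theta - lr * grad_b(theta)
    ba = ba - lr * grad_a(ba)
    return ab - ba

def split_by_subspace(v, basis):
    """Split v into its part inside span(basis) and the orthogonal remainder.

    basis: array of shape (D, k) with orthonormal columns.
    """
    inside = basis @ (basis.T @ v)
    return inside, v - inside

rng = np.random.default_rng(0)
D = 6

def random_psd():
    M = rng.normal(size=(D, D))
    return M @ M.T / D

# Toy quadratic losses 0.5 * theta^T A theta and 0.5 * theta^T B theta,
# standing in for two mini-batch losses (an assumption for illustration).
A, B = random_psd(), random_psd()
theta = rng.normal(size=D)
lr = 0.01

defect = commutator_defect(theta, lambda p: A @ p, lambda p: B @ p, lr)

# Stand-in for the learned execution subspace: a random 2-D orthonormal basis.
basis, _ = np.linalg.qr(rng.normal(size=(D, 2)))
inside, orth = split_by_subspace(defect, basis)
print("in-subspace norm:", np.linalg.norm(inside))
print("orthogonal norm: ", np.linalg.norm(orth))
```

For quadratic losses the defect equals lr^2 (BA - AB) theta exactly, so it vanishes when the two Hessians commute. In practice the basis would come from PCA of the weight trajectory rather than a random draw, and the orthogonal norm would be tracked over training.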
