Early-Warning Signals of Grokking via Loss-Landscape Geometry
Yongzhong Xu

TL;DR
This paper identifies the commutator defect, a curvature measure from loss-landscape geometry, as a universal early-warning signal for grokking, the sudden transition from memorization to generalization in transformers.
Contribution
It demonstrates that the commutator defect predicts grokking across multiple tasks and is causally linked to the phenomenon, extending prior findings beyond modular arithmetic.
Findings
The commutator defect rises before grokking across tasks.
Amplifying non-commutativity accelerates grokking.
Suppressing gradient flow delays or prevents grokking.
Abstract
Grokking -- the abrupt transition from memorization to generalization after prolonged training -- has been linked to confinement on low-dimensional execution manifolds in modular arithmetic. Whether this mechanism extends beyond arithmetic remains open. We study two sequence-learning benchmarks: SCAN compositional generalization and Dyck-1 depth prediction. Across both tasks and a wide range of learning rates, the commutator defect -- a curvature measure derived from non-commuting gradient updates -- rises well before generalization, with lead times following a superlinear power law (alpha approximately 1.18 for SCAN, approximately 1.13 for Dyck), consistent with prior results on modular arithmetic. Weight-space PCA reveals that spectral concentration is not a universal precursor; the commutator defect is. Causal interventions demonstrate a mechanistic role: amplifying non-commutativity…
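The phrase "a curvature measure derived from non-commuting gradient updates" can be made concrete with a small sketch: take two mini-batch gradient steps in both orders and measure how far apart the resulting parameters land. This is an assumed toy construction, not the paper's definition. The linear least-squares loss, the function names grad_mse and commutator_defect, the learning rate, and the normalization by the product of the two step lengths are all hypothetical choices made for illustration.

```python
import numpy as np


def grad_mse(theta, X, y):
    """Gradient of the loss 0.5 * mean((X @ theta - y) ** 2) w.r.t. theta."""
    return X.T @ (X @ theta - y) / len(y)


def commutator_defect(theta, batch_a, batch_b, lr=0.1, eps=1e-12):
    """Toy commutator defect (hypothetical definition, for illustration only).

    Applies one gradient step per batch in the order A-then-B and B-then-A,
    then returns the distance between the two end points, normalized by the
    product of the two individual step lengths.
    """
    g_a = grad_mse(theta, *batch_a)
    g_b = grad_mse(theta, *batch_b)

    # Path 1: step on batch A, then on batch B.
    theta_ab = theta - lr * g_a
    theta_ab = theta_ab - lr * grad_mse(theta_ab, *batch_b)

    # Path 2: step on batch B, then on batch A.
    theta_ba = theta - lr * g_b
    theta_ba = theta_ba - lr * grad_mse(theta_ba, *batch_a)

    gap = np.linalg.norm(theta_ab - theta_ba)
    scale = (lr * np.linalg.norm(g_a)) * (lr * np.linalg.norm(g_b))
    return gap / (scale + eps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta = rng.normal(size=3)
    batch_a = (rng.normal(size=(8, 3)), rng.normal(size=8))
    batch_b = (rng.normal(size=(8, 3)), rng.normal(size=8))
    print("same batch:      ", commutator_defect(theta, batch_a, batch_a))
    print("different batches:", commutator_defect(theta, batch_a, batch_b))
```

To leading order in the learning rate, the gap between the two paths is lr^2 * (H_B g_A - H_A g_B), where H and g are the per-batch Hessian and gradient. The quantity therefore vanishes when the two batches are identical and grows with the interaction between curvature and gradient direction, which is why it can be read as a curvature probe.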
