Early-Warning Signals of Grokking via Loss-Landscape Geometry

Yongzhong Xu

arXiv:2602.16967 · cs.LG · April 6, 2026

TL;DR

This paper identifies the commutator defect, a curvature measure from loss-landscape geometry, as a universal early-warning signal for grokking, the sudden transition from memorization to generalization in transformers.

Contribution

It demonstrates that the commutator defect predicts grokking across multiple tasks and is causally linked to the phenomenon, extending prior findings beyond modular arithmetic.

Findings

01  The commutator defect rises before grokking across tasks.

02  Amplifying non-commutativity accelerates grokking.

03  Suppressing gradient flow delays or prevents grokking.

Abstract

Grokking, the abrupt transition from memorization to generalization after prolonged training, has been linked to confinement on low-dimensional execution manifolds in modular arithmetic. Whether this mechanism extends beyond arithmetic remains open. We study two sequence-learning benchmarks: SCAN compositional generalization and Dyck-1 depth prediction. Across both tasks and a wide range of learning rates, the commutator defect (a curvature measure derived from non-commuting gradient updates) rises well before generalization, with lead times following a superlinear power law (α ≈ 1.18 for SCAN, α ≈ 1.13 for Dyck), consistent with prior results on modular arithmetic. Weight-space PCA reveals that spectral concentration is not a universal precursor; the commutator defect is. Causal interventions demonstrate a mechanistic role: amplifying non-commutativity…
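The abstract describes the commutator defect only as a curvature measure derived from non-commuting gradient updates, so the following is a minimal sketch of one way such a quantity could be computed, not the paper's definition. It applies gradient steps on two mini-batches in both orders (A then B, B then A) and measures how far apart the resulting parameters end up. The function names, the normalization by the product of step norms, and the toy least-squares model are all assumptions made for illustration.

```python
import numpy as np


def commutator_defect(theta, grad_a, grad_b, lr=1e-2, eps=1e-12):
    """Illustrative (hypothetical) commutator defect for two gradient updates.

    grad_a and grad_b map a parameter vector to the loss gradient on
    mini-batch A and mini-batch B. The normalization below is an assumption,
    chosen so the result does not depend on the learning rate for a
    quadratic loss.
    """
    g_a = grad_a(theta)
    g_b = grad_b(theta)

    # Order 1: step on batch A, then on batch B.
    theta_ab = theta - lr * g_a
    theta_ab = theta_ab - lr * grad_b(theta_ab)

    # Order 2: step on batch B, then on batch A.
    theta_ba = theta - lr * g_b
    theta_ba = theta_ba - lr * grad_a(theta_ba)

    # If the two updates commuted, both orders would land on the same point.
    gap = np.linalg.norm(theta_ab - theta_ba)
    scale = np.linalg.norm(lr * g_a) * np.linalg.norm(lr * g_b)
    return gap / (scale + eps)


def make_mse_grad(X, y):
    """Gradient of 0.5 * mean squared error for a linear model (toy stand-in)."""
    def grad(theta):
        return X.T @ (X @ theta - y) / len(y)
    return grad


# Toy data: two different mini-batches of a 4-parameter regression problem.
rng = np.random.default_rng(0)
X_a, X_b = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
y_a, y_b = rng.normal(size=8), rng.normal(size=8)
theta0 = rng.normal(size=4)

grad_a = make_mse_grad(X_a, y_a)
grad_b = make_mse_grad(X_b, y_b)

defect = commutator_defect(theta0, grad_a, grad_b)
same_batch = commutator_defect(theta0, grad_a, grad_a)
print(f"defect (A vs B): {defect:.4f}")
print(f"defect (A vs A): {same_batch:.4f}")
```

Using the same batch for both updates gives a defect of zero, since the two orders are identical. With different batches the gap scales with the squared learning rate and depends on how each batch's curvature acts on the other batch's gradient, which is one reading of why the quantity is described as a curvature measure. Tracking such a value over training would require gradient functions for the actual network, which this sketch does not provide.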
