Distributional Spectral Diagnostics for Localizing Grokking Transitions
Ziyue Wang, Yufeng Ying, Takafumi Kanamori

TL;DR
This paper introduces a spectral diagnostic that maps training observables to Wasserstein/quantile coordinates and applies Hankel dynamic mode decomposition (DMD) to localize grokking transitions in training trajectories, with alarms that can precede the rise in test accuracy.
Contribution
It formulates grokking transition localization as a diagnostic problem with an explicit threshold/FPR/lead-time trade-off, and reports AUROC ≈ 0.93 for run-level grokking-vs-non-grokking discrimination on held-out modular-addition Transformer runs.
Findings
The DMD reconstruction residual achieves AUROC ≈ 0.93 for run-level grokking-vs-non-grokking discrimination.
High-residual windows show about 3× larger perturbation deviations.
Norm signals and log-probability are effective regime indicators.
Abstract
In grokking, a model first fits the training data while test accuracy remains low, and only later begins to generalize. We ask whether this transition can be localized from observed training trajectories before the test accuracy rises, and formulate grokking transition localization as a diagnostic problem with an explicit threshold/FPR/lead-time trade-off. Task-dependent observables are summarized as empirical distributions, mapped to Wasserstein/quantile coordinates, and analyzed by Hankel dynamic mode decomposition (DMD); the resulting reconstruction residual, together with spectrum and effective rank, forms the diagnostic output. On held-out modular-addition Transformer runs, the residual achieves AUROC ≈ 0.93 for grokking-vs-non-grokking discrimination at the run level; under a fixed sustained-threshold operating rule, true-positive alarms can precede onset, with lead…
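The pipeline described in the abstract (empirical distribution, then quantile coordinates, then Hankel DMD, then reconstruction residual) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the number of quantile levels, the delay length, the truncation rank, and the synthetic trajectories are all assumptions chosen for illustration, and the residual here is a plain one-step relative reconstruction error of a standard projected DMD fit.

```python
import numpy as np

def quantile_coords(samples, n_q=8):
    """Map a 1-D empirical distribution to quantile coordinates (hypothetical helper)."""
    levels = (np.arange(n_q) + 0.5) / n_q  # midpoint quantile levels in (0, 1)
    return np.quantile(samples, levels)

def hankel(X, delay):
    """Stack `delay` consecutive snapshots (columns of X) into each Hankel column."""
    n_cols = X.shape[1] - delay + 1
    return np.vstack([X[:, i:i + n_cols] for i in range(delay)])

def dmd_residual(X, delay=4, rank=3):
    """Relative one-step residual and eigenvalues of a rank-truncated Hankel DMD fit."""
    H = hankel(X, delay)
    H0, H1 = H[:, :-1], H[:, 1:]
    U, s, Vt = np.linalg.svd(H0, full_matrices=False)
    r = min(rank, int(np.sum(s > 1e-10 * s[0])))  # drop numerically zero modes
    U, s, Vt = U[:, :r], s[:r], Vt[:r]
    A_r = U.T @ H1 @ Vt.T / s            # reduced operator U^T H1 V S^{-1}
    H1_hat = U @ (A_r @ (U.T @ H0))      # one-step prediction of H1 from H0
    residual = np.linalg.norm(H1 - H1_hat) / np.linalg.norm(H1)
    return residual, np.linalg.eigvals(A_r)

def trajectory(T=80, w=0.2, shift_at=None):
    """Synthetic observable distributions; optional abrupt mean shift (assumed toy data)."""
    z = np.random.default_rng(0).standard_normal(256)  # fixed base sample
    cols = []
    for t in range(T):
        mu = np.cos(w * t) + (2.0 if shift_at is not None and t >= shift_at else 0.0)
        sigma = 1.0 + 0.5 * np.sin(w * t)
        cols.append(quantile_coords(mu + sigma * z))
    return np.stack(cols, axis=1)  # shape (n_q, T)

# Smooth regime: the quantile coordinates follow exactly linear dynamics.
res_smooth, eig_smooth = dmd_residual(trajectory())
# Regime change: an abrupt shift at step 40 breaks the low-rank linear fit.
res_break, _ = dmd_residual(trajectory(shift_at=40))
print(f"smooth residual: {res_smooth:.2e}, with regime change: {res_break:.2e}")
```

In this toy setting the residual stays near machine precision while the trajectory follows a single low-rank linear regime and grows once a regime change enters the window, which is the qualitative behavior a residual-based diagnostic relies on. The thresholding rule, the choice of observables, and the window lengths used in the paper are not specified here and would need to be taken from the full text.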
