Late-Stage Generalization Collapse in Grokking: Detecting anti-grokking with Weightwatcher
Hari K Prakash, Charles H Martin

TL;DR
This paper identifies a new phase called anti-grokking, where neural networks lose their generalization after initially succeeding, and introduces spectral diagnostics using WeightWatcher to detect this phenomenon.
Contribution
The study uncovers anti-grokking as a late-stage collapse of generalization and proposes spectral eigenvalue analysis with WeightWatcher as a diagnostic tool.
Findings
Anti-grokking occurs after successful generalization, leading to test accuracy collapse.
Correlation Traps (large eigenvalues in the weight spectra) signal anti-grokking.
Large-scale LLMs exhibit similar anti-grokking pathologies.
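To make the idea of "large eigenvalues in weight spectra" concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the WeightWatcher implementation and not necessarily the authors' procedure. It assumes that a trap can be flagged by shuffling a layer's weights elementwise, which removes learned correlations, and then counting eigenvalues of the shuffled matrix's correlation matrix that lie above the Marchenko-Pastur bulk edge expected for a purely random matrix. The function names, the `tol` margin, and the toy matrices are all hypothetical.

```python
import numpy as np


def mp_bulk_edge(n_rows, n_cols, sigma2=1.0):
    """Upper Marchenko-Pastur edge for X = W^T W / n_rows (n_rows >= n_cols),
    where W has i.i.d. zero-mean entries with variance sigma2."""
    q = n_cols / n_rows
    return sigma2 * (1.0 + np.sqrt(q)) ** 2


def count_outlier_eigenvalues(W, tol=1.1, seed=0):
    """Count eigenvalues of the shuffled weight matrix above the MP bulk edge.

    Assumption (not from the source): shuffling the entries keeps their
    distribution but destroys learned structure, so any eigenvalue well above
    the random-matrix edge points to a few anomalously large entries.
    `tol` is a hypothetical safety margin for finite-size fluctuations.
    """
    rng = np.random.default_rng(seed)
    W_rand = rng.permutation(W.ravel()).reshape(W.shape)
    if W_rand.shape[0] < W_rand.shape[1]:
        W_rand = W_rand.T  # orient so that rows >= columns
    W_rand = W_rand - W_rand.mean()  # MP theory assumes zero-mean entries
    n_rows, n_cols = W_rand.shape
    evals = np.linalg.eigvalsh(W_rand.T @ W_rand / n_rows)
    edge = mp_bulk_edge(n_rows, n_cols, sigma2=W_rand.var())
    return int(np.sum(evals > tol * edge))


# Toy data: a purely random "layer" and a copy with one very large entry.
rng = np.random.default_rng(42)
clean = rng.standard_normal((400, 100))
trapped = clean.copy()
trapped[0, 0] = 50.0

n_clean = count_outlier_eigenvalues(clean)
n_trapped = count_outlier_eigenvalues(trapped)
print(n_clean, n_trapped)
```

In this toy setup the purely random matrix should produce no outliers, while the matrix with a single oversized entry should produce at least one. For real models, the WeightWatcher tool named in the abstract is the appropriate instrument; this sketch only shows the kind of spectral comparison involved.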
Abstract
Memorization in neural networks lacks a precise operational definition and is often inferred from the grokking regime, where training accuracy saturates while test accuracy remains very low. We identify a previously unreported third phase of grokking in this training regime: anti-grokking, a late-stage collapse of generalization. We revisit two canonical grokking setups, a 3-layer MLP trained on a subset of MNIST and a transformer trained on modular addition, but extend training far beyond the standard duration. In both cases, after models transition from pre-grokking to successful generalization, test accuracy collapses back to chance while training accuracy remains perfect, indicating a distinct post-generalization failure mode. To diagnose anti-grokking, we use the open-source WeightWatcher tool based on HTSR/SETOL theory. The primary signal is the emergence of…
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Machine Learning in Materials Science · Stochastic Gradient Optimization Techniques
