Late-Stage Generalization Collapse in Grokking: Detecting anti-grokking with Weightwatcher

Hari K Prakash; Charles H Martin

arXiv:2602.02859 · cs.LG · February 4, 2026

Open Access

TL;DR

This paper identifies a new phase called anti-grokking, where neural networks lose their generalization after initially succeeding, and introduces spectral diagnostics using WeightWatcher to detect this phenomenon.

Contribution

The study uncovers anti-grokking as a late-stage collapse of generalization and proposes spectral eigenvalue analysis with WeightWatcher as a diagnostic tool.

Findings

01. Anti-grokking occurs after successful generalization and leads to a collapse in test accuracy.

02. Correlation Traps, that is, large eigenvalues in the weight spectra, signal anti-grokking.

03. Large-scale LLMs exhibit similar anti-grokking pathologies.
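
As a rough illustration of the second finding, the sketch below flags unusually large eigenvalues in a single weight matrix. It assumes a specific criterion that is not stated on this page: the matrix is shuffled elementwise, and any eigenvalue of the shuffled correlation matrix above the Marchenko-Pastur bulk edge counts as a trap candidate. The function names, the `slack` factor, and the synthetic matrices are hypothetical, and this is not WeightWatcher's implementation.

```python
import numpy as np


def mp_bulk_edge(n_rows, n_cols, sigma2=1.0):
    """Upper Marchenko-Pastur edge for X = W^T W / N, with W of shape N x M, N >= M."""
    n_big, n_small = max(n_rows, n_cols), min(n_rows, n_cols)
    q = n_small / n_big
    return sigma2 * (1.0 + np.sqrt(q)) ** 2


def count_trap_candidates(W, rng, slack=1.1):
    """Count eigenvalues of the element-shuffled W that sit above the MP bulk edge.

    Assumption: entries of W are roughly zero-mean, so the MP law applies.
    `slack` is a hypothetical safety margin for finite-size fluctuations.
    """
    # Shuffling entries destroys learned correlations but keeps the entry distribution.
    W_rand = rng.permutation(W.ravel()).reshape(W.shape)
    n_big = max(W.shape)
    singular_values = np.linalg.svd(W_rand, compute_uv=False)
    eigs = singular_values**2 / n_big
    edge = mp_bulk_edge(*W.shape, sigma2=W_rand.var())
    return int(np.sum(eigs > slack * edge)), float(eigs.max()), float(edge)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(400, 200))   # synthetic stand-in for a layer
    W_bad = W.copy()
    W_bad[0, 0] = 50.0                # one anomalously large element
    print("clean layer:", count_trap_candidates(W, rng))
    print("perturbed layer:", count_trap_candidates(W_bad, rng))
```

A single oversized element survives the shuffle and produces an eigenvalue outside the bulk, which is why the check runs on the shuffled matrix rather than the original one.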

Abstract

Memorization in neural networks lacks a precise operational definition and is often inferred from the grokking regime, where training accuracy saturates while test accuracy remains very low. We identify a previously unreported third phase of grokking in this training regime: anti-grokking, a late-stage collapse of generalization. We revisit two canonical grokking setups, a 3-layer MLP trained on a subset of MNIST and a transformer trained on modular addition, but extend training far beyond the standard duration. In both cases, after models transition from pre-grokking to successful generalization, test accuracy collapses back to chance while training accuracy remains perfect, indicating a distinct post-generalization failure mode. To diagnose anti-grokking, we use the open-source WeightWatcher tool based on HTSR/SETOL theory. The primary signal is the emergence of…
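
The phases described in the abstract can, in principle, be told apart from logged accuracy curves alone. The following is a minimal sketch under stated assumptions: the thresholds `fit_thresh` and `gen_thresh`, the label names, and the example curves are hypothetical and do not come from the paper.

```python
def label_phases(train_acc, test_acc, fit_thresh=0.99, gen_thresh=0.9):
    """Assign a phase label to each logged checkpoint.

    Assumed (hypothetical) rules:
      - "fitting":       training accuracy has not yet saturated
      - "pre-grokking":  training saturated, test accuracy still low
      - "grokking":      training saturated, test accuracy high
      - "anti-grokking": training saturated, test accuracy low again
                         after the model had already generalized
    """
    labels = []
    has_generalized = False
    for train, test in zip(train_acc, test_acc):
        if train < fit_thresh:
            labels.append("fitting")
        elif test >= gen_thresh:
            has_generalized = True
            labels.append("grokking")
        elif has_generalized:
            labels.append("anti-grokking")
        else:
            labels.append("pre-grokking")
    return labels


if __name__ == "__main__":
    # Made-up curves, for illustration only.
    train_curve = [0.50, 1.00, 1.00, 1.00, 1.00]
    test_curve = [0.10, 0.10, 0.95, 0.95, 0.10]
    print(label_phases(train_curve, test_curve))
```

Accuracy curves only reveal the collapse after it has happened, which is the gap a spectral diagnostic such as the one the abstract mentions is meant to address.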

Taxonomy

Topics: Ferroelectric and Negative Capacitance Devices · Machine Learning in Materials Science · Stochastic Gradient Optimization Techniques