Weight Decay Regimes in Grokking Transformers: Cheap Online Diagnostics
Lucky Verma

TL;DR
This paper investigates how weight decay selects the training regime of transformers on modular arithmetic, and introduces online diagnostics that track transitions between memorization, generalization, and collapse across three model scales (0.82M to 85M parameters).
Contribution
It demonstrates that weight decay acts as an empirical control parameter for training regimes and introduces inexpensive online diagnostics for monitoring these transitions.
Findings
Weight decay separates memorization, grokking, and collapse regimes.
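A near-transition logistic fit places the memorization-to-developmental boundary at λ=0.0158 (95% CI [0.0109, 0.0200], N=210).
Two diagnostics computed from attention activations alone, mean pairwise attention-head cosine similarity and entropy standard deviation, track training dynamics at lower compute cost than loss-landscape diagnostics.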
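One way a regime boundary of this kind could be localized is sketched below: a logistic regression of a binary regime label on log10 of the weight decay, with the boundary read off as the 50% crossing. This is an illustration under stated assumptions, not the paper's procedure. The function name `fit_logistic_boundary`, the Newton solver, the sweep range, the slope, and the labels are all hypothetical; the labels are synthetic draws centered on the reported λ=0.0158 purely so the sketch has something to recover.

```python
import numpy as np

def fit_logistic_boundary(lam, y, n_iter=50, ridge=1e-8):
    """Fit P(y=1) = sigmoid(a + b * (log10(lam) - mean)) by Newton's method.

    Returns (boundary, slope), where boundary is the weight decay at which
    the fitted probability equals 0.5. Hypothetical helper, not from the paper.
    """
    x = np.log10(np.asarray(lam, dtype=float))
    y = np.asarray(y, dtype=float)
    x_mean = x.mean()
    X = np.column_stack([np.ones_like(x), x - x_mean])  # centered design matrix
    w = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        grad = X.T @ (y - p) - ridge * w                 # log-likelihood gradient
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + ridge * np.eye(2)
        step = np.linalg.solve(hess, grad)
        w = w + step
        if np.max(np.abs(step)) < 1e-10:
            break
    a, b = w
    # a + b * (x - x_mean) = 0  =>  x = x_mean - a / b
    return 10.0 ** (x_mean - a / b), b

# Synthetic sweep (assumption): 210 runs, log-uniform weight decay,
# label 1 = developmental regime, label 0 = memorization.
rng = np.random.default_rng(0)
assumed_boundary, assumed_slope = 0.0158, 10.0
lam = 10.0 ** rng.uniform(np.log10(0.005), np.log10(0.05), size=210)
p_true = 1.0 / (1.0 + np.exp(-assumed_slope * (np.log10(lam) - np.log10(assumed_boundary))))
y = rng.random(210) < p_true

boundary, slope = fit_logistic_boundary(lam, y)
print(f"estimated boundary: {boundary:.4f}, slope per decade: {slope:.2f}")
```

Working in log10(λ) keeps the fit well conditioned when the sweep spans an order of magnitude. A confidence interval could be obtained by refitting on bootstrap resamples of the runs, although the interval quoted above may have been computed differently.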
Abstract
Transformers trained on modular arithmetic exhibit sharp transitions between memorization, generalization, and collapse. We show that weight decay acts as a scalar empirical control parameter for these regimes, and introduce two cheap online diagnostics, mean pairwise attention-head cosine similarity and entropy standard deviation, that track training dynamics from attention activations alone and complement loss-landscape diagnostics at lower compute cost. Across eleven experimental conditions and three model scales (0.82M to 85M parameters), the weight-decay axis separates memorization, developmental grokking, and collapse. A near-transition logistic fit localizes the memorization-to-developmental boundary at λ=0.0158 (95% CI [0.0109, 0.0200], N=210); a power-law fit gives an empirical exponent (CI [0.725, 0.799]). Reference exponents and 3D Ising $\nu…
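The two diagnostics named in the abstract can be sketched in a few lines of NumPy. The abstract does not give exact definitions, so the following are assumptions: attention weights arrive as an array of shape (batch, heads, query, key) with rows summing to 1; head similarity is the cosine between flattened per-example attention maps, averaged over distinct head pairs; and "entropy standard deviation" is taken as the standard deviation across heads of each head's mean row entropy. The function names are hypothetical.

```python
import numpy as np

def mean_pairwise_head_cosine(attn):
    """Mean cosine similarity over distinct head pairs (needs >= 2 heads).

    attn: array of shape (batch, heads, query, key).
    """
    b, h, q, k = attn.shape
    flat = attn.reshape(b, h, q * k)
    norms = np.linalg.norm(flat, axis=-1, keepdims=True)
    unit = flat / np.clip(norms, 1e-12, None)
    sim = unit @ unit.transpose(0, 2, 1)      # (batch, heads, heads)
    i, j = np.triu_indices(h, k=1)            # distinct pairs only
    return float(sim[:, i, j].mean())

def entropy_std(attn, eps=1e-12):
    """Std across heads of the mean Shannon entropy of attention rows."""
    row_entropy = -(attn * np.log(attn + eps)).sum(axis=-1)  # (batch, heads, query)
    per_head = row_entropy.mean(axis=(0, 2))                 # (heads,)
    return float(per_head.std())

# Usage on random softmax attention (toy input, not model activations).
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8, 16, 16))
attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
print(mean_pairwise_head_cosine(attn), entropy_std(attn))
```

Both quantities need only the attention weights from a forward pass, so under these assumptions they can be logged during training without extra backward passes or Hessian computations.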
