Weight Decay Regimes in Grokking Transformers: Cheap Online Diagnostics

Lucky Verma

arXiv:2605.20441·cs.LG·May 21, 2026

TL;DR

This paper investigates how weight decay influences training regimes in transformers trained on modular arithmetic. It introduces online diagnostics that track transitions between memorization, generalization, and collapse across three model scales (0.82M to 85M parameters).

Contribution

It demonstrates that weight decay acts as an empirical control parameter for training regimes and introduces inexpensive online diagnostics for monitoring these transitions.

Findings

01

Weight decay separates memorization, grokking, and collapse regimes.

02

A near-transition logistic fit places the memorization-to-developmental boundary at λ = 0.0158 (95% CI [0.0109, 0.0200], N=210).

03

Two diagnostics, mean pairwise attention-head cosine similarity and entropy standard deviation, track training dynamics from attention activations alone.
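As a rough illustration of the two diagnostics named in the findings, the sketch below computes both from a single attention tensor. The listing does not give the paper's exact definitions, so the details here are assumptions: attention weights are taken to have shape (heads, queries, keys), each head's map is flattened before computing cosine similarity, and the entropy statistic is the standard deviation across heads of the mean per-row Shannon entropy. The function names are hypothetical.

```python
import numpy as np

def mean_pairwise_head_cosine(attn):
    """Mean cosine similarity over all distinct pairs of heads.

    attn: array of shape (heads, queries, keys); needs at least 2 heads.
    Assumption: each head's attention map is flattened to one vector.
    """
    h = attn.shape[0]
    flat = attn.reshape(h, -1)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    unit = flat / np.clip(norms, 1e-12, None)   # guard against zero vectors
    sim = unit @ unit.T                          # (heads, heads) cosine matrix
    upper = np.triu_indices(h, k=1)              # distinct pairs only
    return float(sim[upper].mean())

def entropy_std(attn, eps=1e-12):
    """Standard deviation across heads of the mean attention-row entropy.

    Assumption: rows of attn are probability distributions over keys.
    """
    row_entropy = -(attn * np.log(attn + eps)).sum(axis=-1)  # (heads, queries)
    per_head = row_entropy.mean(axis=1)                      # (heads,)
    return float(per_head.std())
```

Both functions only read attention activations, so under these assumptions they could be evaluated on a held-out batch at each logging step without extra backward passes.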

Abstract

Transformers trained on modular arithmetic exhibit sharp transitions between memorization, generalization, and collapse. We show that weight decay acts as a scalar empirical control parameter for these regimes, and introduce two cheap online diagnostics, mean pairwise attention-head cosine similarity and entropy standard deviation, that track training dynamics from attention activations alone and complement loss-landscape diagnostics at lower compute cost. Across eleven experimental conditions and three model scales (0.82M to 85M parameters), the weight-decay axis separates memorization, developmental grokking, and collapse. A near-transition logistic fit localizes the memorization-to-developmental boundary at $\lambda_{c} = 0.0158$ (95% CI [0.0109, 0.0200], N=210); a power-law fit gives an empirical exponent $\nu = 0.757$ (CI [0.725, 0.799]). Reference exponents $\nu = 1/2$ and 3D Ising $\nu…
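To make the idea of localizing a regime boundary with a logistic fit concrete, here is a minimal sketch. It is not the paper's procedure: it assumes each run is labeled 1 if it reached the developmental regime and 0 if it stayed in memorization, uses $\log_{10}\lambda$ as the single predictor, and fits by plain Newton iterations in NumPy. The function name and the labeling scheme are hypothetical. The boundary estimate is the point where the fitted probability equals 0.5.

```python
import numpy as np

def fit_logistic_boundary(log_wd, outcome, n_iter=50):
    """Fit p(outcome=1) = sigmoid(a + b * log_wd) and return (boundary, slope).

    boundary is the value of log_wd where the fitted probability is 0.5,
    i.e. -a / b. Assumes the data are not perfectly separable.
    """
    x = np.asarray(log_wd, dtype=float)
    y = np.asarray(outcome, dtype=float)
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (y - p)                       # gradient of log-likelihood
        w = p * (1.0 - p)
        # Negative Hessian, with a tiny diagonal term for numerical stability.
        hess = X.T @ (X * w[:, None]) + 1e-9 * np.eye(2)
        beta = beta + np.linalg.solve(hess, grad)  # Newton step
    intercept, slope = beta
    return -intercept / slope, slope

# Synthetic example only: five weight-decay levels, ten runs each,
# with the fraction of developmental runs rising with weight decay.
levels = np.array([-2.8, -2.3, -1.8, -1.3, -0.8])      # log10(lambda)
successes = [1, 3, 5, 7, 9]
x = np.repeat(levels, 10)
y = np.concatenate([np.r_[np.ones(s), np.zeros(10 - s)] for s in successes])
boundary, slope = fit_logistic_boundary(x, y)
lambda_c = 10.0 ** boundary                            # back to linear scale
```

A confidence interval like the one quoted in the abstract could be obtained by bootstrapping runs and refitting, though the listing does not say which method the paper uses.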

Code & Models

Datasets

lucky-verma/grokking-diagnostics-runs (dataset, 3.3k downloads)
