The Lifecycle of the Spectral Edge: From Gradient Learning to Weight-Decay Compression
Yongzhong Xu

TL;DR
This paper analyzes the spectral edge during grokking and finds a two-phase lifecycle: before grokking the edge is gradient-driven, and at grokking the gradient and weight-decay components align, turning the edge into a compression axis along which information is re-encoded rather than lost.
Contribution
It introduces a decomposition of the spectral edge into gradient and weight decay components, uncovering a universal lifecycle and the role of compression in grokking.
Findings
The spectral edge exhibits a two-phase lifecycle during grokking.
Alignment of gradient and weight decay at grokking creates a critical compression axis.
Information is re-encoded rather than lost, and removing weight decay reverses compression.
Abstract
We decompose the spectral edge (the dominant direction of the Gram matrix of parameter updates) into its gradient and weight-decay components during grokking in two sequence tasks (Dyck-1 and SCAN). We find a sharp two-phase lifecycle: before grokking the edge is gradient-driven and functionally active; at grokking, gradient and weight decay align, and the edge becomes a compression axis that is perturbation-flat yet ablation-critical (>4000x more impactful than random directions). Three universality classes emerge (functional, mixed, compression), predicted by the gap flow equation. Nonlinear probes show information is re-encoded, not lost (an MLP probe recovers it where a linear probe does not), and removing weight decay post-grok reverses compression while preserving the algorithm.
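The decomposition described in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's code. It assumes each parameter update over a window of T steps has already been split into a gradient-driven part and a weight-decay-driven part (how that split is obtained depends on the optimizer and is not specified here). The function name `decompose_spectral_edge` and the returned keys are hypothetical. The sketch takes the top eigenvector of the T x T Gram matrix of total updates, maps it back to a unit direction in parameter space, and splits the top eigenvalue exactly into a gradient term, a weight-decay term, and a cross term. A positive cross term means the two components push the same way along the edge.

```python
import numpy as np


def decompose_spectral_edge(grad_steps, decay_steps):
    """Split the top Gram eigenvalue into gradient / weight-decay parts.

    grad_steps, decay_steps: arrays of shape (T, P) holding, for each of T
    steps, the gradient-driven and weight-decay-driven parts of the update
    to P parameters (assumed inputs; names are illustrative only).
    """
    grad_steps = np.asarray(grad_steps, dtype=float)
    decay_steps = np.asarray(decay_steps, dtype=float)

    updates = grad_steps + decay_steps       # total update per step, (T, P)
    gram = updates @ updates.T               # (T, T) Gram matrix of updates

    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(gram)
    top_eigval = eigvals[-1]

    # Map the top eigenvector from step space to a unit direction in
    # parameter space: this is the "edge" direction in this sketch.
    edge = updates.T @ eigvecs[:, -1]
    edge /= np.linalg.norm(edge)

    # Per-step projections of each component onto the edge.
    grad_proj = grad_steps @ edge
    decay_proj = decay_steps @ edge

    # top_eigval = sum_t (grad_proj_t + decay_proj_t)^2, which expands into
    # the three terms below.
    return {
        "edge": edge,
        "top_eigval": float(top_eigval),
        "grad_energy": float(grad_proj @ grad_proj),
        "decay_energy": float(decay_proj @ decay_proj),
        "cross_energy": float(2.0 * (grad_proj @ decay_proj)),
    }
```

Under these assumptions, a gradient-driven edge would show `grad_energy` dominating, while aligned gradient and weight decay would show a large positive `cross_energy`. The sign of `edge` is arbitrary, but all three energy terms are unaffected by it.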
