The Lifecycle of the Spectral Edge: From Gradient Learning to Weight-Decay Compression

Yongzhong Xu

arXiv:2604.07380·cs.LG·April 10, 2026

TL;DR

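This paper decomposes the spectral edge during grokking into gradient and weight-decay components and finds a two-phase lifecycle: the edge is gradient-driven before grokking, then the two components align at grokking and the edge becomes a compression axis. Nonlinear probes indicate that information is re-encoded rather than lost.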

Contribution

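It introduces a decomposition of the spectral edge (the dominant direction of the Gram matrix of parameter updates) into gradient and weight-decay components, and uses it to identify a two-phase lifecycle, three universality classes, and the role of weight-decay compression in grokking.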

Findings

01

Spectral edge exhibits a two-phase lifecycle during grokking.

02

Gradient and weight decay align at grokking, turning the edge into a compression axis that is perturbation-flat yet ablation-critical.

03

Information is re-encoded rather than lost, and removing weight decay reverses compression.

Abstract

We decompose the spectral edge -- the dominant direction of the Gram matrix of parameter updates -- into its gradient and weight-decay components during grokking in two sequence tasks (Dyck-1 and SCAN). We find a sharp two-phase lifecycle: before grokking the edge is gradient-driven and functionally active; at grokking, gradient and weight decay align, and the edge becomes a compression axis that is perturbation-flat yet ablation-critical (>4000x more impactful than random directions). Three universality classes emerge (functional, mixed, compression), predicted by the gap flow equation. Nonlinear probes show information is re-encoded, not lost (MLP $R^{2} = 0.99$ where linear $R^{2} = 0.86$), and removing weight decay post-grok reverses compression while preserving the algorithm.
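
The decomposition described in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's implementation. It assumes plain SGD with decoupled weight decay, so each update is $-\eta\,(g_t + \lambda\,\theta_t)$, and a window of $T$ consecutive updates stacked as rows. The function name `spectral_edge_decomposition`, the returned keys, and the "share" and "alignment" quantities are hypothetical choices made for illustration; the abstract does not specify the window, the optimizer, or the exact alignment measure.

```python
import numpy as np


def spectral_edge_decomposition(grads, weights, lr, wd):
    """Split the top direction of the update Gram matrix into two parts.

    grads, weights: arrays of shape (T, P), one row per training step.
    Assumes (hypothetically) the update rule delta_t = -lr * (g_t + wd * theta_t).
    """
    grads = np.asarray(grads, dtype=float)
    weights = np.asarray(weights, dtype=float)

    grad_part = -lr * grads            # gradient component of each update
    wd_part = -lr * wd * weights       # weight-decay component of each update
    updates = grad_part + wd_part      # full updates, shape (T, P)

    # Gram matrix of the updates (T x T); its top eigenvector picks the
    # dominant combination of steps in the window.
    gram = updates @ updates.T
    eigvals, eigvecs = np.linalg.eigh(gram)   # eigenvalues in ascending order
    u = eigvecs[:, -1]
    sigma = np.sqrt(max(eigvals[-1], 0.0))
    if sigma == 0.0:
        raise ValueError("all updates are zero; the edge is undefined")

    # Map the step-space eigenvector back to parameter space. By linearity,
    # the unit-norm edge is the sum of its gradient and weight-decay parts.
    edge_grad = grad_part.T @ u / sigma
    edge_wd = wd_part.T @ u / sigma
    edge = edge_grad + edge_wd

    # Cosine between the two parts: one possible alignment measure.
    denom = np.linalg.norm(edge_grad) * np.linalg.norm(edge_wd)
    alignment = float(edge_grad @ edge_wd / denom) if denom > 0 else 0.0

    return {
        "edge": edge,
        "grad_share": float(edge @ edge_grad),  # the two shares sum to 1
        "wd_share": float(edge @ edge_wd),
        "alignment": alignment,
    }
```

Under these assumptions, a gradient-driven edge would show `grad_share` near 1, while a compression-phase edge would show a larger `wd_share` and a positive `alignment`. Working with the $T \times T$ Gram matrix instead of the $P \times P$ covariance keeps the eigendecomposition cheap when the window is much shorter than the parameter count.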
