The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K--V Asymmetry
Yi Liu

TL;DR
This study systematically analyzes transformer weight spectra during training, revealing transient compression waves, persistent spectral gradients, and an asymmetry between value/output and query/key projections, with implications for layer importance and pruning.
Contribution
It introduces the first comprehensive spectral analysis during transformer training, uncovering novel phenomena and formalizing a two-timescale dynamical model with practical scaling laws.
Findings
Rank compression propagates as a wave from early to late layers.
The power-law exponent develops a persistent depth gradient that forms a non-monotonic inverted-U in deeper models.
Spectral features predict layer importance and improve pruning performance.
Abstract
We present the first systematic study of weight matrix singular value spectra during transformer pretraining, tracking full SVD decompositions of every weight matrix at 25-step intervals across three model scales (30M to 285M parameters). We discover three phenomena: (1) Transient Compression Waves: stable rank compression propagates as a traveling wave from early to late layers, creating a dramatic gradient that peaks early then reverses, with late layers eventually over-compressing past early layers. (2) Persistent Spectral Gradients: the power-law exponent develops a permanent depth gradient forming a non-monotonic inverted-U in deeper models, with peaks shifting toward earlier layers as depth increases. (3) Q/K--V Functional Asymmetry: value/output projections compress uniformly while query/key projections carry the full depth-dependent…
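As a rough illustration of the two per-matrix quantities the abstract tracks, the sketch below computes the stable rank (squared Frobenius norm divided by squared spectral norm, the standard definition) and a crude power-law exponent for a single weight matrix. The function names are hypothetical, and the log-log least-squares fit of singular value against rank index is an assumption made here for illustration; the paper's own estimator for the exponent is not given in this excerpt.

```python
import numpy as np


def stable_rank(weight: np.ndarray) -> float:
    """Stable rank: sum of squared singular values over the largest one squared."""
    s = np.linalg.svd(weight, compute_uv=False)  # singular values, descending
    if s[0] == 0.0:
        return 0.0  # all-zero matrix: define the stable rank as 0
    return float(np.sum(s**2) / s[0] ** 2)


def loglog_spectral_slope(weight: np.ndarray) -> float:
    """Assumed estimator: fit s_k ~ k^(-alpha) by least squares in log-log space.

    Returns alpha, so a faster-decaying spectrum gives a larger value.
    """
    s = np.linalg.svd(weight, compute_uv=False)
    s = s[s > 0]  # drop exact zeros before taking logs
    ranks = np.arange(1, len(s) + 1)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(s), 1)
    return float(-slope)


if __name__ == "__main__":
    # Hypothetical stand-in for one projection matrix at one checkpoint.
    rng = np.random.default_rng(0)
    w = rng.standard_normal((64, 64))
    print("stable rank:", stable_rank(w))
    print("log-log slope:", loglog_spectral_slope(w))
```

Applying these two functions to every weight matrix at each saved checkpoint, then grouping by layer index and projection type, would give per-layer curves of the kind the abstract describes, under the assumptions stated above.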
