The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K–V Asymmetry

Yi Liu

arXiv:2604.22778 · cs.LG · April 28, 2026

TL;DR

This study tracks transformer weight spectra throughout training and reports transient compression waves, persistent spectral gradients, and an asymmetry between value/output and query/key projections, with implications for estimating layer importance and for pruning.

Contribution

The paper presents what it describes as the first systematic spectral analysis of weights during transformer pretraining, reports previously undescribed phenomena, and formalizes a two-timescale dynamical model with practical scaling laws.

Findings

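01 Stable-rank compression propagates as a traveling wave from early to late layers, then reverses as late layers over-compress past early layers.

02 The power-law exponent α develops a persistent depth gradient that forms an inverted-U in deeper models.

03 Spectral features predict layer importance and improve pruning performance.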

Abstract

We present the first systematic study of weight matrix singular value spectra during transformer pretraining, tracking full SVD decompositions of every weight matrix at 25-step intervals across three model scales (30M to 285M parameters). We discover three phenomena: (1) Transient Compression Waves: stable rank compression propagates as a traveling wave from early to late layers, creating a dramatic gradient that peaks early then reverses; late layers eventually over-compress past early layers. (2) Persistent Spectral Gradients: the power-law exponent α develops a permanent depth gradient forming a non-monotonic inverted-U in deeper models, with peaks shifting toward earlier layers as depth increases. (3) Q/K–V Functional Asymmetry: value/output projections compress uniformly while query/key projections carry the full depth-dependent…
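The two quantities the abstract tracks, stable rank and the power-law exponent α, can both be read off the singular values of a weight matrix. The following is a minimal sketch, not the paper's code. Stable rank uses the standard definition ‖W‖_F² / ‖W‖₂². The exponent α is estimated here by a least-squares line fit of log σ_i against log i, which is an assumption, since the abstract does not say which estimator the author uses. The function names and the `top_k` parameter are hypothetical.

```python
import numpy as np


def singular_values(W):
    """Singular values of a 2-D weight matrix, in descending order."""
    return np.linalg.svd(np.asarray(W, dtype=np.float64), compute_uv=False)


def stable_rank(W):
    """Stable rank ||W||_F^2 / ||W||_2^2 (standard definition).

    Assumes W is not the zero matrix.
    """
    s = singular_values(W)
    return float(np.sum(s ** 2) / s[0] ** 2)


def spectral_alpha(W, top_k=None):
    """Estimate alpha in sigma_i ~ i^(-alpha) with a log-log line fit.

    ASSUMPTION: this rank-ordered fit is one simple estimator chosen for
    illustration; the paper may define and fit alpha differently.
    `top_k` (hypothetical) restricts the fit to the largest singular values.
    """
    s = singular_values(W)
    s = s[s > 0]  # log is undefined at zero
    if top_k is not None:
        s = s[:top_k]
    ranks = np.arange(1, len(s) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(s), 1)
    return float(-slope)


if __name__ == "__main__":
    # Random matrix standing in for one layer's weights at one checkpoint.
    rng = np.random.default_rng(0)
    W = rng.standard_normal((64, 256))
    print("stable rank:", stable_rank(W))
    print("alpha estimate:", spectral_alpha(W))
```

Applying both functions to every weight matrix at each saved checkpoint would give per-layer curves over training steps, which is the kind of trace in which a depth-dependent compression wave or α gradient could be looked for.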
