A Random Matrix Theory Perspective on the Learning Dynamics of Multi-head Latent Attention
Nandan Kumar Jha, Brandon Reagen

TL;DR
This paper uses random matrix theory to analyze how multi-head latent attention affects transformer capacity during training, revealing that sharing rotary embeddings across heads preserves spectral support and prevents capacity collapse.
Contribution
It introduces a spectral analysis framework for multi-head attention variants, showing that sharing rotary embeddings across heads maintains model capacity during training.
Findings
Decoupled rotary sharing prevents spectral collapse.
Capacity bottlenecks emerge early in training and are localized to specific layers.
Sharing rotary components mitigates spectral fragmentation.
Abstract
In this work, we study how multi-head latent attention (MLA), a popular strategy for compressing key/value memory, affects a transformer's internal capacity during pretraining. Using a lightweight suite of Marchenko-Pastur (MP) diagnostics, we analyze the spectrum of the Gram matrix throughout training, comparing three variants: the standard multi-head attention (MHA) baseline, MLA-PreRoPE with rotary applied before compression, and MLA-Decoupled, which shares a single rotary sub-vector across all heads. Our random matrix analysis reveals three key findings: (i) capacity bottlenecks emerge locally: both MHA and MLA-PreRoPE exhibit sharp, early spikes in specific layers that persist and propagate, disrupting the balance between bulk and outlier directions; (ii) these spikes coincide with rank collapse, concentrating the model's expressivity…
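To make the idea of a Marchenko-Pastur diagnostic concrete, here is a minimal sketch in Python with NumPy. It is not the authors' diagnostic suite, which the abstract does not spell out. The function names `mp_edges` and `spectral_diagnostics`, the choice to estimate the entry variance from the matrix itself, and the use of stable rank as a proxy for rank collapse are all assumptions made for illustration. The matrix `W` stands in for whichever weight matrix is being monitored.

```python
import numpy as np


def mp_edges(n_rows, n_cols, sigma2=1.0):
    """Marchenko-Pastur bulk edges for the eigenvalues of (1/n_rows) * X.T @ X,
    where X is n_rows x n_cols with i.i.d. zero-mean entries of variance sigma2."""
    q = n_cols / n_rows  # aspect ratio of the matrix
    lower = sigma2 * (1.0 - np.sqrt(q)) ** 2
    upper = sigma2 * (1.0 + np.sqrt(q)) ** 2
    return lower, upper


def spectral_diagnostics(W):
    """Hypothetical per-matrix diagnostics: eigenvalues of the Gram matrix,
    the MP upper edge, the number of outliers above it, and the stable rank."""
    n, d = W.shape
    sigma2 = W.var()  # assumption: entry variance is estimated from W itself
    gram = W.T @ W / n  # d x d Gram matrix, normalized by the number of rows
    eigvals = np.linalg.eigvalsh(gram)  # real eigenvalues in ascending order
    _, upper = mp_edges(n, d, sigma2)
    n_outliers = int(np.sum(eigvals > upper))  # directions escaping the bulk
    stable_rank = float(eigvals.sum() / eigvals.max())  # ||W||_F^2 / ||W||_2^2
    return {
        "eigvals": eigvals,
        "mp_upper_edge": float(upper),
        "n_outliers": n_outliers,
        "stable_rank": stable_rank,
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 2000, 200
    W_random = rng.standard_normal((n, d))  # noise-like matrix, no structure
    # Add one strong rank-1 direction to mimic a spike leaving the bulk.
    u = rng.standard_normal(n)
    v = rng.standard_normal(d) / np.sqrt(d)
    W_spiked = W_random + 5.0 * np.outer(u, v)
    for name, W in [("random", W_random), ("spiked", W_spiked)]:
        stats = spectral_diagnostics(W)
        print(name, stats["n_outliers"], round(stats["stable_rank"], 2))
```

In this toy setup, a noise-like matrix keeps nearly all eigenvalues inside the MP bulk and has a high stable rank. Adding a single strong direction produces an outlier above the upper edge and sharply lowers the stable rank, which is the kind of bulk-versus-outlier imbalance the abstract describes.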
Taxonomy
Topics: Neural Networks and Applications
