A Random Matrix Theory Perspective on the Learning Dynamics of Multi-head Latent Attention

Nandan Kumar Jha; Brandon Reagen

arXiv:2507.09394·cs.LG·July 15, 2025

Open Access

TL;DR

This paper uses random matrix theory to analyze how multi-head latent attention affects transformer capacity during training, revealing that sharing rotary embeddings across heads preserves spectral support and prevents capacity collapse.

Contribution

It introduces a spectral analysis framework for multi-head attention variants, showing that sharing rotary embeddings across heads maintains model capacity during training.

Findings

01. Decoupled rotary sharing prevents spectral collapse.

02. Capacity bottlenecks occur early and locally in training.

03. Sharing rotary components mitigates spectral fragmentation.

Abstract

In this work, we study how multi-head latent attention (MLA), a popular strategy for compressing key/value memory, affects a transformer's internal capacity during pretraining. Using a lightweight suite of Marchenko-Pastur (MP) diagnostics, we analyze the spectrum of the $W_{Q} W_{K}^{\top}$ Gram matrix throughout training, comparing three variants: the standard multi-head attention (MHA) baseline, MLA-PreRoPE with rotary applied before compression, and MLA-Decoupled, which shares a single rotary sub-vector across all heads. Our random matrix analysis reveals three key findings: i) capacity bottlenecks emerge locally: both MHA and MLA-PreRoPE exhibit sharp, early spikes in specific layers that persist and propagate, disrupting the balance between bulk and outlier directions; ii) these spikes coincide with rank collapse, concentrating the model's expressivity…
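To make the idea of an MP-style check concrete, the following is a minimal NumPy sketch of a bulk-versus-outlier diagnostic. It is not the authors' diagnostic suite. The function name `mp_diagnostics`, the `tol` safety factor, the choice to estimate the noise variance from the raw matrix entries, and the synthetic spiked-noise test matrix are all assumptions made for illustration. A product such as $W_{Q} W_{K}^{\top}$ is not itself an i.i.d. matrix, so in practice the null model and variance estimate would need more care than shown here.

```python
import numpy as np

def mp_diagnostics(M, tol=1.1):
    """Compare the spectrum of M M^T / n with a Marchenko-Pastur (MP) bulk.

    Hypothetical helper: M is any 2-D array, oriented so that p <= n.
    """
    M = np.asarray(M, dtype=float)
    p, n = M.shape
    if p > n:                     # work with the smaller Gram matrix
        M = M.T
        p, n = n, p
    q = p / n                     # aspect ratio, 0 < q <= 1
    sigma2 = M.var()              # assumption: entry variance as noise scale

    # Eigenvalues of the normalized Gram matrix, in ascending order.
    eigs = np.linalg.eigvalsh(M @ M.T / n)

    # MP bulk edges for i.i.d. entries with variance sigma2.
    lower_edge = sigma2 * (1.0 - np.sqrt(q)) ** 2
    upper_edge = sigma2 * (1.0 + np.sqrt(q)) ** 2

    # Eigenvalues clearly above the bulk; tol absorbs finite-size fluctuations.
    n_outliers = int(np.sum(eigs > tol * upper_edge))

    # Stable rank: Frobenius norm squared over spectral norm squared.
    stable_rank = float(eigs.sum() / eigs.max())

    return {
        "q": q,
        "lower_edge": float(lower_edge),
        "upper_edge": float(upper_edge),
        "top_eig": float(eigs[-1]),
        "n_outliers": n_outliers,
        "stable_rank": stable_rank,
    }

# Synthetic check: pure Gaussian noise versus noise plus one planted direction.
rng = np.random.default_rng(0)
p, n = 200, 800
noise = rng.standard_normal((p, n))

u = rng.standard_normal(p)
u /= np.linalg.norm(u)
v = rng.standard_normal(n)
v /= np.linalg.norm(v)
spiked = noise + 3.0 * np.sqrt(n) * np.outer(u, v)   # rank-one spike

noise_stats = mp_diagnostics(noise)
spiked_stats = mp_diagnostics(spiked)
```

In this toy setting the pure-noise matrix should show no eigenvalues beyond the MP upper edge, while the planted direction shows up as a single outlier and pulls the stable rank down. Tracking quantities of this kind per layer over training steps is one plausible way to observe the spikes and rank collapse the abstract describes.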

Taxonomy

Topics: Neural Networks and Applications