How Does Attention Help? Insights from Random Matrices on Signal Recovery from Sequence Models
Mohamed El Amine Seddik

TL;DR
Using random matrix theory, this paper analyzes how attention pooling affects signal recovery in sequence models, revealing phase transitions and the attention weights that best improve signal detection.
Contribution
The paper provides exact spectral characterizations of sample covariance matrices built from pooled sequence representations, and identifies the attention weights that maximize the signal-to-noise ratio.
Findings
The bulk spectrum follows a non-Marchenko–Pastur law because of the finite vocabulary structure.
Signal recovery exhibits two successive BBP-type (Baik–Ben Arous–Péché) phase transitions.
The top eigenvector of R yields the optimal attention weights for signal enhancement.
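The last finding is consistent with a standard Rayleigh-quotient fact: among unit-norm vectors a, the quadratic form a^T R a of a symmetric matrix R is maximized by the top eigenvector of R. The sketch below illustrates only that fact. It assumes, without confirmation from this summary, that R is a symmetric positive semidefinite positional correlation matrix and that the quantity being maximized is a^T R a; the AR(1)-style matrix, the function name `top_eigvec_weights`, and all parameter values are hypothetical.

```python
import numpy as np

def top_eigvec_weights(R):
    """Return the unit-norm vector a maximizing a^T R a for symmetric R,
    together with the maximal value (the largest eigenvalue of R)."""
    eigvals, eigvecs = np.linalg.eigh(R)  # eigenvalues in ascending order
    a = eigvecs[:, -1]                    # eigenvector of the largest eigenvalue
    if a.sum() < 0:                       # eigenvectors are sign-ambiguous; pick one sign
        a = -a
    return a, eigvals[-1]

# Hypothetical positional correlation: R[s, t] = rho^|s - t| (an assumption).
T, rho = 8, 0.6
idx = np.arange(T)
R = rho ** np.abs(idx[:, None] - idx[None, :])

a_opt, lam_max = top_eigvec_weights(R)
a_unif = np.ones(T) / np.sqrt(T)          # uniform (mean-pooling) weights, unit norm

q_opt = a_opt @ R @ a_opt                 # quadratic form at the top eigenvector
q_unif = a_unif @ R @ a_unif              # quadratic form at uniform weights
print(f"top-eigenvector weights: {q_opt:.4f}, uniform weights: {q_unif:.4f}")
```

Comparing `q_opt` with `q_unif` shows how much the quadratic form gains over uniform pooling under this toy correlation structure; whether that gain matches the paper's signal-to-noise criterion depends on definitions not given here.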
Abstract
We study the spectral properties of sample covariance matrices constructed from pooled sequence representations, where token embeddings are drawn from a fixed two-class Gaussian mixture table and pooled via (fixed) attention weights. Working in the high-dimensional regime, we derive exact characterizations of the limiting eigenvalue distribution, outlier eigenvalues, and eigenvector alignment with the hidden signal. The bulk spectrum follows a non-Marchenko–Pastur law given by a free multiplicative convolution, reflecting the finite vocabulary structure. Signal recovery undergoes two successive BBP-type phase transitions characterized by two scalars defined in terms of the attention pooling weights and the positional correlation…
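To make the setup concrete, here is a minimal simulation sketch of a pooled-sequence sample covariance matrix. It is not the paper's exact model: the uniform i.i.d. token sampling (which ignores positional correlation), the uncentered covariance, the choice of signal direction, and every name and parameter value (`pooled_covariance_spectrum`, `n`, `p`, `V`, `T`, `signal`) are assumptions made for illustration.

```python
import numpy as np

def pooled_covariance_spectrum(n=400, p=200, V=300, T=8, signal=3.0, seed=0):
    """Simulate pooled sequence representations and return the eigenvalues of
    their (uncentered) sample covariance and the top-eigenvector alignment."""
    rng = np.random.default_rng(seed)

    # Hidden signal direction mu (assumed: along the first coordinate).
    mu = np.zeros(p)
    mu[0] = signal

    # Fixed embedding table: each vocabulary entry is +/- mu plus Gaussian noise.
    labels = rng.choice([-1.0, 1.0], size=V)
    table = labels[:, None] * mu[None, :] + rng.standard_normal((V, p))

    # Fixed attention weights (assumed uniform, unit norm).
    a = np.ones(T) / np.sqrt(T)

    # Each sequence is T tokens drawn uniformly from the vocabulary (assumption).
    tokens = rng.integers(0, V, size=(n, T))

    # Pool: x_i = sum_t a_t * table[tokens[i, t]]
    X = np.einsum("t,ntp->np", a, table[tokens])

    # Uncentered sample covariance (p x p) and its spectrum.
    S = X.T @ X / n
    eigvals, eigvecs = np.linalg.eigh(S)  # ascending order

    # Alignment of the top eigenvector with the signal direction, in [0, 1].
    alignment = abs(eigvecs[:, -1] @ mu) / np.linalg.norm(mu)
    return eigvals, alignment

eigvals, alignment = pooled_covariance_spectrum()
print(f"largest eigenvalue: {eigvals[-1]:.3f}, alignment with mu: {alignment:.3f}")
```

Varying `signal` or the weights `a` in this toy setup and watching the largest eigenvalue and the alignment gives a rough feel for the kind of outlier behavior the abstract describes, though the thresholds themselves come from the paper's analysis rather than from this sketch.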
