Compressible Softmax-Attended Language under Incompressible Attention
Wonsuk Lee

TL;DR
This paper analyzes the spectral properties of softmax attention in transformer language models, showing that most of the interaction energy is captured by a small number of components, indicating high compressibility.
Contribution
It decomposes the attention logit field into a learned part and a generated part and measures their spectra separately across five transformer language models, showing that compressibility is a property of the data.
Findings
90% of logit variance captured by 2-11 singular components
Learned interaction matrix requires 38-75 components for similar variance
Spectral gap of 5-25 in effective rank
Abstract
Softmax attention defines an interaction through head dimensions, but not all dimensions carry equal weight once real text passes through. We decompose the attention logit field into a learned component and a generated component and measure their spectra separately. For all 5,888 KV heads in five transformer language models (124M-7B parameters, four architecture families), the logit energy field reaches 90% of its variance in 2-11 singular components. The learned interaction matrix needs 38-75 components for the same threshold out of . The spectral gap is 5-25 in effective rank. The compressibility of softmax-attended language is a property of the data, not the frame that analyzes it.
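The 90% threshold measurement described in the abstract can be illustrated with a minimal NumPy sketch. Everything here is an assumption for illustration and not the paper's code: the function names, the synthetic data, the choice of W_Q W_K^T as a stand-in for the learned interaction matrix, and the use of squared singular values (without mean-centering) as "variance". The paper's exact definitions may differ.

```python
import numpy as np


def components_for_variance(matrix, threshold=0.90):
    """Smallest k whose top-k singular values hold `threshold` of the energy.

    Energy is taken as the sum of squared singular values (an assumption;
    no mean-centering is applied).
    """
    s = np.linalg.svd(matrix, compute_uv=False)  # sorted in descending order
    energy = s ** 2
    cumulative = np.cumsum(energy) / energy.sum()
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(s))


def synthetic_demo(seed=0, n_tokens=256, d_model=64, d_head=16, data_rank=4):
    """Compare a stand-in learned matrix with the logits it generates."""
    rng = np.random.default_rng(seed)

    # Hypothetical stand-in for learned projections (random, not trained).
    w_q = rng.standard_normal((d_model, d_head))
    w_k = rng.standard_normal((d_model, d_head))
    learned = w_q @ w_k.T  # d_model x d_model, rank at most d_head

    # Synthetic "text": hidden states confined to a low-dimensional subspace.
    basis = rng.standard_normal((data_rank, d_model))
    hidden = rng.standard_normal((n_tokens, data_rank)) @ basis

    # Pre-softmax logits for one head, with the usual 1/sqrt(d_head) scaling.
    logits = hidden @ learned @ hidden.T / np.sqrt(d_head)

    return components_for_variance(logits), components_for_variance(learned)


if __name__ == "__main__":
    k_logits, k_learned = synthetic_demo()
    print("components for 90% energy: logits =", k_logits, ", learned =", k_learned)
```

In this toy setup the logit matrix has rank at most data_rank, so it can never need more than data_rank components, while the random stand-in for the learned matrix spreads its energy over up to d_head directions. This mirrors only the direction of the reported gap, not its size on real models.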
