Compressible Softmax-Attended Language under Incompressible Attention

Wonsuk Lee

arXiv:2604.04384·cs.CL·April 9, 2026

TL;DR

This paper analyzes the spectra of softmax attention in transformer language models. It shows that the attention logits produced on real text concentrate most of their variance in a few singular components, while the learned interaction matrix does not, so the compressibility comes from the data.

Contribution

It decomposes the attention logit field into a learned part and a generated part and measures their spectra separately across five models, showing that the compressibility is driven by the data rather than by the learned weights.

Findings

01 90% of the logit variance is captured by 2-11 singular components

02 The learned interaction matrix needs 38-75 components to reach the same 90% threshold

03 The gap between the two spectra is 5-25× in effective rank

Abstract

Softmax attention defines an interaction through $d_{h}$ head dimensions, but not all dimensions carry equal weight once real text passes through. We decompose the attention logit field into a learned component and a generated component and measure their spectra separately. For all 5,888 KV heads in five transformer language models (124M--7B parameters, four architecture families), the logit energy field $\tilde{E}$ reaches 90\% of its variance in 2--11 singular components. The learned interaction matrix $W_{Q}^{T} W_{K}$ needs 38--75 components for the same threshold out of $d_{h} \in \{64, 128\}$. The spectral gap is 5--25$\times$ in effective rank. The compressibility of softmax-attended language is a property of the data, not the frame that analyzes it.
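The kind of measurement the abstract describes can be illustrated with a minimal sketch: count how many singular components are needed to reach 90% of the squared singular value mass, once for a learned matrix $W_{Q}^{T} W_{K}$ and once for the logits it generates on input. Everything here is an assumption made for illustration and not the paper's procedure: the weights are random, the "text" is synthetic low-rank data, the helper name `components_for_variance` is hypothetical, variance is taken as the share of squared singular values of the raw (uncentered) matrix, and the convention $q = W_{Q} x$ is assumed.

```python
import numpy as np


def components_for_variance(matrix, threshold=0.90):
    """Smallest number of singular components whose squared singular
    values sum to at least `threshold` of the total."""
    s = np.linalg.svd(matrix, compute_uv=False)  # returned in descending order
    energy = s ** 2
    cumulative = np.cumsum(energy) / energy.sum()
    # First index where the cumulative share reaches the threshold.
    return int(np.searchsorted(cumulative, threshold)) + 1


# Synthetic stand-ins (assumptions, not the paper's models or data).
rng = np.random.default_rng(0)
d_model, d_h, n_tokens, data_rank = 32, 16, 40, 3

W_Q = rng.standard_normal((d_h, d_model))  # hypothetical query projection
W_K = rng.standard_normal((d_h, d_model))  # hypothetical key projection

# Token representations confined to a low-dimensional subspace,
# standing in for structured input.
X = rng.standard_normal((n_tokens, data_rank)) @ rng.standard_normal(
    (data_rank, d_model)
)

learned = W_Q.T @ W_K       # (d_model, d_model), rank at most d_h
# Pre-softmax scores q_i . k_j; the 1/sqrt(d_h) scale is omitted
# because it does not change the shape of the spectrum.
logits = X @ learned @ X.T  # (n_tokens, n_tokens)

k_learned = components_for_variance(learned)
k_logits = components_for_variance(logits)
print("learned matrix components:", k_learned)
print("generated logits components:", k_logits)
```

In this toy setup the logits cannot need more components than the rank of the input, whatever the learned matrix looks like, which is the sense in which compressibility can originate in the data. The numbers it prints say nothing about the 2--11 and 38--75 ranges reported for real models.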
