Dimensional Collapse in Transformer Attention Outputs: A Challenge for Sparse Dictionary Learning
Junxuan Wang, Xuyang Ge, Wentao Shu, Zhengfu He, Xipeng Qiu

TL;DR
This paper reveals that transformer attention outputs are confined to a low-dimensional subspace, which impacts sparse dictionary learning, and proposes a subspace-constrained training method to mitigate dead features.
Contribution
The study uncovers the low-rank structure of attention outputs and introduces a subspace-constrained training approach to improve sparse autoencoders.
Findings
Attention outputs have an effective dimensionality of only about 60% of the full space.
This low-rank structure is a key factor in the dead feature problem in sparse dictionary learning.
Subspace-constrained training reduces dead features from 87% to below 1%.
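One way to make "effective dimensionality" concrete is sketched below. This summary does not state which metric the paper uses, so the variance-threshold definition here (the fraction of principal components needed to explain 99% of the variance) is an assumption, as are the function name `effective_dim_fraction` and the synthetic data standing in for real activations.

```python
import numpy as np

def effective_dim_fraction(acts, var_threshold=0.99):
    """Fraction of dimensions needed to explain `var_threshold` of the variance.

    Assumed metric for illustration; the paper may define effective rank differently.
    acts: array of shape (n_tokens, d_model).
    """
    centered = acts - acts.mean(axis=0, keepdims=True)
    # Squared singular values of centered data are proportional to PCA variances.
    s = np.linalg.svd(centered, compute_uv=False)
    var = s ** 2
    cum = np.cumsum(var) / var.sum()
    # Smallest k such that the top-k components reach the threshold.
    k = int(np.searchsorted(cum, var_threshold) + 1)
    return min(k, acts.shape[1]) / acts.shape[1]

rng = np.random.default_rng(0)
d_model, n_tokens, true_rank = 64, 2000, 16

# Synthetic stand-in for attention outputs: confined to a 16-dim subspace.
low_rank = rng.normal(size=(n_tokens, true_rank)) @ rng.normal(size=(true_rank, d_model))
# Synthetic stand-in for a near-full-rank stream (e.g. MLP outputs).
full_rank = rng.normal(size=(n_tokens, d_model))

frac_low = effective_dim_fraction(low_rank)
frac_full = effective_dim_fraction(full_rank)
print(f"low-rank data:  {frac_low:.2f} of d_model")
print(f"full-rank data: {frac_full:.2f} of d_model")
```

On real models, `acts` would be attention outputs collected over many tokens; the synthetic arrays only show that the measure separates low-rank from near-full-rank data.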
Abstract
Transformer architectures, and their attention mechanisms in particular, form the foundation of modern large language models. While transformer models are widely believed to operate in high-dimensional hidden spaces, we show that attention outputs are in fact confined to a surprisingly low-dimensional subspace, with an effective dimensionality of only about 60% of the full space. In contrast, MLP outputs and residual streams remain much closer to full rank. This striking dimensional discrepancy is consistently observed across diverse model families and datasets, and is strongly shaped by the attention output projection matrix. Critically, we identify this low-rank structure as a key factor in the prevalent dead feature problem in sparse dictionary learning, where it creates a mismatch between randomly initialized features and the intrinsic…
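The mismatch between randomly initialized features and a low-dimensional activation subspace can be illustrated with a small sketch. The abstract is truncated here and does not spell out the training procedure, so this is not the authors' method: it only shows one plausible ingredient, initializing dictionary features inside the top principal subspace of the activations. The function name `init_features_in_subspace` and the parameter `subspace_dim` are hypothetical.

```python
import numpy as np

def init_features_in_subspace(acts, n_features, subspace_dim, seed=0):
    """Draw random unit-norm features restricted to the top PCA subspace of `acts`.

    Illustrative assumption only, not the paper's confirmed procedure.
    """
    rng = np.random.default_rng(seed)
    centered = acts - acts.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:subspace_dim]                         # (subspace_dim, d_model)
    coeffs = rng.normal(size=(n_features, subspace_dim))
    features = coeffs @ basis                         # each row lies in span(basis)
    features /= np.linalg.norm(features, axis=1, keepdims=True)
    return features, basis

rng = np.random.default_rng(1)
# Synthetic activations confined to an 8-dim subspace of a 32-dim space.
acts = rng.normal(size=(500, 8)) @ rng.normal(size=(8, 32))
features, basis = init_features_in_subspace(acts, n_features=128, subspace_dim=8)

# Component of each feature lying outside the active subspace (should be ~0).
residual = features - features @ basis.T @ basis
print("max out-of-subspace component:", np.abs(residual).max())
```

A feature drawn uniformly in the full space would, by contrast, place part of its norm in directions the activations never occupy, which is the kind of mismatch the abstract points to.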
