Singular Vectors of Attention Heads Align with Features
Gabriel Franco, Carson Loughridge, Mark Crovella

TL;DR
This paper investigates when and why singular vectors of attention matrices in language models align with feature representations, providing theoretical justification and empirical evidence for their use in interpretability.
Contribution
It offers a theoretical framework and empirical validation for using singular vectors of attention matrices to identify features in language models.
Findings
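Singular vectors robustly align with features in a model where features can be directly observed.
Theory shows that such alignment is expected under a range of conditions.
Sparse attention decomposition is a testable prediction of alignment in real models, where feature representations are not directly observable.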
Abstract
Identifying feature representations in language models is a central task in mechanistic interpretability. Several recent studies have made an implicit assumption that feature representations can be inferred in some cases from singular vectors of attention matrices. However, sound justification for this assumption is lacking. In this paper we address that question, asking: why and when do singular vectors align with features? First, we demonstrate that singular vectors robustly align with features in a model where features can be directly observed. We then show theoretically that such alignment is expected under a range of conditions. We close by asking how, operationally, alignment may be recognized in real models where feature representations are not directly observable. We identify sparse attention decomposition as a testable prediction of alignment, and show evidence that it emerges…
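To make the notion of alignment concrete, the following is a minimal toy sketch, not the authors' method or experimental setup. It assumes a single bilinear matrix `omega` standing in for an attention head's query-key interaction, built from hypothetical orthonormal feature directions `Fq` and `Fk` plus small noise. All names, sizes, and the noise level are illustrative assumptions. The sketch checks two things: that the top singular vectors of `omega` point along the planted features, and that the score for a feature-aligned token pair decomposes into per-singular-vector terms dominated by one term, which is the flavor of sparsity the abstract refers to.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 4  # hypothetical embedding size and number of planted features

# Hypothetical orthonormal feature directions for the query and key sides.
Fq, _ = np.linalg.qr(rng.standard_normal((d, k)))
Fk, _ = np.linalg.qr(rng.standard_normal((d, k)))
strengths = np.array([4.0, 3.0, 2.0, 1.0])  # distinct interaction strengths

# Toy stand-in for an attention matrix: planted feature pairs plus small noise.
omega = Fq @ np.diag(strengths) @ Fk.T + 0.01 * rng.standard_normal((d, d))

# Full SVD: omega = U @ diag(S) @ Vt.
U, S, Vt = np.linalg.svd(omega)

# Alignment: best absolute cosine between each top singular vector and any feature.
left_alignment = np.abs(U[:, :k].T @ Fq).max(axis=1)
right_alignment = np.abs(Vt[:k] @ Fk).max(axis=1)

# Decompose the score x^T omega y into one term per singular vector pair.
x, y = Fq[:, 0], Fk[:, 0]  # a token pair carrying the strongest feature pair
terms = S * (U.T @ x) * (Vt @ y)
score = terms.sum()
dominant_fraction = np.abs(terms).max() / np.abs(terms).sum()

print("left alignment: ", np.round(left_alignment, 3))
print("right alignment:", np.round(right_alignment, 3))
print("share of score from largest term:", round(float(dominant_fraction), 3))
```

In this construction alignment holds almost by design, since the planted features are orthonormal and the strengths are distinct. In a real model the features are not directly observable, so only the sparsity of `terms` can be checked, not the alignment itself.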
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Topic Modeling
