Decomposing Query-Key Feature Interactions Using Contrastive Covariances
Andrew Lee, Yonatan Belinkov, Fernanda Viégas, Martin Wattenberg

TL;DR
This paper introduces a contrastive covariance method to decompose the query-key space in Transformers, enabling interpretability of attention mechanisms by identifying human-understandable feature interactions.
Contribution
The paper proposes a novel contrastive covariance approach to analyze and interpret query-key interactions in large language models, revealing low-rank, human-interpretable components.
Findings
Identified interpretable query-key subspaces for semantic features
Demonstrated attribution of attention scores to specific features
Validated the method analytically and empirically in simplified and large models
Abstract
Despite the central role of attention heads in Transformers, we lack tools to understand why a model attends to a particular token. To address this, we study the query-key (QK) space -- the bilinear joint embedding space between queries and keys. We present a contrastive covariance method to decompose the QK space into low-rank, human-interpretable components. It is when features in keys and queries align in these low-rank subspaces that high attention scores are produced. We first study our method both analytically and empirically in a simplified setting. We then apply our method to large language models to identify human-interpretable QK subspaces for categorical semantic features and binding features. Finally, we demonstrate how attention scores can be attributed to our identified features.
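The abstract does not spell out the estimator, so the following is only a minimal sketch of how a contrastive covariance could isolate a low-rank query-key subspace, under assumptions that are not from the paper: synthetic Gaussian data, a target feature of rank `r` that is matched between query and key only in the "positive" condition, a nuisance feature that is matched in both conditions, and an SVD of the difference of the two cross-covariances. All names (`B_feat`, `A_feat`, `sample`, `cross_cov`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 16, 3, 20000  # head dimension, assumed feature rank, samples per condition


def orthonormal(rows, cols):
    """Random matrix with orthonormal columns (reduced QR of a Gaussian matrix)."""
    q, _ = np.linalg.qr(rng.standard_normal((rows, cols)))
    return q


# Hypothetical embeddings: how the target feature and a nuisance feature
# are written into queries (B_*) and keys (A_*).
B_feat, A_feat = orthonormal(d, r), orthonormal(d, r)
B_nuis, A_nuis = orthonormal(d, 2), orthonormal(d, 2)


def sample(match_feature):
    """Draw paired queries and keys; the nuisance feature always matches,
    the target feature matches only when match_feature is True."""
    z_q = rng.standard_normal((n, r))
    z_k = z_q if match_feature else rng.standard_normal((n, r))
    w = rng.standard_normal((n, 2))
    Q = z_q @ B_feat.T + w @ B_nuis.T + 0.1 * rng.standard_normal((n, d))
    K = z_k @ A_feat.T + w @ A_nuis.T + 0.1 * rng.standard_normal((n, d))
    return Q, K


def cross_cov(Q, K):
    """Sample cross-covariance between queries (rows) and keys (columns)."""
    Qc, Kc = Q - Q.mean(axis=0), K - K.mean(axis=0)
    return Qc.T @ Kc / (len(Q) - 1)


# Contrastive covariance: the nuisance term appears in both conditions and
# approximately cancels, leaving the target feature's query-key interaction.
delta = cross_cov(*sample(True)) - cross_cov(*sample(False))
U, S, Vt = np.linalg.svd(delta)

# Fraction of the true query-side feature subspace captured by the top-r
# left singular vectors (1.0 would be perfect recovery).
overlap = np.linalg.norm(B_feat.T @ U[:, :r]) ** 2 / r
print("singular values:", np.round(S[: r + 2], 3))
print("query-side subspace overlap:", round(float(overlap), 3))
```

In this toy setting the left singular vectors of `delta` describe query-side directions and the right singular vectors describe key-side directions; a sharp drop after the `r`-th singular value is what "low-rank" would look like here. How the paper constructs its contrastive conditions for real language models, and how it attributes attention scores to the recovered features, is not specified in this summary.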
Taxonomy
Topics: Information Retrieval and Search Behavior · Advanced Graph Neural Networks · Topic Modeling
