Dissecting Query-Key Interaction in Vision Transformers
Xu Pan, Aaron Philip, Ziqian Xie, Odelia Schwartz

TL;DR
This paper analyzes query-key interactions in vision transformers using the singular value decomposition of the query-key interaction matrix. It finds that attention shifts from similar tokens in early layers toward dissimilar tokens in late layers, and that many of the underlying feature interactions are interpretable, which sheds light on how these models use context.
Contribution
It introduces a novel analysis method for vision transformers' attention mechanisms, highlighting the semantic and interpretable nature of feature interactions across layers.
Findings
In many ViTs, especially those trained with classification objectives, early layers attend more to similar tokens, while late layers show increased attention to dissimilar tokens.
Interactions between features are often interpretable and semantically meaningful.
The analysis offers a new perspective on how transformers use context and salient features.
Abstract
Self-attention in vision transformers is often thought to perform perceptual grouping where tokens attend to other tokens with similar embeddings, which could correspond to semantically similar features of an object. However, attending to dissimilar tokens can be beneficial by providing contextual information. We propose to analyze the query-key interaction by the singular value decomposition of the interaction matrix (i.e. $\mathbf{W}_q^\top \mathbf{W}_k$). We find that in many ViTs, especially those with classification training objectives, early layers attend more to similar tokens, while late layers show increased attention to dissimilar tokens, providing evidence corresponding to perceptual grouping and contextualization, respectively. Many of these interactions between features represented by singular vectors are interpretable and semantic, such as attention between relevant…
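To make the idea in the abstract concrete, here is a minimal NumPy sketch of decomposing a query-key interaction matrix into singular modes. It is an illustration under stated assumptions, not the authors' code: tokens are column vectors with q = W_q x and k = W_k x, the softmax scaling and any bias terms are ignored, the weights are random stand-ins for a trained head, and the function names and the `weighted_alignment` statistic are hypothetical choices made for this example.

```python
import numpy as np


def qk_singular_modes(W_q, W_k):
    """SVD of the interaction matrix W_q^T W_k.

    Assumes W_q and W_k have shape (d_head, d_model), so that
    q = W_q @ x and k = W_k @ x for a token embedding x.
    """
    interaction = W_q.T @ W_k                # (d_model, d_model)
    U, S, Vt = np.linalg.svd(interaction)
    rank = min(W_q.shape[0], W_k.shape[0])   # rank is at most d_head
    return U[:, :rank], S[:rank], Vt[:rank].T


def attention_logit_from_modes(x_i, x_j, U, S, V):
    """Unscaled logit q_i . k_j written as a sum over singular modes."""
    return float(np.sum(S * (U.T @ x_i) * (V.T @ x_j)))


def weighted_alignment(U, S, V):
    """Hypothetical summary: singular-value-weighted cosine between
    paired left and right singular vectors. Values near 1 suggest the
    head favors tokens with similar embeddings; lower values suggest
    attention between dissimilar features."""
    cos = np.sum(U * V, axis=0)              # columns are unit norm
    return float(np.sum(S * cos) / np.sum(S))


# Random stand-in weights (assumption: a real analysis would load
# the projection weights of a trained ViT attention head instead).
rng = np.random.default_rng(0)
d_model, d_head = 16, 4
W_q = rng.standard_normal((d_head, d_model))
W_k = rng.standard_normal((d_head, d_model))
x_i = rng.standard_normal(d_model)
x_j = rng.standard_normal(d_model)

U, S, V = qk_singular_modes(W_q, W_k)
direct = float((W_q @ x_i) @ (W_k @ x_j))
via_modes = attention_logit_from_modes(x_i, x_j, U, S, V)
```

Here `direct` and `via_modes` agree up to floating-point error, which shows that each singular mode contributes one term to the attention logit: the query token's projection onto a left singular vector times the key token's projection onto the paired right singular vector, scaled by the singular value. When the paired vectors point in similar directions, that mode rewards tokens that share a feature; when they differ, it links one feature to a different one.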
Taxonomy
Topics: Advanced Memory and Neural Computing · CCD and CMOS Imaging Sensors
