Dissecting Query-Key Interaction in Vision Transformers

Xu Pan; Aaron Philip; Ziqian Xie; Odelia Schwartz

arXiv:2405.14880 · cs.CV · January 15, 2025 · 1 citation

TL;DR

This paper analyzes the query-key interactions in vision transformers using singular value decomposition, revealing how attention shifts from similar to dissimilar tokens across layers and providing insights into their interpretability and contextual understanding.

Contribution

The paper introduces a method for analyzing attention in vision transformers, based on the singular value decomposition of the query-key interaction matrix, and highlights the semantic, interpretable nature of feature interactions across layers.

Findings

01

Early layers focus on similar tokens, while later layers attend to dissimilar tokens.

02

Interactions between features are often interpretable and semantically meaningful.

03

The analysis offers a new perspective on how transformers use context and salient features.

Abstract

Self-attention in vision transformers is often thought to perform perceptual grouping where tokens attend to other tokens with similar embeddings, which could correspond to semantically similar features of an object. However, attending to dissimilar tokens can be beneficial by providing contextual information. We propose to analyze the query-key interaction by the singular value decomposition of the interaction matrix (i.e., $W_q^\top W_k$). We find that in many ViTs, especially those with classification training objectives, early layers attend more to similar tokens, while late layers show increased attention to dissimilar tokens, providing evidence corresponding to perceptual grouping and contextualization, respectively. Many of these interactions between features represented by singular vectors are interpretable and semantic, such as attention between relevant…
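The decomposition named in the abstract can be illustrated with a minimal NumPy sketch. It is not taken from the paper's repository: the weights are random stand-ins, the sizes `d_model` and `d_head` are hypothetical, and the bias terms and the usual scaling of attention scores are omitted. With column-vector tokens, the pre-softmax score between a query token $x_i$ and a key token $x_j$ is $x_i^\top W_q^\top W_k x_j$, so writing $W_q^\top W_k = U S V^\top$ splits the score into modes $\sum_n s_n (x_i \cdot u_n)(x_j \cdot v_n)$. The final `weighted_cos` value is only one plausible summary of whether a head pairs similar directions ($u_n \approx v_n$) or dissimilar ones, and is an assumption here rather than the authors' stated metric.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 16, 4  # hypothetical sizes, chosen only for illustration

# Random stand-ins for one head's query and key projections (assumption:
# each maps a d_model-dimensional token to a d_head-dimensional vector).
W_q = rng.standard_normal((d_head, d_model))
W_k = rng.standard_normal((d_head, d_model))

# Query-key interaction matrix W_q^T W_k: shape (d_model, d_model), rank <= d_head.
M = W_q.T @ W_k
U, S, Vt = np.linalg.svd(M)

# Only the first d_head singular modes are non-trivial, so keep those.
U_r, S_r, V_r = U[:, :d_head], S[:d_head], Vt[:d_head].T

# Two arbitrary tokens: x_i acts as the query, x_j as the key.
x_i = rng.standard_normal(d_model)
x_j = rng.standard_normal(d_model)

# Score computed the usual way, and as a sum over singular modes.
score_direct = (W_q @ x_i) @ (W_k @ x_j)
score_modes = np.sum(S_r * (x_i @ U_r) * (x_j @ V_r))

# Cosine between each left/right singular-vector pair (both are unit norm).
# Positive values mean the mode links similar directions, negative the opposite.
cos = np.sum(U_r * V_r, axis=0)
weighted_cos = np.sum(S_r * cos) / np.sum(S_r)  # illustrative summary only

print("scores agree:", np.isclose(score_direct, score_modes))
print("singular-value-weighted cosine:", weighted_cos)
```

For a trained model, `W_q` and `W_k` would be replaced by the per-head projection weights of a given layer, and the same quantities could then be compared across layers.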

Code & Models

Repositories

schwartz-cnl/DissectingViT
PyTorch · Official

Videos

Dissecting Query-Key Interaction in Vision Transformers · SlidesLive

Taxonomy

Topics

Advanced Memory and Neural Computing · CCD and CMOS Imaging Sensors