Demystify Self-Attention in Vision Transformers from a Semantic Perspective: Analysis and Application
Leijie Wu, Song Guo, Yaohong Ding, Junxiao Wang, Wenchao Xu, Richard Yida Xu, Jie Zhang

TL;DR
This paper introduces a novel method using SIFT features to interpret and analyze the self-attention mechanisms in Vision Transformers, addressing the challenge of understanding how MSA works in visual data.
Contribution
It proposes a scale-invariant feature transform-based analysis to interpret MSA in ViT, enabling applications like spurious correlation detection and pre-training acceleration.
Findings
Effective interpretation of MSA in ViT using SIFT keypoints
Improved detection of spurious correlations during inference
Accelerated model pre-training with guided analysis
Abstract
Self-attention mechanisms, especially multi-head self-attention (MSA), have achieved great success in many fields such as computer vision and natural language processing. However, many existing vision transformer (ViT) works simply inherit transformer designs from NLP to adapt to vision tasks, while ignoring the fundamental difference between "how MSA works in image and language settings". Language naturally contains highly semantic structures that are directly interpretable by humans. Its basic unit (the word) is discrete and carries no redundant information, which readily supports interpretability studies of the MSA mechanism in language transformers. In contrast, visual data exhibits a fundamentally different structure: its basic unit (the pixel) is a natural low-level representation with significant redundancy in its neighbourhood, which poses obvious challenges to the interpretability of the MSA mechanism…
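For readers unfamiliar with the mechanism the abstract refers to, the following is a minimal sketch of standard multi-head self-attention applied to a sequence of image-patch tokens. It is not the paper's SIFT-based analysis; it only illustrates the attention maps such an analysis would inspect. The function name, the random weights, and the token and head counts are all assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(tokens, w_q, w_k, w_v, w_o, num_heads):
    """Scaled dot-product MSA over `tokens` of shape (n_tokens, dim).

    Returns the output tokens and the per-head attention maps.
    All weight matrices are hypothetical (dim, dim) projections.
    """
    n, d = tokens.shape
    d_head = d // num_heads

    def split(x):
        # (n, dim) -> (heads, n, d_head)
        return x.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(tokens @ w_q), split(tokens @ w_k), split(tokens @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, n, n)
    attn = softmax(scores, axis=-1)                      # each row sums to 1
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)    # merge heads
    return out @ w_o, attn

# Illustrative setup: 16 patch tokens (e.g. a 4x4 grid), 32-dim embeddings, 4 heads.
rng = np.random.default_rng(0)
n_tokens, dim, heads = 16, 32, 4
tokens = rng.normal(size=(n_tokens, dim))
w_q, w_k, w_v, w_o = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(4))
out, attn = multi_head_self_attention(tokens, w_q, w_k, w_v, w_o, heads)
```

Each slice `attn[h]` is an `n_tokens × n_tokens` map recording how strongly every patch attends to every other patch in head `h`. Because neighbouring patches often carry redundant low-level content, these maps are harder to read semantically than word-to-word attention in language models, which is the gap the abstract describes.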
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Advanced Graph Neural Networks
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Layer Normalization · Residual Connection · Vision Transformer
