Demystify Self-Attention in Vision Transformers from a Semantic Perspective: Analysis and Application

Leijie Wu; Song Guo; Yaohong Ding; Junxiao Wang; Wenchao Xu; Richard Yida Xu; Jie Zhang

arXiv:2211.08543·cs.CV·November 17, 2022

TL;DR

This paper introduces a novel method using SIFT features to interpret and analyze the self-attention mechanisms in Vision Transformers, addressing the challenge of understanding how MSA works in visual data.

Contribution

It proposes a scale-invariant feature transform-based analysis to interpret MSA in ViT, enabling applications like spurious correlation detection and pre-training acceleration.

Findings

01. Effective interpretation of MSA in ViT using SIFT keypoints
02. Improved detection of spurious correlations during inference
03. Accelerated model pre-training with guided analysis

Abstract

Self-attention mechanisms, especially multi-head self-attention (MSA), have achieved great success in many fields such as computer vision and natural language processing. However, many existing vision transformer (ViT) works simply inherit transformer designs from NLP and adapt them to vision tasks, while ignoring the fundamental difference between how MSA works in image and language settings. Language naturally contains highly semantic structures that are directly interpretable by humans. Its basic unit (the word) is discrete and carries no redundant information, which readily supports interpretability studies on the MSA mechanisms of language transformers. In contrast, visual data exhibits a fundamentally different structure: its basic unit (the pixel) is a natural low-level representation with significant redundancy in its neighbourhood, which poses obvious challenges to the interpretability of the MSA mechanism…
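
For readers unfamiliar with the mechanism the abstract refers to, the following is a minimal sketch of single-head scaled dot-product self-attention over a few toy patch embeddings. It is not the paper's method: the function names and the toy `patches` values are hypothetical, the query/key/value projections are assumed to be the identity for brevity, and a real ViT uses learned projections and several heads. The sketch only shows the object that an interpretability analysis would inspect, namely the row-normalized attention weights between tokens.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity Q/K/V projections, so queries, keys, and
    values are all the raw token embeddings.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    # One row of attention weights per query token.
    weights = []
    for q in tokens:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in tokens]
        weights.append(softmax(scores))
    # Each output is the attention-weighted average of the value vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, tokens)) for j in range(d)]
        for row in weights
    ]
    return weights, outputs


# Hypothetical 2-D embeddings for three image patches; the first two are similar.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
weights, outputs = self_attention(patches)
```

In this toy setting the first patch assigns more weight to the similar second patch than to the dissimilar third one, which is the kind of token-to-token relationship that attention analysis examines.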

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Advanced Graph Neural Networks

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Layer Normalization · Residual Connection · Vision Transformer