How Does Attention Work in Vision Transformers? A Visual Analytics Attempt
Yiran Li, Junpeng Wang, Xin Dai, Liang Wang, Chin-Chia Michael Yeh, Yan Zheng, Wei Zhang, Kwan-Liu Ma

TL;DR
This paper employs visual analytics to interpret vision transformers by analyzing head importance, spatial attention distribution, and learned patterns, thereby deepening understanding of their inner workings.
Contribution
It introduces a comprehensive visual analytics framework to interpret ViT attention mechanisms, including head importance metrics, spatial attention profiling, and pattern summarization.
Findings
Identifies important attention heads using pruning metrics
Profiles spatial attention distributions within heads
Summarizes learned attention patterns with autoencoders
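As an illustration of what profiling spatial attention within a head could look like, the sketch below measures how much attention each patch sends to patches at increasing grid distance. It is not the paper's metric: the function name `neighbor_attention_profile`, the use of Chebyshev distance on the patch grid, and the assumption that the class token has already been dropped are all choices made for this example.

```python
import numpy as np

def neighbor_attention_profile(attn, grid_size):
    """Mean attention mass sent to patches at each grid distance.

    attn: (N, N) attention matrix for one head with rows summing to 1,
          where N = grid_size**2 (patch tokens only; hypothetical input).
    Returns profile of length grid_size; profile[d] is the attention mass
    sent to patches exactly d grid steps away, averaged over query patches.
    """
    n = grid_size * grid_size
    assert attn.shape == (n, n)
    # Row/column position of every patch on the 2D grid.
    rows, cols = np.divmod(np.arange(n), grid_size)
    # Chebyshev distance between every pair of patches.
    dist = np.maximum(np.abs(rows[:, None] - rows[None, :]),
                      np.abs(cols[:, None] - cols[None, :]))
    profile = np.zeros(grid_size)
    for d in range(grid_size):
        # Sum attention over keys at distance d, then average over queries.
        profile[d] = (attn * (dist == d)).sum(axis=1).mean()
    return profile
```

Under this definition, a head that attends mostly to itself and its immediate neighbors concentrates its mass in the first entries of the profile, while a head with global attention spreads it across all distances.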
Abstract
Vision transformer (ViT) expands the success of transformer models from sequential data to images. The model decomposes an image into many smaller patches and arranges them into a sequence. Multi-head self-attentions are then applied to the sequence to learn the attention between patches. Despite many successful interpretations of transformers on sequential data, little effort has been devoted to the interpretation of ViTs, and many questions remain unanswered. For example, among the numerous attention heads, which one is more important? How strong are individual patches attending to their spatial neighbors in different heads? What attention patterns have individual heads learned? In this work, we answer these questions through a visual analytics approach. Specifically, we first identify what heads are more important in ViTs by introducing multiple pruning-based metrics. Then, we…
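The patch decomposition and patch-to-patch attention described in the abstract can be sketched in a few lines of NumPy. This is a simplified illustration, not the model studied in the paper: it computes the attention weights of a single head only, and it omits the learned patch embedding, position embeddings, and class token that a real ViT uses. The function names and the projection matrices `w_q` and `w_k` are hypothetical.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into a sequence of flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, C) -> (gh, gw, patch, patch, C)
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    # One row per patch, in row-major grid order.
    return x.reshape(gh * gw, patch * patch * c)

def self_attention_weights(tokens, w_q, w_k):
    """Attention weights of one head: softmax(Q K^T / sqrt(d))."""
    q = tokens @ w_q
    k = tokens @ w_k
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Subtract the row maximum for a numerically stable softmax.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

# Example: an 8x8 RGB image cut into four 4x4 patches.
rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
tokens = patchify(image, patch=4)           # shape (4, 48)
w_q = rng.normal(size=(48, 16))
w_k = rng.normal(size=(48, 16))
attn = self_attention_weights(tokens, w_q, w_k)  # shape (4, 4)
```

Row i of `attn` gives how strongly patch i attends to every patch in the sequence; matrices of this kind, one per head, are the objects the questions in the abstract are about.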
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Neural Network Applications · Visual perception and processing mechanisms
Methods: Visual Analytics
