What do Vision Transformers Learn? A Visual Exploration
Amin Ghiasi, Hamid Kazemi, Eitan Borgnia, Steven Reich, Manli Shu, Micah Goldblum, Andrew Gordon Wilson, Tom Goldstein

TL;DR
This paper visualizes and analyzes what vision transformers learn, revealing their reliance on semantic concepts, background features, and spatial information, and compares their behavior to CNNs across various models.
Contribution
It introduces methods to visualize ViTs, uncovers their semantic and background feature detection, and compares their internal representations to CNNs across multiple ViT variants.
Findings
ViTs trained with language supervision activate on semantic concepts.
Transformers detect image background features, just as CNNs do.
Spatial information is preserved in all layers except the last.
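One rough way to probe the last finding is to check how similar the patch tokens of a given layer are to one another: if the tokens become nearly identical, location-specific information has largely been mixed away. The sketch below is an illustrative proxy only, not the authors' procedure. The function names and the toy token lists are hypothetical, and in practice the tokens would be read from a real model's intermediate layers.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(tokens):
    """Average cosine similarity over all distinct pairs of patch tokens."""
    total, count = 0.0, 0
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            total += cosine(tokens[i], tokens[j])
            count += 1
    return total / count


# Hypothetical toy data (assumption): three patch tokens per "layer".
# Distinct tokens stand in for a layer that keeps patches distinguishable.
early_layer = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Near-identical tokens stand in for a layer where patches have collapsed.
last_layer = [[1.0, 1.0, 1.0], [1.0, 1.0, 0.9], [0.9, 1.0, 1.0]]

print("early layer:", mean_pairwise_cosine(early_layer))
print("last layer: ", mean_pairwise_cosine(last_layer))
```

A value close to 1 for one layer and much lower values for the others would be consistent with the finding, though it would not by itself establish it.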
Abstract
Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision, yet we understand very little about why they work and what they learn. While existing studies visually analyze the mechanisms of convolutional neural networks, an analogous exploration of ViTs remains challenging. In this paper, we first address the obstacles to performing visualizations on ViTs. Assisted by these solutions, we observe that neurons in ViTs trained with language model supervision (e.g., CLIP) are activated by semantic concepts rather than visual features. We also explore the underlying differences between ViTs and CNNs, and we find that transformers detect image background features, just like their convolutional counterparts, but their predictions depend far less on high-frequency information. On the other hand, both architecture types behave similarly in the way features…
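The claim about high-frequency information can be made concrete with a small sketch. This summary does not say how the authors filter their images, so the code below swaps in a simple box blur as a stand-in low-pass filter and measures how much energy the blur removes. All names (`box_blur`, `high_freq_energy`) and the toy images are hypothetical. A real experiment would feed the filtered images to a trained model and compare its predictions.

```python
def box_blur(img, radius=1):
    """Low-pass filter: replace each pixel by the mean of its neighborhood."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Neighborhood is clipped at the image border.
            vals = [
                img[yy][xx]
                for yy in range(max(0, y - radius), min(h, y + radius + 1))
                for xx in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            out[y][x] = sum(vals) / len(vals)
    return out


def high_freq_energy(img, radius=1):
    """Mean squared difference between an image and its blurred version."""
    low = box_blur(img, radius)
    h, w = len(img), len(img[0])
    total = sum((img[y][x] - low[y][x]) ** 2 for y in range(h) for x in range(w))
    return total / (h * w)


# Hypothetical toy images (assumption): a checkerboard is dominated by
# high-frequency content, while a constant image has none.
checker = [[float((x + y) % 2) for x in range(8)] for y in range(8)]
flat = [[0.5] * 8 for _ in range(8)]

print("checkerboard:", high_freq_energy(checker))
print("flat:        ", high_freq_energy(flat))
print("blurred checkerboard:", high_freq_energy(box_blur(checker)))
```

Under these assumptions, a model whose predictions change little when `box_blur` is applied to its inputs would be relying less on high-frequency content than one whose predictions change a lot.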
Peer Reviews
Decision · Submitted to ICLR 2024
This work addresses an important problem in deep learning, namely understanding how vision transformers work, and sheds light on these black boxes. This is certainly helpful, in particular to the vision community. The paper is generally well organized and well written. Prior research is also adequately covered. The findings, although not all novel, are interesting. In particular, I find the finding that transformers make better use of background and foreground information,
The work still does not get into the meat of what transformers really do. For example, what do the key, query, and value do, and what makes them more effective? It is shown that transformers use foreground and background more effectively, but not why that happens. Another important aspect is how the key, query, and value operations relate to convolution. Some visualization may help provide insight here. Minor issues: Page 3, "Recent work ?": missing reference.
- Visualization features for ViTs is an important but largely neglected topic. This work presents some solid feature visualization results and may inspire the community on related research.
- The novelty of the visualization method is limited. It mainly borrows the method of Olah et al. 2017 and adapts it to ViTs with some engineering tweaks.
- Some observations of this work are not new. For example, the authors find that ViTs maintain spatial information in all layers except the last one, and that the last layer produces very similar patch tokens. This behavior has been pointed out by some existing papers. Check “DeepViT: Towards Deeper Vision Transformer 2021.” It has shown that patch
1. The finding that ViTs learn to preserve spatial information despite lacking the inductive bias of CNNs.
2. The finding that ViTs' spatial information is lost in the last layer.
3. The authors examine text-guided ViTs such as CLIP in a different way than existing work, which I think is an important contribution that will help the community understand future vision-language models.
1. Section 2.1, last paragraph: reference missing in 'Related work ?'.
2. Section 3, line 4: reference missing in 'augmentation ensembling ?'.
3. The authors claim that ViTs learn to preserve spatial information despite lacking the inductive bias of CNNs, but that this property disappears in the last layer. The authors seem unsure why (Section 4, page 5). This is a key finding of the paper and needs more theoretical and/or experimental support.
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Visual Attention and Saliency Detection
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Attention Dropout · Residual Connection · Softmax · Layer Normalization · Dropout · Co-Scale Conv-attentional Image Transformer · Feedforward Network
