Do Vision Transformers See Like Convolutional Neural Networks?
Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, Alexey Dosovitskiy

TL;DR
This paper compares the internal representations of Vision Transformers (ViTs) and CNNs. It finds fundamental differences between the two, traces them to self-attention and residual connections, and discusses the implications for spatial localization and transfer learning.
Contribution
It provides a detailed analysis of how ViTs differ from CNNs internally, highlighting the impact of self-attention and residual connections on their representations and spatial information preservation.
Findings
ViTs have more uniform representations across layers.
Self-attention enables early global information aggregation.
ViTs preserve input spatial information effectively.
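A claim such as "more uniform representations across layers" rests on a way of scoring how similar two layers' activations are. A minimal sketch of one such score, linear centered kernel alignment (CKA), is given here in its simple full-batch form. It is an illustration under assumptions, not necessarily the exact estimator used in the paper; the function name `linear_cka` and the synthetic activations are hypothetical.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two activation matrices.

    x: (n_examples, d1) activations from one layer.
    y: (n_examples, d2) activations from another layer, same examples.
    Returns a similarity score in [0, 1].
    """
    # Center each feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by
    # the Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)


# Synthetic stand-ins for layer activations (assumption: random data).
rng = np.random.default_rng(0)
acts_a = rng.normal(size=(256, 64))

# A rotated copy of the same features: linear CKA should be close to 1.
q, _ = np.linalg.qr(rng.normal(size=(64, 64)))
acts_rotated = acts_a @ q

# Independent features: linear CKA should be much lower.
acts_unrelated = rng.normal(size=(256, 32))

cka_same = linear_cka(acts_a, acts_rotated)
cka_diff = linear_cka(acts_a, acts_unrelated)
```

Computing this score for every pair of layers gives a layer-by-layer similarity heatmap. A network whose representations are uniform across depth would show high values far from the diagonal.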
Abstract
Convolutional neural networks (CNNs) have so far been the de-facto model for visual data. Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks. This raises a central question: how are Vision Transformers solving these tasks? Are they acting like convolutional networks, or learning entirely different visual representations? Analyzing the internal representation structure of ViTs and CNNs on image classification benchmarks, we find striking differences between the two architectures, such as ViT having more uniform representations across all layers. We explore how these differences arise, finding crucial roles played by self-attention, which enables early aggregation of global information, and ViT residual connections, which strongly propagate features from lower to higher layers. We study the…
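The abstract credits self-attention with early aggregation of global information. One way to quantify how local or global an attention head is would be a mean attention distance: the average spatial distance between each query patch and the patches it attends to, weighted by the attention weights. The sketch below assumes a single head's attention matrix over patch tokens only (any class token already removed) laid out on a square grid; the function name and the toy attention matrices are hypothetical.

```python
import numpy as np


def mean_attention_distance(attn, grid_size, patch_size):
    """Average attention-weighted pixel distance for one head.

    attn: (N, N) attention weights, rows sum to 1, N = grid_size**2.
    grid_size: number of patches along one image side.
    patch_size: side length of a patch in pixels.
    """
    n = grid_size * grid_size
    assert attn.shape == (n, n)
    # Pixel coordinates of each patch in row-major order.
    rows, cols = np.divmod(np.arange(n), grid_size)
    coords = np.stack([rows, cols], axis=1) * patch_size
    # Pairwise Euclidean distances between patch positions.
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Weight distances by attention, sum per query, average over queries.
    return float((attn * dists).sum(axis=1).mean())


grid, patch = 4, 16
n_tokens = grid * grid

# A purely local head: every patch attends only to itself.
local_attn = np.eye(n_tokens)
# A fully global head: every patch attends uniformly to all patches.
global_attn = np.full((n_tokens, n_tokens), 1.0 / n_tokens)

local_dist = mean_attention_distance(local_attn, grid, patch)
global_dist = mean_attention_distance(global_attn, grid, patch)
```

Under this measure, heads in early layers that already show large distances would be consistent with the early global aggregation described above.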
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Average Pooling · Layer Normalization · Adam · Label Smoothing
