How Do Vision Transformers Work?
Namuk Park, Songkuk Kim

TL;DR
This paper investigates the inner workings of Vision Transformers, revealing how multi-head self-attention improves accuracy and generalization, and proposing AlterNet, a hybrid model that outperforms CNNs across data regimes.
Contribution
The paper provides fundamental explanations of how multi-head self-attentions (MSAs) work in Vision Transformers and introduces AlterNet, a hybrid architecture that outperforms conventional CNNs.
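To make the idea of a hybrid layout concrete, here is a minimal sketch of a stage plan in which some Conv blocks are swapped for MSA blocks. It is an illustration under assumptions, not the authors' implementation: the function name `build_stage_layout`, the `msa_stages` parameter, and the choice to swap only the final block of each selected stage are all hypothetical simplifications. The block counts (3, 4, 6, 3) are the standard ResNet-50 stage depths, used here only as a familiar baseline.

```python
def build_stage_layout(blocks_per_stage, msa_stages):
    """Return one list of block types per stage.

    blocks_per_stage: number of blocks in each stage of a baseline CNN.
    msa_stages: indices of stages whose final Conv block becomes an MSA block.
    Both names are hypothetical; this is a layout plan, not a trainable model.
    """
    layout = []
    for stage_index, num_blocks in enumerate(blocks_per_stage):
        stage = ["conv"] * num_blocks
        if stage_index in msa_stages and num_blocks > 0:
            # Assumption: swap only the block that closes the stage.
            stage[-1] = "msa"
        layout.append(stage)
    return layout


baseline = (3, 4, 6, 3)  # ResNet-50-style stage depths
layout = build_stage_layout(baseline, msa_stages={1, 2, 3})

for index, stage in enumerate(layout):
    print(f"stage {index}: {' -> '.join(stage)}")
```

Keeping the plan as plain data separates the question of where MSA blocks go from how each block is implemented, so different placements can be compared without touching model code.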
Findings
MSAs improve accuracy and generalization by flattening loss landscapes.
MSAs act as low-pass filters, whereas convolutions (Convs) act as high-pass filters, so the two are complementary.
AlterNet, which replaces Conv blocks at the end of a stage with MSA blocks, outperforms CNNs in both large and small data regimes.
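The low-pass versus high-pass contrast can be illustrated with a toy 1D signal. The sketch below is not taken from the paper and does not use trained models. It assumes a position-only attention (softmax over negative squared distance), whereas real self-attention weights depend on content, and it uses a fixed difference kernel `[-1, 2, -1]` as a stand-in for a high-pass convolution. All function names are hypothetical. The point is only that a non-negative, normalized weighted average attenuates high frequencies, while a difference kernel emphasizes them.

```python
import cmath
import math


def dft_magnitude(signal, k):
    """Magnitude of the k-th DFT bin of a real 1D signal."""
    n = len(signal)
    return abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                   for t, x in enumerate(signal)))


def circular_distance(i, j, n):
    d = abs(i - j)
    return min(d, n - d)


def attention_like_average(signal, temperature=2.0):
    """Each output is a softmax-weighted average of all positions.

    Assumption: scores depend on position only (negative squared distance),
    so the weights are positive and sum to 1, like attention weights.
    """
    n = len(signal)
    out = []
    for i in range(n):
        scores = [-circular_distance(i, j, n) ** 2 / temperature for j in range(n)]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append(sum(e / total * x for e, x in zip(exps, signal)))
    return out


def difference_kernel(signal):
    """Circular convolution with [-1, 2, -1], a simple high-pass kernel."""
    n = len(signal)
    return [2 * signal[i] - signal[i - 1] - signal[(i + 1) % n] for i in range(n)]


def high_to_low_ratio(signal, low_bin, high_bin):
    return dft_magnitude(signal, high_bin) / dft_magnitude(signal, low_bin)


N, LOW, HIGH = 32, 1, 12
x = [math.sin(2 * math.pi * LOW * t / N) + math.sin(2 * math.pi * HIGH * t / N)
     for t in range(N)]

input_ratio = high_to_low_ratio(x, LOW, HIGH)
smooth_ratio = high_to_low_ratio(attention_like_average(x), LOW, HIGH)
sharp_ratio = high_to_low_ratio(difference_kernel(x), LOW, HIGH)

print(f"high/low ratio, input:          {input_ratio:.3f}")
print(f"high/low ratio, averaging:      {smooth_ratio:.3f}")
print(f"high/low ratio, difference:     {sharp_ratio:.3f}")
```

The averaging step drives the high-to-low frequency ratio below that of the input, and the difference kernel drives it above. A learned convolution is not restricted to difference kernels, so this toy shows the mechanism behind the claim and does not reproduce the paper's measurements.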
Abstract
The success of multi-head self-attentions (MSAs) for computer vision is now indisputable. However, little is known about how MSAs work. We present fundamental explanations to help better understand the nature of MSAs. In particular, we demonstrate the following properties of MSAs and Vision Transformers (ViTs): (1) MSAs improve not only accuracy but also generalization by flattening the loss landscapes. Such improvement is primarily attributable to their data specificity, not long-range dependency. On the other hand, ViTs suffer from non-convex losses. Large datasets and loss landscape smoothing methods alleviate this problem; (2) MSAs and Convs exhibit opposite behaviors. For example, MSAs are low-pass filters, but Convs are high-pass filters. Therefore, MSAs and Convs are complementary; (3) Multi-stage neural networks behave like a series connection of small individual models. In…
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Neural Network Applications · Visual perception and processing mechanisms
Methods: AlterNet
