On the Surprising Effectiveness of Attention Transfer for Vision Transformers
Alexander C. Li, Yuandong Tian, Beidi Chen, Deepak Pathak, Xinlei Chen

TL;DR
This paper demonstrates that attention patterns alone, transferred from a pre-trained Vision Transformer, are sufficient for training high-performing models from scratch, challenging the traditional view that feature pre-training is essential.
Contribution
It introduces attention transfer as a simple method to achieve competitive performance without relying on learned features from pre-training.
Findings
Attention transfer achieves comparable accuracy to fine-tuning.
Ensembling an attention-transferred student with a fine-tuned teacher further improves accuracy on ImageNet.
Under distribution shift, attention transfer underperforms fine-tuning.
Abstract
Conventional wisdom suggests that pre-training Vision Transformers (ViT) improves downstream performance by learning useful representations. Is this actually true? We investigate this question and find that the features and representations learned during pre-training are not essential. Surprisingly, using only the attention patterns from pre-training (i.e., guiding how information flows between tokens) is sufficient for models to learn high-quality features from scratch and achieve comparable downstream performance. We show this by introducing a simple method called attention transfer, where only the attention patterns from a pre-trained teacher ViT are transferred to a student, either by copying or distilling the attention maps. Since attention transfer lets the student learn its own features, ensembling it with a fine-tuned teacher also further improves accuracy on ImageNet. We…
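The two variants named in the abstract, copying and distilling attention maps, can be illustrated with a minimal single-head sketch in NumPy. This is not the authors' implementation: the function names, the random stand-in weights, and the choice of a cross-entropy distillation loss are all assumptions made for illustration, and a real ViT would apply this per layer and per head.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_map(x, w_q, w_k):
    # Single-head attention map softmax(Q K^T / sqrt(d)) for tokens x of shape (n, d).
    q, k = x @ w_q, x @ w_k
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))

def attention_copy(x_student, w_v_student, teacher_attn):
    # "Copy" variant (hypothetical helper): the student computes its own values
    # but mixes them across tokens using the teacher's attention map.
    return teacher_attn @ (x_student @ w_v_student)

def attention_distill_loss(student_attn, teacher_attn, eps=1e-9):
    # "Distill" variant (hypothetical helper): cross-entropy between teacher and
    # student attention rows, averaged over query tokens. The loss choice is an assumption.
    return float(-(teacher_attn * np.log(student_attn + eps)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
n, d = 4, 8                       # 4 tokens, 8 channels (toy sizes)
x = rng.normal(size=(n, d))       # stand-in token embeddings

# Random matrices stand in for pre-trained teacher and untrained student weights.
teacher_attn = attention_map(x, rng.normal(size=(d, d)), rng.normal(size=(d, d)))
student_attn = attention_map(x, rng.normal(size=(d, d)), rng.normal(size=(d, d)))

out = attention_copy(x, rng.normal(size=(d, d)), teacher_attn)
loss = attention_distill_loss(student_attn, teacher_attn)
```

In both variants the student's value projection, and hence its features, are learned by the student itself; only the token-to-token routing comes from the teacher. In the distillation variant the loss would be added to the task loss during training so that the student learns to produce the routing on its own.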
Taxonomy
Topics: CCD and CMOS Imaging Sensors · Infrared Target Detection Methodologies
Methods: Softmax · Attention Is All You Need
