On the Surprising Effectiveness of Attention Transfer for Vision Transformers

Alexander C. Li; Yuandong Tian; Beidi Chen; Deepak Pathak; Xinlei Chen

arXiv:2411.09702·cs.LG·November 15, 2024

TL;DR

This paper demonstrates that attention patterns alone, transferred from a pre-trained Vision Transformer, are sufficient for training high-performing models from scratch, challenging the traditional view that feature pre-training is essential.

Contribution

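The paper introduces attention transfer, a simple method that passes only the attention patterns of a pre-trained teacher ViT to a student, either by copying or by distilling the attention maps. The student reaches competitive performance without reusing the features learned during pre-training.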

Findings

01

Attention transfer achieves downstream accuracy comparable to fine-tuning.

02

Ensembling an attention-transferred student with a fine-tuned teacher further improves accuracy on ImageNet.

03

Attention patterns are sufficient for learning high-quality features from scratch, although attention transfer underperforms fine-tuning under distribution shift.
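The ensembling finding can be illustrated with a minimal sketch. It assumes the ensemble averages the class probabilities of the two models, which this page does not confirm; the function name `ensemble_predict` and the toy logits are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def ensemble_predict(teacher_logits, student_logits):
    """Average the two models' class probabilities, then take the argmax.

    Assumption: equal-weight probability averaging. The source only says
    the two models are ensembled, not how.
    """
    probs = 0.5 * (softmax(teacher_logits) + softmax(student_logits))
    return probs.argmax(axis=-1)

# Toy logits for 2 images and 3 classes (made-up numbers).
teacher_logits = np.array([[2.0, 1.0, 0.0], [0.0, 0.2, 0.1]])
student_logits = np.array([[1.5, 1.8, 0.0], [0.0, 0.1, 3.0]])

teacher_pred = teacher_logits.argmax(axis=-1)
student_pred = student_logits.argmax(axis=-1)
ensemble_pred = ensemble_predict(teacher_logits, student_logits)
```

In this toy case the two models disagree on both images, and the ensemble sides with whichever model is more confident. An ensemble can only help when its members make different errors, which is plausible here because the student learns its own features.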

Abstract

Conventional wisdom suggests that pre-training Vision Transformers (ViT) improves downstream performance by learning useful representations. Is this actually true? We investigate this question and find that the features and representations learned during pre-training are not essential. Surprisingly, using only the attention patterns from pre-training (i.e., guiding how information flows between tokens) is sufficient for models to learn high-quality features from scratch and achieve comparable downstream performance. We show this by introducing a simple method called attention transfer, where only the attention patterns from a pre-trained teacher ViT are transferred to a student, either by copying or distilling the attention maps. Since attention transfer lets the student learn its own features, ensembling it with a fine-tuned teacher further improves accuracy on ImageNet. We…
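The two transfer modes named in the abstract, copying and distilling attention maps, can be sketched for a single attention head as below. This is an illustrative NumPy sketch and not the authors' implementation. It assumes a cross-entropy distillation loss and, for brevity, feeds the same input to teacher and student. All function and variable names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention_map(x, w_q, w_k):
    """Attention map softmax(Q K^T / sqrt(d)) for one head; rows sum to 1."""
    q, k = x @ w_q, x @ w_k
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))

def attention_copy(x, teacher_attn, w_v):
    """Copy mode: the teacher's map mixes the student's own values.

    Only the value projection w_v is a student parameter here.
    """
    return teacher_attn @ (x @ w_v)

def attention_distill_loss(student_attn, teacher_attn, eps=1e-9):
    """Distill mode (assumed loss): cross-entropy of each student attention
    row against the matching teacher row, averaged over query tokens."""
    per_query = -(teacher_attn * np.log(student_attn + eps)).sum(axis=-1)
    return float(per_query.mean())

rng = np.random.default_rng(0)
n_tokens, dim = 4, 8

def init_weight():
    # Scaled Gaussian init so attention logits stay O(1).
    return rng.normal(scale=dim ** -0.5, size=(dim, dim))

x = rng.normal(size=(n_tokens, dim))          # token embeddings
teacher_q, teacher_k = init_weight(), init_weight()
student_q, student_k, student_v = init_weight(), init_weight(), init_weight()

teacher_attn = attention_map(x, teacher_q, teacher_k)
student_attn = attention_map(x, student_q, student_k)

copied_out = attention_copy(x, teacher_attn, student_v)
loss = attention_distill_loss(student_attn, teacher_attn)
loss_at_match = attention_distill_loss(teacher_attn, teacher_attn)
```

In copy mode the student never computes its own queries and keys, so the teacher must still run at inference time. In distill mode the student learns to produce the maps itself and can run alone afterwards. The loss is smallest when the student's rows match the teacher's, which is why `loss_at_match` is lower than `loss` for randomly initialized students.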

Code & Models

Repositories

alexlioralexli/attention-transfer
PyTorch · Official

Taxonomy

Topics: CCD and CMOS Imaging Sensors · Infrared Target Detection Methodologies

Methods: Softmax · Attention Is All You Need