Attention Transfer Is Not Universally Effective for Vision Transformers
Huaiyuan Qin, Muli Yang, Gabriel James Goenawan, Peng Hu, Chen Gong, Xi Peng, Hongyuan Zhu

TL;DR
This paper critically evaluates Attention Transfer in Vision Transformers. It shows that architectural compatibility between teacher and student is crucial for successful transfer, and that adding the teacher's native components to the student overcomes the failures it identifies.
Contribution
The paper demonstrates that attention transfer fails when the student and teacher architectures differ, and shows that adding the teacher's native components to the student restores transfer effectiveness.
Findings
Attention transfer success depends on an architectural match between teacher and student.
Adding the teacher's native components to the student reverses the transfer failure.
The failure is not due to the choice of transfer loss or training recipe.
Abstract
A recent work shows that Attention Transfer, which transfers only the attention patterns from a pre-trained teacher Vision Transformer (ViT) to a randomly initialized standard student ViT, is sufficient to recover the full benefit of the teacher's pre-trained weights. We revisit this finding on a comprehensive benchmark of 20 teachers from 11 well-known ViT families and reveal that Attention Transfer is not universally effective. While 7 families transfer successfully, 4 consistently fail, falling up to 5.1% below the from-scratch no-transfer baseline. Further results demonstrate that this failure is family-consistent across model sizes, and persists under extended training durations, different transfer datasets, and out-of-distribution evaluations. Controlled analyses then consistently localize the problem to the attention-routing channel, indicating that the key issue is not whether…
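To make the idea of transferring "only the attention patterns" concrete, here is a minimal sketch of one possible attention-matching objective. It is not the paper's implementation. The KL-divergence loss, the function names (`softmax`, `attention_map`, `attention_transfer_loss`), and the toy 3-token logits are all assumptions chosen for illustration. The summary above only states that the failure does not depend on the choice of transfer loss, so other losses would serve equally well here.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of attention logits.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(scores):
    # Row-wise softmax: each query token gets a distribution over key tokens.
    return [softmax(row) for row in scores]

def attention_transfer_loss(teacher_attn, student_attn, eps=1e-12):
    # Hypothetical objective: mean per-query KL(teacher || student).
    # The student is pushed to route attention the way the teacher does.
    total = 0.0
    for t_row, s_row in zip(teacher_attn, student_attn):
        total += sum(
            t * (math.log(t + eps) - math.log(s + eps))
            for t, s in zip(t_row, s_row)
        )
    return total / len(teacher_attn)

# Toy logits (made up): a "pre-trained" teacher with peaked attention and
# a freshly initialized student whose attention is uniform.
teacher = attention_map([[2.0, 0.5, 0.1], [0.3, 1.8, 0.2], [0.1, 0.4, 2.2]])
student = attention_map([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])

loss_mismatch = attention_transfer_loss(teacher, student)  # positive
loss_match = attention_transfer_loss(teacher, teacher)     # zero
```

In a real setup this term would be computed per layer and per head on the attention maps of both ViTs and minimized with respect to the student's parameters only. The sketch omits that machinery and shows only the quantity being matched.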
