DeepViT: Towards Deeper Vision Transformer
Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, Jiashi Feng

TL;DR
This paper identifies the attention collapse issue in deep vision transformers and proposes a re-attention method to enhance their depth and performance, achieving significant accuracy improvements on ImageNet.
Contribution
It introduces Re-attention, a simple technique to diversify attention maps in deep ViTs, enabling effective training of deeper models with improved accuracy.
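As a rough illustration of the idea, the sketch below mixes per-head attention maps with an H x H matrix, which is how the DeepViT paper describes Re-attention: Norm(Θ^T Softmax(QK^T/√d)) V. It is a minimal pure-Python sketch, not the authors' implementation: the names `attention_maps`, `mix_heads`, and `theta` are hypothetical, `theta` is a fixed matrix here rather than a learned one, and the normalization step and the multiplication by V are omitted.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_maps(q, k):
    """Per-head softmax(Q K^T / sqrt(d)).

    q, k: nested lists shaped [heads][tokens][dim].
    Returns maps shaped [heads][tokens][tokens].
    """
    maps = []
    for q_h, k_h in zip(q, k):
        scale = math.sqrt(len(q_h[0]))
        maps.append([
            softmax([sum(a * b for a, b in zip(q_i, k_j)) / scale for k_j in k_h])
            for q_i in q_h
        ])
    return maps

def mix_heads(maps, theta):
    """Recombine attention maps across heads.

    out[g] = sum_h theta[h][g] * maps[h], i.e. Theta^T applied on the head axis.
    theta is a hypothetical stand-in for the learnable H x H matrix.
    """
    heads, tokens = len(maps), len(maps[0])
    return [
        [[sum(theta[h][g] * maps[h][i][j] for h in range(heads))
          for j in range(tokens)]
         for i in range(tokens)]
        for g in range(heads)
    ]

# Toy input: 2 heads, 2 tokens, dim 2 (values are illustrative only).
q = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 2.0], [2.0, 0.0]]]
k = q
maps = attention_maps(q, k)
theta = [[0.7, 0.3], [0.3, 0.7]]  # each column sums to 1, so map rows still sum to 1
mixed = mix_heads(maps, theta)
```

With an identity `theta` the maps pass through unchanged, which corresponds to ordinary multi-head attention. Off-diagonal entries let each output head borrow from the others, which is the mechanism the paper relies on to diversify attention maps.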
Findings
Deeper ViTs suffer from attention collapse: attention maps in later layers become nearly identical, so added depth yields little performance gain.
Re-attention increases attention diversity with minimal cost.
A 32-layer deep ViT trained with Re-attention improves Top-1 accuracy on ImageNet by 1.6%.
Abstract
Vision transformers (ViTs) have been successfully applied in image classification tasks recently. In this paper, we show that, unlike convolutional neural networks (CNNs) that can be improved by stacking more convolutional layers, the performance of ViTs saturates fast when scaled to be deeper. More specifically, we empirically observe that such scaling difficulty is caused by the attention collapse issue: as the transformer goes deeper, the attention maps gradually become similar and even much the same after certain layers. In other words, the feature maps tend to be identical in the top layers of deep ViT models. This fact demonstrates that in deeper layers of ViTs, the self-attention mechanism fails to learn effective concepts for representation learning and hinders the model from getting the expected performance gain. Based on the above observation, we propose a simple yet effective method,…
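One way to make "attention maps become similar" concrete is to compare the maps of two layers with cosine similarity. The sketch below is an assumption-laden illustration, not the paper's exact metric: `cosine` and `layer_similarity` are hypothetical helper names, each map is a plain [tokens][tokens] nested list for a single head, and similarity is averaged over rows.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def layer_similarity(map_p, map_q):
    """Mean row-wise cosine similarity between two attention maps.

    map_p, map_q: [tokens][tokens] maps from two different layers
    (same head). Values near 1 indicate the layers attend almost identically.
    """
    sims = [cosine(row_p, row_q) for row_p, row_q in zip(map_p, map_q)]
    return sum(sims) / len(sims)

# Illustrative maps only; real maps would come from a trained model.
shallow = [[0.9, 0.1], [0.2, 0.8]]
deep_a = [[0.5, 0.5], [0.5, 0.5]]
deep_b = [[0.5, 0.5], [0.5, 0.5]]

sim_collapsed = layer_similarity(deep_a, deep_b)   # two identical deep layers
sim_distinct = layer_similarity(shallow, deep_a)   # shallow vs. deep layer
```

In a collapsed model, the similarity between adjacent deep layers would sit close to 1 for most heads, whereas a healthy model would show lower values as depth increases.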
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Brain Tumor Detection and Classification
Methods: Re-Attention Module · DeepViT · Convolution
