DeepViT: Towards Deeper Vision Transformer

Daquan Zhou; Bingyi Kang; Xiaojie Jin; Linjie Yang; Xiaochen Lian; Zihang Jiang; Qibin Hou; Jiashi Feng

arXiv:2103.11886·cs.CV·April 20, 2021·349 cites

TL;DR

This paper identifies the attention collapse issue in deep vision transformers and proposes Re-attention, a method that lets deeper models train effectively; a 32-layer model gains 1.6% Top-1 accuracy on ImageNet.

Contribution

It introduces Re-attention, a simple technique to diversify attention maps in deep ViTs, enabling effective training of deeper models with improved accuracy.
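
The idea of diversifying attention maps can be illustrated with a minimal NumPy sketch. It assumes that Re-attention mixes the per-head attention maps through a learnable head-by-head matrix before they are applied to the values; this page does not spell out the formulation, so the names `re_attention` and `theta`, the tensor shapes, and the omission of any normalisation step are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def re_attention(q, k, v, theta):
    """Hypothetical sketch of head-mixing attention.

    q, k, v: arrays of shape (heads, tokens, dim).
    theta:   (heads, heads) mixing matrix (learnable in a real model).
    Returns the attended values and the mixed attention maps.
    """
    d = q.shape[-1]
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))  # (H, T, T)
    # New head g is a weighted sum of the original heads h.
    mixed = np.einsum("hg,hij->gij", theta, attn)
    # Assumption: no normalisation is applied to the mixed maps here.
    return mixed @ v, mixed

rng = np.random.default_rng(0)
H, T, D = 4, 6, 8
q, k, v = (rng.normal(size=(H, T, D)) for _ in range(3))
out, maps = re_attention(q, k, v, np.eye(H))
```

With `theta` set to the identity matrix the sketch reduces to ordinary multi-head self-attention, which makes it easy to check against a baseline before experimenting with other mixing matrices.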

Findings

01

Deeper ViTs suffer from attention collapse: attention maps grow increasingly similar across layers, so added depth yields little accuracy gain.

02

Re-attention increases attention-map diversity across layers at negligible computation and memory cost.

03

A deep ViT with 32 transformer blocks improves Top-1 accuracy by 1.6% on ImageNet.

Abstract

Vision transformers (ViTs) have been successfully applied in image classification tasks recently. In this paper, we show that, unlike convolutional neural networks (CNNs) that can be improved by stacking more convolutional layers, the performance of ViTs saturates fast when scaled to be deeper. More specifically, we empirically observe that such scaling difficulty is caused by the attention collapse issue: as the transformer goes deeper, the attention maps gradually become similar and even much the same after certain layers. In other words, the feature maps tend to be identical in the top layers of deep ViT models. This fact demonstrates that in deeper layers of ViTs, the self-attention mechanism fails to learn effective concepts for representation learning and hinders the model from getting the expected performance gain. Based on the above observation, we propose a simple yet effective method,…
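
One plausible way to quantify the observation that attention maps "gradually become similar" is to compare the maps of two layers with cosine similarity. The sketch below is an assumption about how such a measurement could be set up, not the authors' evaluation code; the function name `cross_layer_similarity` and the (heads, tokens, tokens) layout are hypothetical.

```python
import numpy as np

def cross_layer_similarity(attn_a, attn_b):
    """Mean cosine similarity between matching attention rows of two layers.

    attn_a, attn_b: arrays of shape (heads, tokens, tokens), where each
    row is one query token's attention distribution.
    """
    a = attn_a.reshape(-1, attn_a.shape[-1])
    b = attn_b.reshape(-1, attn_b.shape[-1])
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float((num / den).mean())

# Toy maps: every token attends only to itself, or only to its neighbour.
self_only = np.eye(4)[None]                       # shape (1, 4, 4)
neighbour = np.roll(np.eye(4), 1, axis=1)[None]   # shape (1, 4, 4)
same = cross_layer_similarity(self_only, self_only)
different = cross_layer_similarity(self_only, neighbour)
```

A value near 1 for a pair of layers means their attention maps are almost identical, which is the pattern described as attention collapse; values closer to 0 indicate that the layers attend to different tokens.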

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Brain Tumor Detection and Classification

Methods: Re-Attention Module · DeepViT · Convolution