ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias

Yufei Xu; Qiming Zhang; Jing Zhang; Dacheng Tao

arXiv:2106.03348 · cs.CV · December 28, 2021 · 155 cites

TL;DR

ViTAE introduces a novel vision transformer that incorporates intrinsic local and scale-invariance inductive biases through convolutional modules, enhancing feature learning and performance on vision tasks.

Contribution

The paper proposes ViTAE, a vision transformer that embeds multi-scale local features and scale invariance using convolutional modules, improving robustness and efficiency.

Findings

01 Outperforms baseline transformers on ImageNet

02 Achieves better feature representation for multi-scale objects

03 Demonstrates superior results on downstream vision tasks

Abstract

Transformers have shown great potential in various computer vision tasks owing to their strong capability in modeling long-range dependency using the self-attention mechanism. Nevertheless, vision transformers treat an image as a 1D sequence of visual tokens, lacking an intrinsic inductive bias (IB) in modeling local visual structures and dealing with scale variance. Alternatively, they require large-scale training data and longer training schedules to learn the IB implicitly. In this paper, we propose a novel Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE. Technically, ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context by using multiple convolutions with different dilation rates. In this way, it acquires an intrinsic scale invariance IB and is able to learn robust…
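The multi-scale embedding described in the abstract rests on a general property of dilated convolutions: with suitable padding, parallel branches that use different dilation rates produce feature maps of the same size while each covers a different spatial extent, so their outputs can be combined. The sketch below only illustrates that arithmetic. It is not the authors' configuration: the input size (224), kernel size (3), stride (4), dilation rates (1, 2, 3, 4), and all function names are assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def effective_kernel(kernel: int, dilation: int) -> int:
    """Spatial extent covered by a dilated kernel: d * (k - 1) + 1."""
    return dilation * (kernel - 1) + 1


def conv_output_size(size: int, kernel: int, stride: int,
                     dilation: int, padding: int) -> int:
    """Standard output-length formula for a strided, dilated convolution."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def pyramid_branches(size: int, kernel: int, stride: int,
                     dilations: Sequence[int]) -> List[Tuple[int, int, int]]:
    """Return (dilation, effective kernel, output size) for each branch.

    Padding is set to d * (k - 1) // 2 per branch (an assumption) so that
    every branch yields the same output size for an odd kernel.
    """
    branches = []
    for d in dilations:
        padding = d * (kernel - 1) // 2
        out = conv_output_size(size, kernel, stride, d, padding)
        branches.append((d, effective_kernel(kernel, d), out))
    return branches


if __name__ == "__main__":
    # Hypothetical settings, for illustration only.
    for d, extent, out in pyramid_branches(224, 3, 4, (1, 2, 3, 4)):
        print(f"dilation={d}: covers {extent} pixels, output size {out}")
```

With these assumed values, every branch produces a 56-wide output while the covered extent grows from 3 to 9 pixels, which is what allows branches at different scales to be merged into a single token embedding.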

Code & Models

Videos

ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias · SlidesLive

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Visual Attention and Saliency Detection

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Label Smoothing · Residual Connection · Dense Connections