ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias
Yufei Xu, Qiming Zhang, Jing Zhang, Dacheng Tao

TL;DR
ViTAE introduces a novel vision transformer that incorporates intrinsic local and scale-invariance inductive biases through convolutional modules, enhancing feature learning and performance on vision tasks.
Contribution
The paper proposes ViTAE, a vision transformer that embeds multi-scale local features and scale invariance using convolutional modules, improving robustness and efficiency.
Findings
Outperforms baseline transformers on ImageNet
Achieves better feature representation for multi-scale objects
Demonstrates superior results on downstream vision tasks
Abstract
Transformers have shown great potential in various computer vision tasks owing to their strong capability in modeling long-range dependency using the self-attention mechanism. Nevertheless, vision transformers treat an image as a 1D sequence of visual tokens, lacking an intrinsic inductive bias (IB) in modeling local visual structures and dealing with scale variance. Alternatively, they require large-scale training data and longer training schedules to learn the IB implicitly. In this paper, we propose a novel Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE. Technically, ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context by using multiple convolutions with different dilation rates. In this way, it acquires an intrinsic scale invariance IB and is able to learn robust…
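The abstract's description of the spatial pyramid reduction module (several convolutions with different dilation rates that downsample the image into tokens) can be illustrated with a small sketch. This is not the authors' implementation. It assumes a single-channel image, fixed 3x3 averaging kernels standing in for learned weights, and that each token is formed by concatenating the per-dilation responses at one output position. The names `dilated_conv2d` and `pyramid_reduction` are hypothetical.

```python
def dilated_conv2d(image, kernel, dilation, stride):
    """Zero-padded 2D cross-correlation with a dilated, odd-sized square kernel."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    pad = dilation * (k - 1) // 2  # centres the kernel, so output size ignores dilation
    out = []
    for i in range(0, h, stride):          # stride performs the downsampling
        row = []
        for j in range(0, w, stride):
            acc = 0.0
            for a in range(k):
                for b in range(k):
                    y = i + a * dilation - pad
                    x = j + b * dilation - pad
                    if 0 <= y < h and 0 <= x < w:  # out-of-range taps count as zero
                        acc += kernel[a][b] * image[y][x]
            row.append(acc)
        out.append(row)
    return out


def pyramid_reduction(image, kernels, dilations, stride):
    """Run one convolution per dilation rate and concatenate the responses per position.

    Assumption for illustration: one scalar feature per dilation rate, so each
    token has len(dilations) entries.
    """
    maps = [dilated_conv2d(image, k, d, stride) for k, d in zip(kernels, dilations)]
    out_h, out_w = len(maps[0]), len(maps[0][0])
    tokens = [[m[i][j] for m in maps] for i in range(out_h) for j in range(out_w)]
    return tokens, (out_h, out_w)


# Usage: an 8x8 image reduced with stride 4 and three dilation rates.
image = [[1.0] * 8 for _ in range(8)]
avg_kernel = [[1.0 / 9.0] * 3 for _ in range(3)]   # placeholder for learned weights
dilations = [1, 2, 3]
tokens, grid = pyramid_reduction(image, [avg_kernel] * 3, dilations, stride=4)
print(grid, len(tokens), len(tokens[0]))
```

Each token mixes responses whose receptive fields span 3, 5, and 7 pixels at the same location, which is the sense in which the embedding carries multi-scale context. A real module would use multi-channel learned convolutions rather than fixed averaging kernels.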
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Visual Attention and Saliency Detection
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Label Smoothing · Residual Connection · Dense Connections
