Incorporating Convolution Designs into Visual Transformers
Kun Yuan, Shaopeng Guo, Ziwei Liu, Aojun Zhou, Fengwei Yu, Wei Wu

TL;DR
This paper introduces CeiT, a convolution-enhanced image Transformer that combines the strengths of CNNs and Transformers, reaching comparable or better accuracy than prior models without large amounts of training data or extra supervision, and with fewer training iterations.
Contribution
It proposes a novel CeiT architecture with three key modifications, improving efficiency and accuracy over previous vision Transformers and CNNs.
Findings
CeiT outperforms previous Transformers and CNNs on ImageNet and downstream tasks.
CeiT requires fewer training iterations, reducing training cost.
CeiT achieves comparable or better accuracy without requiring a large amount of training data or extra supervision.
Abstract
Motivated by the success of Transformers in natural language processing (NLP) tasks, several attempts (e.g., ViT and DeiT) have emerged to apply Transformers to the vision domain. However, pure Transformer architectures often require a large amount of training data or extra supervision to obtain performance comparable to convolutional neural networks (CNNs). To overcome these limitations, we analyze the potential drawbacks of directly borrowing Transformer architectures from NLP. We then propose a new Convolution-enhanced image Transformer (CeiT), which combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies. Three modifications are made to the original Transformer: 1) instead of the straightforward tokenization from raw input images, we design an…
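For context, the "straightforward tokenization from raw input images" that the abstract contrasts against can be sketched as follows. This is a minimal illustration of ViT-style patch splitting under stated assumptions, not CeiT's own tokenization module: the function name `patchify`, the nested-list image layout (height x width x channels), and the toy input are all hypothetical, and the learned linear projection that would normally follow is omitted.

```python
from typing import List

# Assumed layout for illustration: image[row][col] is a list of channel values.
Image = List[List[List[float]]]


def patchify(image: Image, patch: int) -> List[List[float]]:
    """Split an image into non-overlapping patch x patch tiles and flatten each.

    Returns one flat vector per tile, in row-major tile order. A ViT-style
    model would then project each vector with a learned linear layer
    (not shown here).
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    tokens: List[List[float]] = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat: List[float] = []
            for row in image[top:top + patch]:
                for pixel in row[left:left + patch]:
                    flat.extend(pixel)  # append every channel of this pixel
            tokens.append(flat)
    return tokens


# Toy 4x4 single-channel image whose pixel values run from 0 to 15.
toy = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
tokens = patchify(toy, patch=2)
print(len(tokens), tokens[0])
```

Each token here sees only the raw pixels of its own tile, which is the property the abstract identifies as a limitation of tokenizing directly from the input image.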
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: Linear Layer · Convolution-enhanced image Transformer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Residual Connection · Layer Normalization · Adam · Dense Connections · Softmax
