Incorporating Convolution Designs into Visual Transformers

Kun Yuan; Shaopeng Guo; Ziwei Liu; Aojun Zhou; Fengwei Yu; Wei Wu

arXiv:2103.11816 · cs.CV · April 21, 2021 · 5 citations

TL;DR

This paper introduces CeiT, a convolution-enhanced Transformer for vision tasks that combines CNN and Transformer strengths, achieving high performance with less data and training time.

Contribution

The paper proposes the CeiT architecture, which makes three modifications to the original Transformer and improves efficiency and accuracy over previous vision Transformers and CNNs.

Findings

01 CeiT outperforms previous Transformers and CNNs on ImageNet and downstream tasks.

02 CeiT requires fewer training iterations, reducing training cost.

03 CeiT achieves comparable or better accuracy without large data or extra supervision.

Abstract

Motivated by the success of Transformers in natural language processing (NLP) tasks, there have been several attempts (e.g., ViT and DeiT) to apply Transformers to the vision domain. However, pure Transformer architectures often require a large amount of training data or extra supervision to reach performance comparable to convolutional neural networks (CNNs). To overcome these limitations, we analyze the potential drawbacks of directly borrowing Transformer architectures from NLP. We then propose a new Convolution-enhanced image Transformer (CeiT), which combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies. Three modifications are made to the original Transformer: 1) instead of the straightforward tokenization from raw input images, we design an…
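For readers unfamiliar with the baseline the abstract contrasts against, the "straightforward tokenization from raw input images" can be sketched roughly as below. This is only an illustration of the ViT-style baseline, not CeiT's replacement design, which the truncated abstract does not describe here. The names `patchify` and `tokenize`, the patch size of 16, the 224x224 input, and the embedding width of 192 are all assumptions chosen for the example.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into non-overlapping, flattened patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # Separate the grid axes from the within-patch axes.
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    # Reorder to (grid_row, grid_col, patch_row, patch_col, channel).
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch, in row-major order over the patch grid.
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def tokenize(image, patch_size, projection):
    """Map each flattened raw-pixel patch to a token with one linear projection."""
    return patchify(image, patch_size) @ projection


# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))
projection = rng.standard_normal((16 * 16 * 3, 192)) * 0.02

tokens = tokenize(image, 16, projection)
print(tokens.shape)  # (196, 192): a 14 x 14 grid of patches, each a 192-dim token
```

Each token here is computed directly from raw pixels with a single linear map, so no convolutional feature extraction happens before the Transformer sees the input.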

Taxonomy

Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Methods: Linear Layer · Convolution-enhanced image Transformer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Residual Connection · Layer Normalization · Adam · Dense Connections · Softmax