Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

Li Yuan; Yunpeng Chen; Tao Wang; Weihao Yu; Yujun Shi; Zihang Jiang; Francis EH Tay; Jiashi Feng; Shuicheng Yan

arXiv:2101.11986·cs.CV·December 1, 2021·32 cites

TL;DR

This paper introduces T2T-ViT, a vision transformer that models local image structure more effectively than vanilla ViT, halves its parameter count and MACs, and outperforms ResNets when trained from scratch on ImageNet.

Contribution

The paper proposes a novel Tokens-to-Token transformation and a deep-narrow backbone, improving training efficiency and accuracy of vision transformers on ImageNet.

Findings

01

T2T-ViT reduces parameters and MACs by half compared to vanilla ViT.

02

Achieves more than 3% higher accuracy than vanilla ViT when trained from scratch on ImageNet.

03

Outperforms ResNets and reaches accuracy comparable to MobileNets when trained directly on ImageNet.

Abstract

Transformers, which are popular for language modeling, have been explored for solving vision tasks recently, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a sequence of tokens with fixed length and then applies multiple Transformer layers to model their global relation for classification. However, ViT achieves inferior performance to CNNs when trained from scratch on a midsize dataset like ImageNet. We find it is because: 1) the simple tokenization of input images fails to model the important local structure such as edges and lines among neighboring pixels, leading to low training sample efficiency; 2) the redundant attention backbone design of ViT leads to limited feature richness for fixed computation budgets and limited training samples. To overcome such limitations, we propose a new Tokens-To-Token Vision Transformer (T2T-ViT),…
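The "simple tokenization" that the abstract attributes to ViT can be illustrated with a minimal sketch: the image is cut into non-overlapping patches and each patch is flattened into one token. The function name `hard_split_tokens`, the helper `make_image`, the 224x224x3 input, and the 16-pixel patch size are assumptions chosen for illustration and are not taken from this page. The sketch omits the learned linear projection and position encodings that follow in a real model, and it does not show the Tokens-to-Token transformation itself.

```python
def hard_split_tokens(image, patch):
    """Split an H x W x C image (nested lists) into non-overlapping patch tokens.

    Each token is the flattened pixel values of one patch, so every token has
    the same fixed length: patch * patch * C.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")

    tokens = []
    # Walk the patch grid row by row, left to right.
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    token.extend(image[y][x])  # append the C channel values
            tokens.append(token)
    return tokens


def make_image(height, width, channels=3):
    """Hypothetical test image whose values encode their own position."""
    return [
        [[(y * width + x) * channels + c for c in range(channels)]
         for x in range(width)]
        for y in range(height)
    ]


image = make_image(224, 224)
tokens = hard_split_tokens(image, patch=16)
print(len(tokens), len(tokens[0]))  # 196 768
```

Because the patches do not overlap, pixels on either side of a patch boundary never share a token, which is one way to read the abstract's point that this tokenization does not model local structure among neighboring pixels.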

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Tokens-To-Token Vision Transformer · Vision Transformer · Depthwise Convolution · Pointwise Convolution · Batch Normalization · Depthwise Separable Convolution