Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zihang Jiang, Francis EH Tay, Jiashi Feng, Shuicheng Yan

TL;DR
This paper introduces T2T-ViT, a vision transformer that models local image structure more effectively than vanilla ViT, roughly halves its parameter count and MACs, and outperforms ResNets when trained from scratch on ImageNet.
Contribution
The paper proposes a novel Tokens-to-Token transformation and a deep-narrow backbone, improving training efficiency and accuracy of vision transformers on ImageNet.
Findings
T2T-ViT reduces parameters and MACs by half compared to vanilla ViT.
It achieves more than 3% higher accuracy than vanilla ViT when trained from scratch on ImageNet.
It outperforms ResNets and reaches accuracy comparable to MobileNets.
Abstract
Transformers, which are popular for language modeling, have been explored for solving vision tasks recently, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a sequence of tokens with fixed length and then applies multiple Transformer layers to model their global relation for classification. However, ViT achieves inferior performance to CNNs when trained from scratch on a midsize dataset like ImageNet. We find it is because: 1) the simple tokenization of input images fails to model the important local structure such as edges and lines among neighboring pixels, leading to low training sample efficiency; 2) the redundant attention backbone design of ViT leads to limited feature richness for fixed computation budgets and limited training samples. To overcome such limitations, we propose a new Tokens-To-Token Vision Transformer (T2T-ViT),…
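The "simple tokenization" the abstract criticizes can be illustrated with a minimal sketch of ViT-style hard patch splitting. This is not the T2T module proposed in the paper; it only shows the baseline step in which an image is cut into non-overlapping, fixed-length tokens. The function name `patchify`, the nested-list image layout, and the toy patch size of 2 are assumptions made for illustration, and the learned linear projection and position encodings that follow in a real model are omitted.

```python
from typing import List

# Hypothetical image layout for this sketch: H x W x C nested lists.
Image = List[List[List[float]]]


def patchify(image: Image, patch: int) -> List[List[float]]:
    """Split an image into non-overlapping patch x patch tokens.

    Each token is the flattened pixel values of one patch, so every
    token has the same length: patch * patch * channels.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten the patch row by row, pixel by pixel.
            token = [
                value
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for value in pixel
            ]
            tokens.append(token)
    return tokens


# Toy 4 x 4 single-channel image with pixel values 0..15.
toy = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
tokens = patchify(toy, patch=2)
print(len(tokens), len(tokens[0]))  # 4 4
print(tokens[0])                    # [0.0, 1.0, 4.0, 5.0]
```

In this toy example, pixels 1 and 2 are horizontal neighbors in the image but fall into different tokens, so an edge crossing a patch boundary is never seen inside a single token. That is one way to read the abstract's point about local structure among neighboring pixels.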
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Tokens-To-Token Vision Transformer · Vision Transformer · Depthwise Convolution · Pointwise Convolution · Batch Normalization · Depthwise Separable Convolution
