All Tokens Matter: Token Labeling for Training Better Vision Transformers
Zihang Jiang, Qibin Hou, Li Yuan, Daquan Zhou, Yujun Shi, Xiaojie Jin, Anran Wang, Jiashi Feng

TL;DR
This paper introduces token labeling, a new training objective for vision transformers that leverages all patch tokens with location-specific supervision, significantly improving performance on ImageNet and downstream dense prediction tasks.
Contribution
The paper proposes token labeling, a dense supervision method for ViTs that computes a loss on every patch token rather than only on the class token, improving accuracy and generalization over standard class-token training.
Findings
Achieves 84.4% Top-1 accuracy on ImageNet with a 26M parameter ViT.
Scaling to 150M parameters increases accuracy to 86.4%.
Improves generalization to downstream dense prediction tasks such as semantic segmentation.
Abstract
In this paper, we present token labeling -- a new training objective for training high-performance vision transformers (ViTs). Different from the standard training objective of ViTs that computes the classification loss on an additional trainable class token, our proposed one takes advantage of all the image patch tokens to compute the training loss in a dense manner. Specifically, token labeling reformulates the image classification problem into multiple token-level recognition problems and assigns each patch token with an individual location-specific supervision generated by a machine annotator. Experiments show that token labeling can clearly and consistently improve the performance of various ViT models across a wide spectrum. For a vision transformer with 26M learnable parameters serving as an example, with token labeling, the model can achieve 84.4% Top-1 accuracy on ImageNet. The…
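The dense objective described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of the loss, so several things here are assumptions made for illustration: that the per-token losses are averaged, that they are added to the usual class-token loss, and that the two terms are balanced by a weight called beta (a hypothetical parameter name, with 0.5 as an arbitrary default). The function names are also hypothetical, and the token targets stand in for the location-specific labels produced by the machine annotator.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def soft_cross_entropy(logits, target):
    """Cross-entropy between a target distribution and softmax(logits)."""
    return -sum(t * lp for t, lp in zip(target, log_softmax(logits)))


def token_labeling_loss(cls_logits, cls_target, token_logits, token_targets, beta=0.5):
    """Class-token loss plus the mean per-patch-token loss (assumed form).

    cls_logits:    class scores predicted from the class token
    cls_target:    image-level label as a distribution over classes
    token_logits:  one list of class scores per patch token
    token_targets: one label distribution per patch token, standing in
                   for the machine annotator's location-specific output
    beta:          assumed weight on the dense term (hypothetical)
    """
    cls_loss = soft_cross_entropy(cls_logits, cls_target)
    token_losses = [
        soft_cross_entropy(logits, target)
        for logits, target in zip(token_logits, token_targets)
    ]
    dense_loss = sum(token_losses) / len(token_losses)
    return cls_loss + beta * dense_loss


if __name__ == "__main__":
    # Toy example: 3 classes, 2 patch tokens with different local labels.
    loss = token_labeling_loss(
        cls_logits=[2.0, 0.5, -1.0],
        cls_target=[1.0, 0.0, 0.0],
        token_logits=[[1.5, 0.2, -0.3], [0.1, 1.2, 0.0]],
        token_targets=[[0.9, 0.1, 0.0], [0.2, 0.8, 0.0]],
    )
    print(f"total loss: {loss:.4f}")
```

Setting beta to 0 in this sketch recovers ordinary class-token training, which makes the contrast drawn in the abstract concrete: the only change is the extra term that supervises every patch token.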
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Linear Layer · Residual Connection · Layer Normalization · LV-ViT · Softmax · Dense Connections · Attention Is All You Need · Vision Transformer
