All Tokens Matter: Token Labeling for Training Better Vision Transformers

Zihang Jiang; Qibin Hou; Li Yuan; Daquan Zhou; Yujun Shi; Xiaojie Jin; Anran Wang; Jiashi Feng

arXiv:2104.10858 · cs.CV · June 10, 2021 · 141 citations

TL;DR

This paper introduces token labeling, a new training objective for vision transformers that leverages all patch tokens with location-specific supervision, significantly improving performance on ImageNet and downstream dense prediction tasks.

Contribution

It proposes token labeling, a novel dense supervision method for ViTs, enhancing accuracy and generalization over traditional class token training.

Findings

01. Achieves 84.4% Top-1 accuracy on ImageNet with a 26M parameter ViT.
02. Scaling to 150M parameters increases accuracy to 86.4%.
03. Improves generalization on dense prediction tasks like semantic segmentation.

Abstract

In this paper, we present token labeling -- a new training objective for training high-performance vision transformers (ViTs). Different from the standard training objective of ViTs that computes the classification loss on an additional trainable class token, our proposed one takes advantage of all the image patch tokens to compute the training loss in a dense manner. Specifically, token labeling reformulates the image classification problem into multiple token-level recognition problems and assigns each patch token with an individual location-specific supervision generated by a machine annotator. Experiments show that token labeling can clearly and consistently improve the performance of various ViT models across a wide spectrum. For a vision transformer with 26M learnable parameters serving as an example, with token labeling, the model can achieve 84.4% Top-1 accuracy on ImageNet. The…
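The dense objective described in the abstract can be illustrated with a minimal sketch. It assumes the total loss is the class-token cross-entropy plus a weighted average of per-patch cross-entropies against soft, location-specific targets. The function names, the `aux_weight` parameter, and its default value are hypothetical choices for illustration and are not taken from the text above; this is not the authors' implementation.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def soft_cross_entropy(logits, target):
    """Cross-entropy between a (possibly soft) target distribution and softmax(logits)."""
    return -sum(t * lp for t, lp in zip(target, log_softmax(logits)))

def token_labeling_loss(cls_logits, cls_target, patch_logits, patch_targets,
                        aux_weight=0.5):
    """Class-token loss plus the mean per-patch loss.

    aux_weight is an assumed hyperparameter that balances the two terms.
    patch_targets stands in for the per-location labels produced by a
    machine annotator; here they are simply supplied by the caller.
    """
    cls_loss = soft_cross_entropy(cls_logits, cls_target)
    patch_loss = sum(
        soft_cross_entropy(logits, target)
        for logits, target in zip(patch_logits, patch_targets)
    ) / len(patch_logits)
    return cls_loss + aux_weight * patch_loss

# Toy example: 2 classes, 3 patch tokens, all logits zero (uniform predictions).
cls_logits = [0.0, 0.0]
cls_target = [1.0, 0.0]                     # image-level label
patch_logits = [[0.0, 0.0]] * 3
patch_targets = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]  # made-up soft token labels
loss = token_labeling_loss(cls_logits, cls_target, patch_logits, patch_targets)
```

With uniform predictions every cross-entropy term equals log 2 regardless of the target, so the toy loss is 1.5 · log 2 under the assumed weight of 0.5. In a real training setup the patch logits would come from a shared classifier head applied to each output patch token.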

Code & Models

Videos

All Tokens Matter: Token Labeling for Training Better Vision Transformers · SlidesLive

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Linear Layer · Residual Connection · Layer Normalization · LV-ViT · Softmax · Dense Connections · Attention Is All You Need · Vision Transformer