Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations
Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, Pengtao Xie

TL;DR
This paper introduces EViT, a method that reorganizes image tokens during training to focus on attentive tokens, reducing computation and enabling higher resolution inputs for Vision Transformers, with minimal accuracy loss.
Contribution
EViT is the first approach to reorganize tokens during training. It speeds up inference with minimal accuracy loss by preserving attentive tokens and fusing inattentive ones, without adding parameters.
Findings
50% faster inference speed with 0.3% accuracy drop on DeiT-S
Enables higher resolution inputs at same computational cost
Effective on standard benchmarks
Abstract
Vision Transformers (ViTs) take all the image patches as tokens and construct multi-head self-attention (MHSA) among them. Complete leverage of these image tokens brings redundant computations since not all the tokens are attentive in MHSA. Examples include that tokens containing semantically meaningless or distractive image backgrounds do not positively contribute to the ViT predictions. In this work, we propose to reorganize image tokens during the feed-forward process of ViT models, which is integrated into ViT during training. For each forward inference, we identify the attentive image tokens between MHSA and FFN (i.e., feed-forward network) modules, which is guided by the corresponding class token attention. Then, we reorganize image tokens by preserving attentive image tokens and fusing inattentive ones to expedite subsequent MHSA and FFN computations. To this end, our method EViT…
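The reorganization step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `reorganize_tokens`, the `keep_rate` parameter, the use of a single head-averaged class-token attention vector, and the attention-weighted average used for fusion are all assumptions made for illustration.

```python
import numpy as np


def reorganize_tokens(tokens, cls_attn, keep_rate=0.7):
    """Keep the most attentive image tokens and fuse the rest into one token.

    tokens:    (N, D) array of image token embeddings (class token excluded).
    cls_attn:  (N,) attention from the class token to each image token,
               assumed here to be already averaged over heads.
    keep_rate: hypothetical fraction of image tokens to preserve.
    """
    n = tokens.shape[0]
    k = max(1, int(np.ceil(keep_rate * n)))

    # Rank image tokens by descending class-token attention.
    order = np.argsort(-cls_attn, kind="stable")
    keep_idx = np.sort(order[:k])  # restore the original token order
    drop_idx = order[k:]

    kept = tokens[keep_idx]
    if drop_idx.size == 0:
        return kept, keep_idx

    # Fuse inattentive tokens into one token (assumed: attention-weighted mean).
    w = cls_attn[drop_idx]
    w = w / (w.sum() + 1e-12)
    fused = (w[:, None] * tokens[drop_idx]).sum(axis=0, keepdims=True)

    # Later MHSA and FFN layers then process k + 1 image tokens instead of N.
    return np.concatenate([kept, fused], axis=0), keep_idx
```

As a usage illustration under the same assumptions, calling `reorganize_tokens` on 196 image tokens with `keep_rate=0.7` keeps 138 tokens and appends one fused token, so subsequent layers see 139 image tokens rather than 196.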
Taxonomy
Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Brain Tumor Detection and Classification
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
