Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations

Youwei Liang; Chongjian Ge; Zhan Tong; Yibing Song; Jue Wang; Pengtao Xie

arXiv:2202.07800·cs.CV·April 15, 2022·95 cites

TL;DR

This paper introduces EViT, a method that reorganizes image tokens during training to focus on attentive tokens, reducing computation and enabling higher resolution inputs for Vision Transformers, with minimal accuracy loss.

Contribution

EViT is the first approach to reorganize tokens during training, improving inference speed and accuracy by focusing on attentive tokens without adding parameters.

Findings

01. 50% faster inference speed with 0.3% accuracy drop on DeiT-S

02. Enables higher resolution inputs at same computational cost

03. Effective on standard benchmarks

Abstract

Vision Transformers (ViTs) take all the image patches as tokens and construct multi-head self-attention (MHSA) among them. Complete leverage of these image tokens brings redundant computations since not all the tokens are attentive in MHSA. Examples include that tokens containing semantically meaningless or distractive image backgrounds do not positively contribute to the ViT predictions. In this work, we propose to reorganize image tokens during the feed-forward process of ViT models, which is integrated into ViT during training. For each forward inference, we identify the attentive image tokens between MHSA and FFN (i.e., feed-forward network) modules, which is guided by the corresponding class token attention. Then, we reorganize image tokens by preserving attentive image tokens and fusing inattentive ones to expedite subsequent MHSA and FFN computations. To this end, our method EViT…
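The reorganization step described in the abstract can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function name `reorganize_tokens`, the `keep_rate` parameter, and the choice to fuse inattentive tokens with a normalized attention-weighted average are all assumptions made for this example. The abstract only states that attentive tokens are preserved and inattentive ones are fused, guided by class token attention.

```python
import math


def reorganize_tokens(tokens, cls_attention, keep_rate):
    """Keep the most attentive image tokens and fuse the rest into one token.

    tokens        : list of N image-token vectors (lists of floats),
                    class token excluded.
    cls_attention : list of N non-negative scores; entry i is the attention
                    the class token pays to image token i (assumed to be
                    already averaged over heads).
    keep_rate     : fraction of image tokens to preserve, in (0, 1].

    Returns (new_tokens, keep_idx). Kept tokens are ordered by descending
    attention; a fused token is appended when any token was dropped.
    """
    n = len(tokens)
    if n == 0 or len(cls_attention) != n:
        raise ValueError("tokens and cls_attention must be non-empty and equal in length")
    if not 0.0 < keep_rate <= 1.0:
        raise ValueError("keep_rate must be in (0, 1]")

    # Number of tokens to keep; the small epsilon guards against float
    # round-off such as 0.7 * 10 evaluating slightly above 7.
    k = max(1, math.ceil(keep_rate * n - 1e-9))

    # Rank image tokens by class-token attention, most attentive first.
    order = sorted(range(n), key=lambda i: cls_attention[i], reverse=True)
    keep_idx, drop_idx = order[:k], order[k:]
    kept = [tokens[i] for i in keep_idx]
    if not drop_idx:
        return kept, keep_idx

    # Assumption: fuse inattentive tokens with attention weights that are
    # normalized to sum to 1 (uniform weights if all scores are zero).
    total = sum(cls_attention[i] for i in drop_idx)
    if total > 0:
        weights = [cls_attention[i] / total for i in drop_idx]
    else:
        weights = [1.0 / len(drop_idx)] * len(drop_idx)

    dim = len(tokens[0])
    fused = [
        sum(w * tokens[i][d] for w, i in zip(weights, drop_idx))
        for d in range(dim)
    ]
    return kept + [fused], keep_idx


if __name__ == "__main__":
    toy_tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
    toy_attention = [0.4, 0.1, 0.2, 0.3]
    new_tokens, kept = reorganize_tokens(toy_tokens, toy_attention, keep_rate=0.5)
    print(kept)             # indices of the preserved tokens
    print(len(new_tokens))  # preserved tokens plus one fused token
```

With four tokens and a keep rate of 0.5, the later MHSA and FFN layers would process three image tokens (two kept, one fused) instead of four, which is where the computational saving comes from.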

Taxonomy

Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Brain Tumor Detection and Classification

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings