Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers
Sifan Long, Zhen Zhao, Jimin Pi, Shengsheng Wang, Jingdong Wang

TL;DR
This paper introduces a novel token pruning method for vision transformers that considers both token importance and diversity, achieving significant computational savings with minimal accuracy loss or even slight improvements.
Contribution
It proposes a simple yet effective token decoupling and merging approach that jointly preserves important local tokens and maintains global token diversity.
Findings
Reduces FLOPs by 35% on DeiT-S with only 0.2% accuracy drop.
Improves DeiT-T accuracy by 0.1% while reducing FLOPs by 40%.
Maintains a promising balance between model complexity and classification performance.
Abstract
Vision transformers have achieved significant improvements on various vision tasks but their quadratic interactions between tokens significantly reduce computational efficiency. Many pruning methods have been proposed to remove redundant tokens for efficient vision transformers recently. However, existing studies mainly focus on the token importance to preserve local attentive tokens but completely ignore the global token diversity. In this paper, we emphasize the cruciality of diverse global semantics and propose an efficient token decoupling and merging method that can jointly consider the token importance and diversity for token pruning. According to the class token attention, we decouple the attentive and inattentive tokens. In addition to preserving the most discriminative local tokens, we merge similar inattentive tokens and match homogeneous attentive tokens to maximize the token…
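The decoupling and merging steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function name `decouple_and_merge`, the parameters `num_keep` and `num_clusters`, and the choice of cluster centers are assumptions made for illustration only. The sketch ranks patch tokens by class-token attention, keeps the top `num_keep` as attentive tokens, and merges the remaining inattentive tokens by averaging those that are most cosine-similar to the same center. The matching of homogeneous attentive tokens mentioned in the abstract is omitted.

```python
import numpy as np


def decouple_and_merge(tokens, cls_attn, num_keep, num_clusters):
    """Illustrative token reduction (hypothetical helper, not the paper's code).

    tokens:       (N, D) patch tokens, class token excluded
    cls_attn:     (N,) attention from the class token to each patch token
    num_keep:     number of attentive tokens preserved unchanged
    num_clusters: upper bound on the number of merged inattentive tokens
    """
    # Decouple: sort tokens by class-token attention, highest first.
    order = np.argsort(-cls_attn, kind="stable")
    attentive = tokens[order[:num_keep]]
    inattentive = tokens[order[num_keep:]]
    if len(inattentive) == 0:
        return attentive

    # Assumption: the k highest-attention inattentive tokens act as centers.
    k = min(num_clusters, len(inattentive))
    centers = inattentive[:k]

    def unit(x):
        # Row-wise L2 normalization; the epsilon guards against zero vectors.
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)

    # Assign every inattentive token to its most cosine-similar center.
    similarity = unit(inattentive) @ unit(centers).T  # shape (M, k)
    assignment = similarity.argmax(axis=1)

    # Merge: average the members of each non-empty cluster into one token.
    merged = np.stack([
        inattentive[assignment == c].mean(axis=0)
        for c in range(k)
        if np.any(assignment == c)
    ])
    return np.concatenate([attentive, merged], axis=0)


rng = np.random.default_rng(0)
demo_tokens = rng.normal(size=(6, 8))
demo_attn = np.array([0.05, 0.30, 0.10, 0.25, 0.20, 0.10])
reduced = decouple_and_merge(demo_tokens, demo_attn, num_keep=3, num_clusters=2)
print(reduced.shape)
```

In this toy run, six tokens are reduced to three preserved attentive tokens plus at most two merged tokens, so later layers would attend over fewer tokens while still carrying a summary of the discarded regions.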
Taxonomy
Topics: Advanced Neural Network Applications · CCD and CMOS Imaging Sensors · Visual Attention and Saliency Detection
Methods: Pruning
