Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention
Xiangcheng Liu, Tianyi Wu, Guodong Guo

TL;DR
This paper introduces an adaptive token pruning method for Vision Transformers that discards unimportant tokens per input image, raising DeiT-S throughput by 50% at a cost of only 0.2% top-1 accuracy.
Contribution
The paper proposes a learnable, threshold-based token pruning framework that adapts the number of retained tokens to each input, balancing accuracy against computational complexity at inference time.
Findings
Increases DeiT-S throughput by 50%
Reduces top-1 accuracy by only 0.2%
Outperforms previous pruning methods on the accuracy-latency trade-off
Abstract
The Vision Transformer (ViT) has emerged as a new paradigm in computer vision, showing excellent performance but at a high computational cost. Image token pruning is one of the main approaches to ViT compression, because the complexity is quadratic in the number of tokens and many tokens that contain only background regions do not truly contribute to the final prediction. Existing works either rely on additional modules to score the importance of individual tokens or apply a fixed-ratio pruning strategy to all input instances. In this work, we propose an adaptive sparse token pruning framework with minimal cost. Specifically, we first propose an inexpensive class attention scoring mechanism weighted by attention head importance. Then, learnable parameters are inserted as thresholds to distinguish informative tokens from unimportant ones. By…
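The two steps named in the abstract, scoring tokens by head-weighted class attention and then comparing the scores against a threshold, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `head_weights` input (a stand-in for whatever head-importance measure the paper uses), and the plain float `threshold` (a stand-in for a learned parameter) are all assumptions made for illustration.

```python
import numpy as np

def class_attention_scores(attn, head_weights):
    """Score image tokens by the attention the class token pays to them.

    attn:         (heads, tokens, tokens) softmax attention maps of one layer,
                  with token 0 assumed to be the class token.
    head_weights: (heads,) non-negative importance per head (hypothetical input;
                  how the paper derives head importance is not shown here).
    Returns one score per image token, shape (tokens - 1,).
    """
    cls_to_patches = attn[:, 0, 1:]                 # (heads, tokens - 1)
    weights = head_weights / head_weights.sum()     # normalise to sum to 1
    return weights @ cls_to_patches                 # weighted mean over heads

def prune_tokens(tokens, scores, threshold):
    """Keep the class token plus every image token whose score >= threshold.

    tokens:    (tokens, dim) token embeddings, class token first.
    threshold: scalar standing in for a learned per-layer threshold (assumption).
    Returns the kept embeddings and the boolean keep-mask over image tokens.
    """
    keep = scores >= threshold
    kept = np.concatenate([tokens[:1], tokens[1:][keep]], axis=0)
    return kept, keep

# Illustrative shapes only: 6 heads, 1 class token + 196 patches, width 384.
rng = np.random.default_rng(0)
heads, n_tokens, dim = 6, 197, 384
logits = rng.normal(size=(heads, n_tokens, n_tokens))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
tokens = rng.normal(size=(n_tokens, dim))

scores = class_attention_scores(attn, head_weights=np.ones(heads))
kept, keep = prune_tokens(tokens, scores, threshold=float(np.median(scores)))
print(kept.shape, int(keep.sum()))
```

Because the number of kept tokens depends on how many scores clear the threshold, it varies from image to image, which is what distinguishes this from fixed-ratio pruning. Since self-attention cost grows quadratically with token count, halving the tokens shrinks each attention map to roughly a quarter of its original size.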
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Generative Adversarial Networks and Image Synthesis · Advanced Image Processing Techniques
Methods: Pruning · Class Attention
