No Token Left Behind: Efficient Vision Transformer via Dynamic Token Idling
Xuwei Xu, Changlin Li, Yudong Chen, Xiaojun Chang, Jiajun Liu, Sen Wang

TL;DR
IdleViT is a dynamic token idling method for Vision Transformers. In each layer it computes on only a selected subset of tokens and leaves the rest idle, so idle tokens remain available for re-selection in later layers. A token cut loss improves token selection, and the method reduces computational complexity with minimal accuracy loss.
Contribution
This paper proposes IdleViT, a novel dynamic token idling approach with a token cut loss, enabling efficient ViT inference without permanently dropping tokens, unlike prior pruning methods.
Findings
Reduces ViT complexity by up to 33% with only 0.2% accuracy loss.
Outperforms state-of-the-art EViT at a 0.5 keep ratio.
Achieves faster inference speed with minimal performance impact.
Abstract
Vision Transformers (ViTs) have demonstrated outstanding performance in computer vision tasks, yet their high computational complexity prevents their deployment in computing resource-constrained environments. Various token pruning techniques have been introduced to alleviate the high computational burden of ViTs by dynamically dropping image tokens. However, some undesirable pruning at early stages may result in permanent loss of image information in subsequent layers, consequently hindering model performance. To address this problem, we propose IdleViT, a dynamic token-idle-based method that achieves an excellent trade-off between performance and efficiency. Specifically, in each layer, IdleViT selects a subset of the image tokens to participate in computations while keeping the rest of the tokens idle and directly passing them to this layer's output. By allowing the idle tokens to be…
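The per-layer idling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `idle_layer`, the caller-supplied `scores`, and the placeholder `layer_fn` are all hypothetical, and the token cut loss is not modeled. The sketch only shows that idle tokens pass through unchanged and can be selected again by a later layer.

```python
import math
from typing import Callable, List, Sequence, Tuple

Token = List[float]


def idle_layer(
    tokens: Sequence[Token],
    scores: Sequence[float],
    keep_ratio: float,
    layer_fn: Callable[[List[Token]], List[Token]],
) -> Tuple[List[Token], List[int]]:
    """Run layer_fn on the top-scoring tokens; pass idle tokens through unchanged.

    `scores` and `layer_fn` are placeholders (assumptions for illustration):
    any per-token importance score and any token-wise layer can be plugged in.
    """
    n = len(tokens)
    k = max(1, math.ceil(keep_ratio * n))
    # Indices of the k highest-scoring tokens, restored to their original order.
    selected = sorted(sorted(range(n), key=lambda i: scores[i], reverse=True)[:k])
    updated = layer_fn([list(tokens[i]) for i in selected])
    # Idle tokens are copied straight to the output, so nothing is dropped.
    out = [list(t) for t in tokens]
    for i, new_token in zip(selected, updated):
        out[i] = new_token
    return out, selected


def double(selected_tokens: List[Token]) -> List[Token]:
    """Stand-in for a transformer block: doubles every feature."""
    return [[2 * x for x in t] for t in selected_tokens]


tokens = [[1.0], [2.0], [3.0], [4.0]]
# Layer 1 selects tokens 0 and 2; tokens 1 and 3 stay idle.
out1, sel1 = idle_layer(tokens, [0.9, 0.1, 0.8, 0.2], 0.5, double)
# Layer 2 re-selects the tokens that were idle in layer 1.
out2, sel2 = idle_layer(out1, [0.1, 0.9, 0.2, 0.8], 0.5, double)
print(sel1, out1)  # [0, 2] [[2.0], [2.0], [6.0], [4.0]]
print(sel2, out2)  # [1, 3] [[2.0], [4.0], [6.0], [8.0]]
```

The token count is the same before and after each layer, which is the difference from pruning: a token skipped in one layer is still present for the next selection.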
Taxonomy
Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · CCD and CMOS Imaging Sensors
Methods: Pruning
