Vote&Mix: Plug-and-Play Token Reduction for Efficient Vision Transformer
Shuai Peng, Di Fu, Baole Wei, Yong Cao, Liangcai Gao, Zhi Tang

TL;DR
Vote&Mix (VoMix) is a plug-and-play, training-free token reduction technique for Vision Transformers that enhances computational efficiency by identifying and mixing similar tokens, significantly boosting speed with minimal accuracy loss.
Contribution
VoMix introduces a novel, parameter-free, layer-wise token similarity voting mechanism for efficient token reduction in ViTs without requiring retraining.
Findings
2x throughput increase for ViT-H on ImageNet-1K
2.4x throughput increase for ViT-L on Kinetics-400
Only 0.3% accuracy drop with speed improvements
Abstract
Despite the remarkable success of Vision Transformers (ViTs) in various visual tasks, they are often hindered by substantial computational cost. In this work, we introduce Vote&Mix (VoMix), a plug-and-play and parameter-free token reduction method, which can be readily applied to off-the-shelf ViT models without any training. VoMix tackles the computational redundancy of ViTs by identifying tokens with high homogeneity through a layer-wise token similarity voting mechanism. Subsequently, the selected tokens are mixed into the retained set, thereby preserving visual information. Experiments demonstrate that VoMix significantly improves the speed-accuracy tradeoff of ViTs on both images and videos. Without any training, VoMix achieves a 2x increase in throughput of existing ViT-H on ImageNet-1K and a 2.4x increase in throughput of existing ViT-L on Kinetics-400…
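The abstract describes two steps per layer: a similarity vote that picks out homogeneous tokens, then a mix step that folds them into the tokens that remain. The sketch below is one possible reading of that description, not the authors' implementation. The function name vote_and_mix, the parameter r (number of tokens removed), the use of cosine similarity, the "each token votes for its nearest neighbor" rule, the choice to select the most-voted tokens, and the size-weighted averaging used for mixing are all assumptions made for illustration.

```python
import numpy as np


def vote_and_mix(tokens, r):
    """Reduce n tokens to n - r tokens.

    Illustrative sketch only: the voting and mixing rules here are
    assumptions, not the rules defined in the VoMix paper.
    """
    n, _ = tokens.shape
    if not 0 < r < n:
        raise ValueError("r must satisfy 0 < r < n")

    # Cosine similarity between every pair of tokens.
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-12)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a token may not vote for itself

    # Vote: each token casts one vote for its most similar other token.
    votes = np.bincount(sim.argmax(axis=1), minlength=n)

    # Assumption: the r most-voted tokens are the "homogeneous" ones
    # that get selected; the rest form the retained set.
    order = np.argsort(-votes, kind="stable")
    selected = np.sort(order[:r])
    retained = np.sort(order[r:])

    # Mix: each selected token is averaged into its most similar
    # retained token, so its information is not simply discarded.
    target = (unit[selected] @ unit[retained].T).argmax(axis=1)
    mixed = tokens[retained].astype(float)
    counts = np.ones(len(retained))
    np.add.at(mixed, target, tokens[selected])
    np.add.at(counts, target, 1.0)
    return mixed / counts[:, None], counts


# Usage: 16 tokens of width 8, remove 4 of them.
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
out, counts = vote_and_mix(x, 4)
print(out.shape)  # (12, 8)
```

Because the function has no learned weights, it could in principle be dropped between the blocks of a pretrained model, which is the sense in which the abstract calls the method plug-and-play and parameter-free. The returned counts record how many original tokens each output token represents.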
Taxonomy
Topics: CCD and CMOS Imaging Sensors · Image Processing Techniques and Applications · Infrared Target Detection Methodologies
