Vote&Mix: Plug-and-Play Token Reduction for Efficient Vision Transformer

Shuai Peng; Di Fu; Baole Wei; Yong Cao; Liangcai Gao; Zhi Tang

arXiv:2408.17062 · cs.CV · September 2, 2024

Open Access

TL;DR

Vote&Mix (VoMix) is a plug-and-play, training-free token reduction technique for Vision Transformers. It cuts computation by identifying highly similar tokens through voting and mixing them into the retained tokens, raising throughput by 2x for ViT-H on ImageNet-1K and 2.4x for ViT-L on Kinetics-400 with a reported accuracy drop of only 0.3%.

Contribution

VoMix introduces a novel, parameter-free, layer-wise token similarity voting mechanism for efficient token reduction in ViTs without requiring retraining.

Findings

01. 2x throughput increase for ViT-H on ImageNet-1K
02. 2.4x throughput increase for ViT-L on Kinetics-400
03. Only 0.3% accuracy drop with speed improvements

Abstract

Despite the remarkable success of Vision Transformers (ViTs) in various visual tasks, they are often hindered by substantial computational cost. In this work, we introduce Vote&Mix (VoMix), a plug-and-play and parameter-free token reduction method, which can be readily applied to off-the-shelf ViT models without any training. VoMix tackles the computational redundancy of ViTs by identifying tokens with high homogeneity through a layer-wise token similarity voting mechanism. Subsequently, the selected tokens are mixed into the retained set, thereby preserving visual information. Experiments demonstrate VoMix significantly improves the speed-accuracy tradeoff of ViTs on both images and videos. Without any training, VoMix achieves a 2× increase in throughput of existing ViT-H on ImageNet-1K and a 2.4× increase in throughput of existing ViT-L on Kinetics-400…
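The vote-then-mix idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `vote_and_mix`, the use of cosine similarity, the one-vote-per-token rule, and the softmax mixing weights are all assumptions made for illustration, and the paper's actual similarity measure and mixing formula may differ.

```python
import numpy as np


def vote_and_mix(tokens, r):
    """Reduce an (N, D) token array to (N - r, D).

    Illustrative sketch only; every design choice here is an assumption,
    not the method as specified in the paper.
    """
    n = tokens.shape[0]
    if not 0 < r < n:
        raise ValueError("r must satisfy 0 < r < number of tokens")

    # Cosine similarity between all token pairs (assumed similarity measure).
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a token may not vote for itself

    # Vote: each token casts one vote for its most similar other token.
    votes = np.bincount(sim.argmax(axis=1), minlength=n)

    # Select the r most-voted tokens, treated here as the most homogeneous.
    order = np.argsort(-votes, kind="stable")
    selected = np.sort(order[:r])
    retained = np.sort(order[r:])

    # Mix: each retained token becomes a softmax-weighted average of itself
    # (logit 1.0, its own cosine similarity) and the selected tokens.
    sim_rs = sim[np.ix_(retained, selected)]            # (N - r, r)
    logits = np.concatenate([np.ones((len(retained), 1)), sim_rs], axis=1)
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)       # rows sum to 1

    return weights[:, :1] * tokens[retained] + weights[:, 1:] @ tokens[selected]
```

Calling such a function once per transformer layer with a fixed `r` would shrink the token sequence layer by layer, which is the sense in which the reduction is "layer-wise". A real ViT would likely need extra care that this sketch omits, for example keeping the class token out of the selected set.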

Taxonomy

Topics: CCD and CMOS Imaging Sensors · Image Processing Techniques and Applications · Infrared Target Detection Methodologies