Token Pooling in Vision Transformers
Dmitrii Marin, Jen-Hao Rick Chang, Anurag Ranjan, Anish Prabhu, Mohammad Rastegari, Oncel Tuzel

TL;DR
This paper introduces Token Pooling, a novel token downsampling method for vision transformers that reduces computational costs by exploiting redundancies, achieving similar accuracy with significantly fewer computations.
Contribution
The paper proposes Token Pooling, a new token downsampling technique that approximates tokens through clustering, improving the cost-accuracy trade-off in vision transformers.
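To make the idea of approximating tokens through clustering concrete, here is a minimal sketch that downsamples a token set by replacing it with cluster centers. It uses plain K-means as a stand-in, since the summary above only says "clustering"; the function name `kmeans_pool_tokens`, its parameters, and the token shapes are hypothetical and chosen for illustration only.

```python
import numpy as np

def kmeans_pool_tokens(tokens, num_clusters, num_iters=10, seed=0):
    """Illustrative token downsampling: replace N tokens by K cluster centers.

    Assumption: plain K-means is used here as one possible clustering
    choice; this is not claimed to be the paper's exact algorithm.
    tokens: array of shape (N, D). Returns (centers (K, D), assignment (N,)).
    """
    rng = np.random.default_rng(seed)
    n, _ = tokens.shape
    # Initialise centers from K distinct input tokens.
    centers = tokens[rng.choice(n, size=num_clusters, replace=False)].copy()

    def assign_to_nearest(current_centers):
        # Squared Euclidean distance from every token to every center: (N, K).
        diffs = tokens[:, None, :] - current_centers[None, :, :]
        return (diffs ** 2).sum(axis=-1).argmin(axis=1)

    for _ in range(num_iters):
        assignment = assign_to_nearest(centers)
        for k in range(num_clusters):
            members = tokens[assignment == k]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[k] = members.mean(axis=0)

    # Final assignment, consistent with the returned centers.
    return centers, assign_to_nearest(centers)

# Example: reduce 196 patch tokens of width 64 to 49 pooled tokens.
example_tokens = np.random.default_rng(1).normal(size=(196, 64))
pooled, assignment = kmeans_pool_tokens(example_tokens, num_clusters=49)
```

Every layer after the pooling step then processes 49 tokens instead of 196, which is why this kind of downsampling lowers the cost of the fully-connected layers as well as the attention layers.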
Findings
Token Pooling reduces computation by 42% on DeiT with no accuracy loss.
Softmax-attention acts as a low-pass filter, enabling effective token redundancy pruning.
Token Pooling outperforms prior downsampling methods in efficiency and accuracy.
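The low-pass finding can be partly illustrated with a small sketch. Each softmax-attention output is a convex combination of the value vectors, so it is an average that cannot leave the range spanned by the values. This shows only that averaging property, not the paper's full filtering result, which holds under assumptions not listed in this summary. The function `softmax_attention` and the random inputs are hypothetical illustrations.

```python
import numpy as np

def softmax_attention(q, k, v):
    """Single-head scaled dot-product attention (no masking, no dropout)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Subtract the row maximum for numerical stability before exponentiating.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(16, 8))
k = rng.normal(size=(16, 8))
v = rng.normal(size=(16, 8))
out, weights = softmax_attention(q, k, v)

# Because the weights are non-negative and sum to 1, every output
# coordinate is bounded by the min and max of that value coordinate.
within_range = bool(
    (out <= v.max(axis=0) + 1e-9).all() and (out >= v.min(axis=0) - 1e-9).all()
)
```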
Abstract
Despite the recent success in many applications, the high computational requirements of vision transformers limit their use in resource-constrained settings. While many existing methods improve the quadratic complexity of attention, in most vision transformers, self-attention is not the major computation bottleneck, e.g., more than 80% of the computation is spent on fully-connected layers. To improve the computational complexity of all layers, we propose a novel token downsampling method, called Token Pooling, efficiently exploiting redundancies in the images and intermediate token representations. We show that, under mild assumptions, softmax-attention acts as a high-dimensional low-pass (smoothing) filter. Thus, its output contains redundancy that can be pruned to achieve a better trade-off between the computational cost and accuracy. Our new technique accurately approximates a set of…
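The claim that fully-connected layers dominate the cost can be sanity-checked with back-of-envelope arithmetic. The sketch below counts multiply-accumulates in one standard transformer block: query, key, value, and output projections, a two-layer MLP, and the two attention matrix products. The block layout, the MLP expansion ratio of 4, and the values of 197 tokens and width 384 are assumptions for illustration, not figures taken from the text above.

```python
def linear_layer_fraction(num_tokens, dim, mlp_ratio=4):
    """Fraction of per-block multiply-accumulates spent in linear layers.

    Assumes a standard block: 4 projections of size dim x dim, an MLP
    with hidden width mlp_ratio * dim, and attention products QK^T and AV.
    """
    projections = 4 * num_tokens * dim * dim           # Q, K, V, output
    mlp = 2 * mlp_ratio * num_tokens * dim * dim       # two MLP layers
    attention = 2 * num_tokens * num_tokens * dim      # QK^T and AV
    linear = projections + mlp
    return linear / (linear + attention)

# Assumed example sizes (hypothetical): 197 tokens, embedding width 384.
fraction = linear_layer_fraction(num_tokens=197, dim=384)  # roughly 0.92
```

Under these assumptions the linear layers account for well over 80% of the block's cost, and their cost scales linearly with the number of tokens, which is consistent with the motivation for reducing the token count rather than only speeding up attention.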
Taxonomy
Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Advanced Image Fusion Techniques
Methods: Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Multi-Head Attention · Dropout · Attention Dropout · Feedforward Network · Data-efficient Image Transformer
