Token Pooling in Vision Transformers

Dmitrii Marin; Jen-Hao Rick Chang; Anurag Ranjan; Anish Prabhu; Mohammad Rastegari; Oncel Tuzel

arXiv:2110.03860 · cs.CV · February 28, 2023 · 1 citation

Open Access

TL;DR

This paper introduces Token Pooling, a token downsampling method for vision transformers that cuts computational cost by exploiting redundancy in images and intermediate token representations, reaching similar accuracy with significantly fewer computations.

Contribution

The paper proposes Token Pooling, a new token downsampling technique that approximates tokens through clustering, improving the cost-accuracy trade-off in vision transformers.
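To make "approximates tokens through clustering" concrete, here is a minimal sketch using plain K-means in standard-library Python. It is an illustration under stated assumptions, not the authors' implementation: the function names `pool_tokens` and `reconstruction_error`, the random initialization, the iteration count, and the toy 2-D tokens are all hypothetical choices made for this example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two token vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pool_tokens(tokens, k, iters=10, seed=0):
    """Approximate `tokens` with k cluster centers (plain K-means; an assumption)."""
    rng = random.Random(seed)
    # Initialize centers from k distinct input tokens (hypothetical choice).
    centers = [list(t) for t in rng.sample(tokens, k)]
    for _ in range(iters):
        # Assignment step: attach each token to its nearest center.
        groups = [[] for _ in range(k)]
        for t in tokens:
            nearest = min(range(k), key=lambda c: sq_dist(t, centers[c]))
            groups[nearest].append(t)
        # Update step: move each non-empty center to the mean of its group.
        for j, group in enumerate(groups):
            if group:
                dim = len(group[0])
                centers[j] = [sum(t[d] for t in group) / len(group) for d in range(dim)]
    return centers

def reconstruction_error(tokens, centers):
    """Mean squared distance from each token to its nearest center."""
    return sum(min(sq_dist(t, c) for c in centers) for t in tokens) / len(tokens)

# Toy input: 60 two-dimensional "tokens" scattered around three points.
rng = random.Random(1)
tokens = [[cx + rng.gauss(0, 0.05), cy + rng.gauss(0, 0.05)]
          for cx, cy in [(0, 0), (1, 1), (0, 1)] for _ in range(20)]
pooled = pool_tokens(tokens, k=3)  # 60 tokens summarized by 3 tokens
```

Every layer after the pooling step would then process `len(pooled)` tokens instead of `len(tokens)`, which is where the savings in both attention and fully-connected layers would come from.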

Findings

01. Token Pooling reduces computation by 42% on DeiT with no accuracy loss.

02. Softmax-attention acts as a low-pass filter, enabling effective token redundancy pruning.

03. Token Pooling outperforms prior downsampling methods in efficiency and accuracy.
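One elementary ingredient behind the smoothing claim in the second finding can be checked directly: softmax weights are positive and sum to one, so every attention output is a weighted average of the value vectors and cannot leave their per-coordinate range. The sketch below illustrates only this averaging property; it does not reproduce the paper's low-pass filter analysis. Using the raw tokens as queries, keys, and values (no learned projections) is an assumption made for brevity, and all function names are hypothetical.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)  # positive, sums to 1
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs

def coordinate_range(vectors, c):
    """(min, max) of coordinate c across a list of vectors."""
    column = [v[c] for v in vectors]
    return min(column), max(column)

# Toy self-attention: tokens serve as queries, keys, and values (an assumption).
rng = random.Random(0)
tokens = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(16)]
smoothed = attention(tokens, tokens, tokens)

in_ranges = [coordinate_range(tokens, c) for c in range(4)]
out_ranges = [coordinate_range(smoothed, c) for c in range(4)]
```

Comparing `in_ranges` with `out_ranges` shows each output coordinate bounded by the corresponding input range, which is consistent with attention averaging tokens rather than amplifying differences between them.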

Abstract

Despite the recent success in many applications, the high computational requirements of vision transformers limit their use in resource-constrained settings. While many existing methods improve the quadratic complexity of attention, in most vision transformers, self-attention is not the major computation bottleneck, e.g., more than 80% of the computation is spent on fully-connected layers. To improve the computational complexity of all layers, we propose a novel token downsampling method, called Token Pooling, efficiently exploiting redundancies in the images and intermediate token representations. We show that, under mild assumptions, softmax-attention acts as a high-dimensional low-pass (smoothing) filter. Thus, its output contains redundancy that can be pruned to achieve a better trade-off between the computational cost and accuracy. Our new technique accurately approximates a set of…
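The abstract's claim that fully-connected layers dominate the cost can be sanity-checked with back-of-the-envelope arithmetic. The sketch below counts multiply-accumulates for one transformer block and is only an approximation under stated assumptions: the function name `layer_macs` is hypothetical, the configuration (197 tokens, embedding width 384, MLP expansion ratio 4) is an assumed small-ViT setting rather than a number taken from this page, and layer norms, softmax, and biases are ignored.

```python
def layer_macs(n_tokens, dim, mlp_ratio=4):
    """Approximate multiply-accumulates for one transformer block.

    Returns (fully_connected, attention), where
      fully_connected = Q, K, V and output projections (4 * N * d^2)
                        plus the two MLP layers (2 * mlp_ratio * N * d^2)
      attention       = Q K^T and the weighted sum of values (2 * N^2 * d)
    """
    fully_connected = (4 + 2 * mlp_ratio) * n_tokens * dim * dim
    attention = 2 * n_tokens * n_tokens * dim
    return fully_connected, attention

# Assumed configuration: 14 x 14 patches plus a class token, width 384.
fc, attn = layer_macs(n_tokens=197, dim=384)
fc_share = fc / (fc + attn)

# Effect of roughly halving the token count on the total cost of the block.
full_cost = sum(layer_macs(197, 384))
half_cost = sum(layer_macs(98, 384))
cost_ratio = half_cost / full_cost
```

Under these assumptions `fc_share` comes out at roughly 0.92, in line with the "more than 80%" figure, and `cost_ratio` falls just below one half: the fully-connected cost shrinks linearly with the token count while the attention cost shrinks quadratically, so reducing tokens lowers the cost of every layer type.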

Taxonomy

Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Advanced Image Fusion Techniques

Methods: Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Multi-Head Attention · Dropout · Attention Dropout · Feedforward Network · Data-efficient Image Transformer