Adaptive Token Sampling For Efficient Vision Transformers

Mohsen Fayyaz; Soroush Abbasi Koohpayegani; Farnoush Rezaei Jafari; Sunando Sengupta; Hamid Reza Vaezi Joze; Eric Sommerlade; Hamed Pirsiavash; Juergen Gall

arXiv:2111.15667 · cs.CV · July 27, 2022


TL;DR

This paper introduces a differentiable, parameter-free Adaptive Token Sampler (ATS) that can be integrated into vision transformers to dynamically select significant tokens, reducing GFLOPs by about 2× while maintaining accuracy.

Contribution

The novel ATS module enables adaptive token sampling in vision transformers, improving efficiency without additional training or parameters, and can be added to existing models as a plug-and-play component.

Findings

01 Reduces GFLOPs by 2× on multiple vision transformers

02 Maintains state-of-the-art accuracy on ImageNet and Kinetics datasets

03 Can be integrated into pre-trained models without retraining

Abstract

While state-of-the-art vision transformer models achieve promising results in image classification, they are computationally expensive and require many GFLOPs. Although the GFLOPs of a vision transformer can be decreased by reducing the number of tokens in the network, there is no setting that is optimal for all input images. In this work, we therefore introduce a differentiable parameter-free Adaptive Token Sampler (ATS) module, which can be plugged into any existing vision transformer architecture. ATS empowers vision transformers by scoring and adaptively sampling significant tokens. As a result, the number of tokens is not constant anymore and varies for each input image. By integrating ATS as an additional layer within the current transformer blocks, we can convert them into much more efficient vision transformers with an adaptive number of tokens. Since ATS is a parameter-free…
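The abstract describes ATS as scoring tokens and then adaptively sampling the significant ones, so that the token count varies per image. The sketch below illustrates one way such a step could work. The abstract shown here is truncated and does not spell out the scoring or sampling rule, so everything in the code is an assumption: scores are taken as the class token's attention weight times the value-vector norm, and sampling is inverse-transform sampling on a fixed grid. The function names `significance_scores` and `sample_tokens` are hypothetical and are not taken from the official repository.

```python
import bisect


def significance_scores(cls_attention, value_norms):
    """Score each non-class token (assumed rule, not confirmed by the abstract).

    cls_attention: attention weights from the class token to each other token.
    value_norms:   norm of each token's value vector.
    Returns scores normalized to sum to 1.
    """
    raw = [a * n for a, n in zip(cls_attention, value_norms)]
    total = sum(raw)
    return [r / total for r in raw]


def sample_tokens(scores, max_tokens):
    """Pick at most `max_tokens` token indices by inverse-transform sampling.

    A fixed grid of points in (0, 1) is mapped through the cumulative
    distribution of the scores. Several grid points can land on the same
    token, and duplicates are merged, so the number of kept tokens adapts
    to how concentrated the scores are.
    """
    cdf = []
    running = 0.0
    for s in scores:
        running += s
        cdf.append(running)

    selected = set()
    for k in range(max_tokens):
        u = (2 * k + 1) / (2 * max_tokens)  # midpoint of the k-th bin
        idx = bisect.bisect_left(cdf, u)    # first index with cdf[idx] >= u
        selected.add(min(idx, len(cdf) - 1))  # clamp against rounding error
    return sorted(selected)


if __name__ == "__main__":
    # One dominant token: all grid points collapse onto it.
    peaked = significance_scores([0.9, 0.05, 0.03, 0.02], [1.0, 1.0, 1.0, 1.0])
    print(sample_tokens(peaked, max_tokens=4))

    # Uniform scores: every token is kept.
    flat = significance_scores([0.25, 0.25, 0.25, 0.25], [2.0, 2.0, 2.0, 2.0])
    print(sample_tokens(flat, max_tokens=4))
```

In this sketch `max_tokens` is only an upper bound. An image whose scores concentrate on a few tokens keeps fewer tokens than one with evenly spread scores, which matches the abstract's statement that the number of tokens is no longer constant. No learned parameters are involved, and the class token itself is assumed to be kept separately.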


Code & Models

Repositories

adaptivetokensampling/ATS (PyTorch, official)


Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Residual Connection · Dense Connections · Layer Normalization · Vision Transformer