TL;DR
This paper introduces a differentiable, parameter-free Adaptive Token Sampler (ATS) that can be integrated into vision transformers to select the significant tokens for each input image dynamically, cutting computational cost (GFLOPs) by about 2X while maintaining accuracy.
Contribution
The ATS module scores tokens and samples an input-dependent number of them, so the token count varies per image. Because it adds no parameters, it can be dropped into existing vision transformers as a plug-and-play layer, including pre-trained models without retraining.
Findings
Reduces GFLOPs by 2X on multiple vision transformers
Maintains state-of-the-art accuracy on ImageNet and Kinetics datasets
Can be integrated into pre-trained models without retraining
Abstract
While state-of-the-art vision transformer models achieve promising results in image classification, they are computationally expensive and require many GFLOPs. Although the GFLOPs of a vision transformer can be decreased by reducing the number of tokens in the network, there is no setting that is optimal for all input images. In this work, we therefore introduce a differentiable parameter-free Adaptive Token Sampler (ATS) module, which can be plugged into any existing vision transformer architecture. ATS empowers vision transformers by scoring and adaptively sampling significant tokens. As a result, the number of tokens is not constant anymore and varies for each input image. By integrating ATS as an additional layer within the current transformer blocks, we can convert them into much more efficient vision transformers with an adaptive number of tokens. Since ATS is a parameter-free…
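The scoring-and-sampling idea in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not spell out the details, so everything specific here is an assumption: the function name `adaptive_token_sample`, the scoring rule (the class token's attention to each token, weighted by the norm of that token's value vector), and the use of inverse transform sampling at fixed, evenly spaced quantiles. It is meant only to show how a parameter-free step can return a different number of tokens for different inputs, not to reproduce the authors' implementation.

```python
import numpy as np

def adaptive_token_sample(attn, values, max_tokens):
    """Hypothetical sketch of score-based adaptive token sampling.

    attn:       (N, N) attention matrix, row 0 belongs to the class token.
    values:     (N, D) value vectors for the same N tokens.
    max_tokens: upper bound on the number of non-class tokens to keep.
    Returns the indices of the kept tokens (index 0 is always kept).
    """
    # Assumed scoring rule: class-token attention times value norm.
    raw = attn[0, 1:] * np.linalg.norm(values[1:], axis=1)
    scores = raw / raw.sum()

    # Inverse transform sampling at fixed quantiles: for each quantile,
    # pick the first token whose cumulative score reaches it.
    cdf = np.cumsum(scores)
    quantiles = (np.arange(max_tokens) + 0.5) / max_tokens
    picked = np.searchsorted(cdf, quantiles, side="left")
    picked = np.minimum(picked, len(cdf) - 1)  # guard against rounding

    # Quantiles that land on the same token collapse into one index,
    # which is what makes the number of kept tokens input-dependent.
    kept = np.unique(picked) + 1  # shift past the class token
    return np.concatenate(([0], kept))

# Toy inputs: 1 class token + 4 patch tokens, identical value vectors.
values = np.ones((5, 2))
uniform_row = np.array([0.2, 0.2, 0.2, 0.2, 0.2])
peaked_row = np.array([0.0, 0.97, 0.01, 0.01, 0.01])

kept_uniform = adaptive_token_sample(np.tile(uniform_row, (5, 1)), values, 4)
kept_peaked = adaptive_token_sample(np.tile(peaked_row, (5, 1)), values, 4)
print(len(kept_uniform), len(kept_peaked))
```

In this toy setup, evenly spread attention keeps every patch token, while attention concentrated on a single patch collapses all quantiles onto that token, so fewer tokens are passed on. No learnable weights are involved, which is consistent with the parameter-free claim.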
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Residual Connection · Dense Connections · Layer Normalization · Vision Transformer
