ToSA: Token Selective Attention for Efficient Vision Transformers
Manish Kumar Singh, Rajeev Yasarla, Hong Cai, Mingu Lee, Fatih Porikli

TL;DR
ToSA is a token selective attention mechanism that lowers the computation cost of vision transformers by choosing, layer by layer, which tokens take part in self-attention while the remaining tokens skip the layer. Accuracy is maintained on both classification and dense prediction tasks.
Contribution
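The paper presents ToSA, a token selector that predicts the next layer's attention maps from the current ones and uses them to decide which tokens are attended and which bypass the layer. Because bypassed tokens are kept rather than dropped, computation falls while features for every image patch are preserved.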
Findings
Significantly reduces computation and memory costs in vision transformers.
Maintains high accuracy on ImageNet classification.
Achieves comparable depth estimation accuracy with lighter models.
Abstract
In this paper, we propose a novel token selective attention approach, ToSA, which can identify tokens that need to be attended as well as those that can skip a transformer layer. More specifically, a token selector parses the current attention maps and predicts the attention maps for the next layer, which are then used to select the important tokens that should participate in the attention operation. The remaining tokens simply bypass the next layer and are concatenated with the attended ones to re-form a complete set of tokens. In this way, we reduce the quadratic computation and memory costs as fewer tokens participate in self-attention while maintaining the features for all the image patches throughout the network, which allows it to be used for dense prediction tasks. Our experiments show that by applying ToSA, we can significantly reduce computation costs while maintaining accuracy…
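The select, attend, bypass, and re-form steps described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. Several parts are assumptions made for illustration: the predicted attention map is supplied as an input (the token selector network itself is not modeled), a token's importance is taken to be the total predicted attention it receives (column sum), attention uses identity Q/K/V projections with no residual connection, and the names `select_tokens` and `tosa_layer` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (a simplification)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    return out


def select_tokens(predicted_attn, keep):
    """Pick the `keep` tokens that receive the most predicted attention.

    Assumption: importance = column sum of the predicted attention map.
    Returns the chosen indices in their original order.
    """
    n = len(predicted_attn)
    importance = [sum(predicted_attn[i][j] for i in range(n)) for j in range(n)]
    ranked = sorted(range(n), key=lambda j: importance[j], reverse=True)
    return sorted(ranked[:keep])


def tosa_layer(tokens, predicted_attn, keep):
    """Attend only over the selected tokens; the rest bypass the layer unchanged."""
    selected = select_tokens(predicted_attn, keep)
    attended = self_attention([tokens[i] for i in selected])
    out = [list(t) for t in tokens]          # start from the bypass path
    for idx, new_tok in zip(selected, attended):
        out[idx] = new_tok                   # write attended tokens back in place
    return out, selected


# Toy example: 4 tokens of dimension 2, keep 2 of them.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
predicted_attn = [
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.5, 0.3, 0.1],
    [0.2, 0.4, 0.3, 0.1],
    [0.1, 0.5, 0.3, 0.1],
]
out, selected = tosa_layer(tokens, predicted_attn, keep=2)
# selected == [1, 2]; tokens 0 and 3 pass through unchanged.
```

In this toy case the attention step computes 2 × 2 = 4 pairwise scores instead of 4 × 4 = 16, which is where the quadratic saving comes from, while the output still contains all four tokens, as dense prediction requires.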
Taxonomy
Topics: Advanced Memory and Neural Computing · Advanced Neural Network Applications · CCD and CMOS Imaging Sensors
Methods: Sparse Evolutionary Training
