Global Interaction Modelling in Vision Transformer via Super Tokens
Ammarah Farooq, Muhammad Awais, Sara Ahmed, Josef Kittler

TL;DR
This paper introduces a novel vision transformer architecture using Super tokens for efficient local and global information modeling, achieving high accuracy with fewer parameters and improved throughput.
Contribution
It proposes a Super token-based transformer architecture that simplifies global communication and reduces parameter count while maintaining high accuracy.
Findings
Achieves 83.5% accuracy on ImageNet-1K.
Uses approximately half the parameters of Swin-B.
Offers improved inference throughput.
Abstract
With the popularity of Transformer architectures in computer vision, the research focus has shifted towards developing computationally efficient designs. Window-based local attention is one of the major techniques adopted in recent works. These methods begin with a very small patch size and small embedding dimensions and then perform strided convolution (patch merging) to reduce the feature map size and increase the embedding dimensions, thereby forming a pyramidal, Convolutional Neural Network (CNN)-like design. In this work, we investigate local and global information modelling in transformers by presenting a novel isotropic architecture that adopts local windows and special tokens, called Super tokens, for self-attention. Specifically, a single Super token is assigned to each image window and captures the rich local details of that window. These tokens are then employed for…
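The window-plus-Super-token idea described in the abstract can be illustrated with a rough NumPy sketch. This is not the authors' implementation: the function names (`window_partition`, `self_attention`, `super_token_step`), the single-head attention with identity query/key/value projections, and the residual updates are all assumptions made for brevity. It only shows the data flow: one Super token is attached to each window, attention runs locally inside each window, and then attention runs across the Super tokens alone so that windows can exchange information.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # Single-head attention over the second-to-last axis.
    # Assumption: identity Q/K/V projections (no learned weights).
    c = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(c)
    return softmax(scores) @ tokens

def window_partition(x, w):
    # (H, W, C) -> (num_windows, w*w, C); H and W must be divisible by w.
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, w * w, C)

def super_token_step(x, super_tokens, w):
    # x: (H, W, C) patch tokens; super_tokens: (num_windows, C).
    windows = window_partition(x, w)
    # Prepend one Super token to each window's token sequence.
    seq = np.concatenate([super_tokens[:, None, :], windows], axis=1)
    # Local step: attention restricted to each window (plus its Super token).
    seq = seq + self_attention(seq)
    sup, local = seq[:, 0, :], seq[:, 1:, :]
    # Global step: attention among Super tokens only.
    sup = sup + self_attention(sup)
    return local, sup

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))      # 8x8 grid of 16-dim patch tokens
super_tokens = np.zeros((4, 16))         # one Super token per 4x4 window
local, sup = super_token_step(x, super_tokens, w=4)
print(local.shape, sup.shape)
```

In this sketch the global step attends over only as many tokens as there are windows, which is far fewer than the number of patch tokens; that is the intuition behind using a small set of per-window tokens for global communication.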
Taxonomy
Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Advanced Memory and Neural Computing
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Softmax · Residual Connection · Stochastic Depth
