Global Interaction Modelling in Vision Transformer via Super Tokens

Ammarah Farooq; Muhammad Awais; Sara Ahmed; Josef Kittler

arXiv:2111.13156·cs.CV·November 29, 2021

TL;DR

This paper introduces an isotropic vision transformer that uses local windows and per-window Super tokens to model local and global information. It reaches 83.5% accuracy on ImageNet-1K with approximately half the parameters of Swin-B and improved inference throughput.

Contribution

It proposes a Super token-based transformer architecture that simplifies global communication and reduces parameter count while maintaining high accuracy.

Findings

01. Achieves 83.5% accuracy on ImageNet-1K.

02. Uses approximately half the parameters of Swin-B.

03. Offers improved inference throughput.

Abstract

With the popularity of Transformer architectures in computer vision, the research focus has shifted towards developing computationally efficient designs. Window-based local attention is one of the major techniques adopted in recent works. These methods begin with a very small patch size and small embedding dimensions, and then perform strided convolution (patch merging) to reduce the feature map size and increase the embedding dimensions, thus forming a pyramidal, Convolutional Neural Network (CNN)-like design. In this work, we investigate local and global information modelling in transformers by presenting a novel isotropic architecture that adopts local windows and special tokens, called Super tokens, for self-attention. Specifically, a single Super token is assigned to each image window and captures the rich local details of that window. These tokens are then employed for…
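The windowing idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names (`partition_windows`, `self_attention`, `super_token_block`) are hypothetical, the attention has no learned projections, heads, normalisation, residuals, or MLP, and the final step, in which the Super tokens attend to one another, is an assumption about how global communication might work, since the abstract is truncated at that point.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # Single-head attention with identity projections (illustration only).
    # tokens: (..., T, D) -> (..., T, D)
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return softmax(scores) @ tokens

def partition_windows(patches, window):
    # (H, W, D) grid of patch embeddings -> (num_windows, window*window, D)
    H, W, D = patches.shape
    assert H % window == 0 and W % window == 0
    x = patches.reshape(H // window, window, W // window, window, D)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, window * window, D)

def super_token_block(patches, super_tokens, window):
    # patches: (H, W, D); super_tokens: (num_windows, D), one per window.
    windows = partition_windows(patches, window)
    # Local step: each window attends over its own patches plus its Super token.
    local = np.concatenate([super_tokens[:, None, :], windows], axis=1)
    local = self_attention(local)
    new_super, new_windows = local[:, 0, :], local[:, 1:, :]
    # ASSUMED global step: Super tokens exchange information with each other.
    new_super = self_attention(new_super)
    return new_windows, new_super

rng = np.random.default_rng(0)
patches = rng.normal(size=(8, 8, 16))   # 8x8 patch grid, 16-dim embeddings
super_tokens = np.zeros((4, 16))        # 4 windows of size 4x4
out_windows, out_super = super_token_block(patches, super_tokens, window=4)
print(out_windows.shape, out_super.shape)
```

In this sketch the grid size and embedding dimension stay fixed from input to output, which is what "isotropic" refers to, in contrast to pyramidal designs that merge patches between stages.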

Taxonomy

Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Advanced Memory and Neural Computing

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Softmax · Residual Connection · Stochastic Depth