Efficient Attention-free Video Shift Transformers
Adrian Bulat, Brais Martinez, Georgios Tzimiropoulos

TL;DR
This paper introduces VAST, an attention-free video transformer using shift operations, achieving high efficiency and accuracy in video recognition tasks, outperforming existing models with lower computational costs.
Contribution
The paper presents VAST, the first attention-free shift-based video transformer. Its Affine-Shift block is designed to approximate the operations of a Transformer's MHSA block, and the model outperforms state-of-the-art models in efficiency and accuracy.
Findings
VAST outperforms recent transformers on action recognition benchmarks.
The Affine-Shift block achieves high accuracy with low computational cost.
VAST is the first purely shift-based video transformer.
Abstract
This paper tackles the problem of efficient video recognition. In this area, video transformers have recently dominated the efficiency (top-1 accuracy vs FLOPs) spectrum. At the same time, there have been some attempts in the image domain which challenge the necessity of the self-attention operation within the transformer architecture, advocating the use of simpler approaches for token mixing. However, there are no results yet for the case of video recognition, where the self-attention operator has a significantly higher impact (compared to the case of images) on efficiency. To address this gap, in this paper, we make the following contributions: (a) we construct a highly efficient and accurate attention-free block based on the shift operator, coined Affine-Shift block, specifically designed to approximate as closely as possible the operations in the MHSA block of a Transformer layer.…
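To make the idea of shift-based token mixing concrete, here is a minimal NumPy sketch. It is not the paper's Affine-Shift block, whose details are not given in the abstract above. It only illustrates a generic zero-padded channel shift over a video token grid, followed by a per-channel affine transform. The function names (`shift_along_axis`, `spatiotemporal_shift`, `affine_shift_sketch`), the `(T, H, W, C)` layout, the six-direction channel grouping, and the `fraction` parameter are all assumptions made for illustration.

```python
import numpy as np


def shift_along_axis(x, axis, step):
    """Shift x by `step` positions along `axis`, filling vacated slots with zeros."""
    out = np.zeros_like(x)
    src = [slice(None)] * x.ndim
    dst = [slice(None)] * x.ndim
    if step > 0:
        src[axis] = slice(0, -step)
        dst[axis] = slice(step, None)
    elif step < 0:
        src[axis] = slice(-step, None)
        dst[axis] = slice(0, step)
    out[tuple(dst)] = x[tuple(src)]
    return out


def spatiotemporal_shift(x, fraction=0.5):
    """Mix tokens without attention by shifting channel groups by one position.

    x has the assumed layout (T, H, W, C). A `fraction` of the channels is
    split into six equal groups, shifted by +1 or -1 along T, H, and W.
    The remaining channels are left untouched.
    """
    channels = x.shape[-1]
    group = int(channels * fraction) // 6
    out = x.copy()
    moves = [(0, 1), (0, -1), (1, 1), (1, -1), (2, 1), (2, -1)]
    for i, (axis, step) in enumerate(moves):
        ch = slice(i * group, (i + 1) * group)
        out[..., ch] = shift_along_axis(x[..., ch], axis, step)
    return out


def affine_shift_sketch(x, scale, bias):
    """Hypothetical combination: shift, then a per-channel scale and bias."""
    return spatiotemporal_shift(x) * scale + bias
```

The shift itself moves data between neighbouring tokens without any multiplications or learned parameters, which is why shift-style mixing is attractive when self-attention over space and time is the dominant cost.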
Taxonomy
Topics: Advanced Neural Network Applications · Anomaly Detection Techniques and Applications · Brain Tumor Detection and Classification
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Dense Connections · Residual Connection · Dropout · Softmax · Label Smoothing
