Token Shift Transformer for Video Classification

Hao Zhang; Yanbin Hao; Chong-Wah Ngo

arXiv:2108.02432·cs.CV·August 6, 2021

TL;DR

This paper introduces TokShift, a zero-parameter, zero-FLOPs module that models temporal relations in video transformers, achieving state-of-the-art results efficiently without convolutional operations.

Contribution

The paper proposes a novel TokShift module that enhances transformer-based video classification by modeling temporal relations efficiently without additional parameters or FLOPs.

Findings

1. Achieves state-of-the-art (SOTA) accuracy on the Kinetics-400, EGTEA-Gaze+, and UCF-101 datasets.

2. Stays efficient: the module adds no parameters and no FLOPs.

3. Models temporal relations in videos with a simple shift operation.

Abstract

Transformers have achieved remarkable success in understanding 1- and 2-dimensional signals (e.g., NLP and image content understanding). As a potential alternative to convolutional neural networks, the transformer offers strong interpretability, high discriminative power on hyper-scale data, and flexibility in processing inputs of varying length. However, its encoders contain computationally intensive operations such as pair-wise self-attention, which incur a heavy computational burden when applied to complex 3-dimensional video signals. This paper presents the Token Shift Module (TokShift), a novel, zero-parameter, zero-FLOPs operator for modeling temporal relations within each transformer encoder. Specifically, TokShift merely shifts partial [Class] token features back and forth along the temporal dimension across adjacent frames. Then, we densely plug the module into each encoder of a…
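The shift described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The function name `token_shift`, the `fold_div` parameter, the choice of which channel groups move in which direction, and the zero-filling at the first and last frames are all assumptions made for illustration; the abstract only states that part of the [Class] token features is shifted back and forth across adjacent frames.

```python
def token_shift(cls_tokens, fold_div=4):
    """Shift part of each frame's [Class] token along the time axis.

    cls_tokens: list of T per-frame [Class] token vectors, each of length C.
    fold_div:   hypothetical knob; C // fold_div channels move in each direction.
    Returns a new list of T vectors; the input is left unmodified.
    """
    num_frames = len(cls_tokens)
    num_channels = len(cls_tokens[0])
    fold = num_channels // fold_div

    shifted = []
    for t in range(num_frames):
        # Start from a copy, so channels beyond 2 * fold stay unchanged.
        row = list(cls_tokens[t])
        for c in range(fold):
            # First channel group: take the value from the previous frame.
            # Assumption: the first frame is zero-filled.
            row[c] = cls_tokens[t - 1][c] if t > 0 else 0.0
        for c in range(fold, 2 * fold):
            # Second channel group: take the value from the next frame.
            # Assumption: the last frame is zero-filled.
            row[c] = cls_tokens[t + 1][c] if t < num_frames - 1 else 0.0
        shifted.append(row)
    return shifted


# Toy input: 3 frames, 4 channels per [Class] token.
tokens = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
]
for frame in token_shift(tokens, fold_div=4):
    print(frame)
```

The sketch only moves existing values between frames. It has no learnable weights and performs no arithmetic on the features, which is consistent with the zero-parameter, zero-FLOPs description above.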

Taxonomy

Methods: Attention Is All You Need · Linear Layer · Softmax · Residual Connection · Multi-Head Attention · Layer Normalization · Dense Connections · Vision Transformer