TL;DR
This paper introduces TokShift, a zero-parameter, zero-FLOPs module that models temporal relations in video transformers, achieving state-of-the-art results efficiently without convolutional operations.
Contribution
The paper proposes a novel TokShift module that enhances transformer-based video classification by modeling temporal relations efficiently without additional parameters or FLOPs.
Findings
Achieves SOTA accuracy on Kinetics-400, EGTEA-Gaze+, and UCF-101 datasets.
Maintains high efficiency with zero additional computational cost.
Effectively models temporal relations in videos using a simple shift operation.
Abstract
Transformers have achieved remarkable success in understanding 1- and 2-dimensional signals (e.g., NLP and Image Content Understanding). As a potential alternative to convolutional neural networks, the transformer offers strong interpretability, high discriminative power on hyper-scale data, and flexibility in processing inputs of varying length. However, its encoders naturally contain computationally intensive operations such as pair-wise self-attention, incurring a heavy computational burden when applied to complex 3-dimensional video signals. This paper presents the Token Shift Module (i.e., TokShift), a novel, zero-parameter, zero-FLOPs operator for modeling temporal relations within each transformer encoder. Specifically, TokShift merely shifts part of the [Class] token features back and forth in time across adjacent frames. Then, we densely plug the module into each encoder of a…
Taxonomy
Methods: Attention Is All You Need · Linear Layer · Softmax · Residual Connection · Multi-Head Attention · Layer Normalization · Dense Connections · Vision Transformer
