Vision Transformer with Cross-attention by Temporal Shift for Efficient Action Recognition
Ryota Hashiguchi, Toru Tamaki

TL;DR
This paper introduces MSCA, a multi-head self/cross-attention mechanism for Vision Transformers that improves action recognition by letting some attention heads attend to neighboring frames, exploiting temporal information at the same computational cost.
Contribution
It proposes MSCA, a new attention structure that replaces MSA in ViT, fully utilizing temporal shifts for improved action recognition performance.
Findings
MSCA-KV outperforms TokenShift by 0.1% on Kinetics400.
MSCA improves over standard ViT by 1.2%.
The method maintains the same computational cost as existing models.
Abstract
Feature shifts have been shown to be useful for action recognition with CNN-based models since the Temporal Shift Module (TSM) was proposed. It is based on frame-wise feature extraction with late fusion, and layer features are shifted along the time direction for temporal interaction. TokenShift, a recent model based on Vision Transformer (ViT), also uses the temporal feature shift mechanism, which, however, does not fully exploit the structure of Multi-head Self-Attention (MSA) in ViT. In this paper, we propose Multi-head Self/Cross-Attention (MSCA), which fully utilizes the attention structure. TokenShift is based on a frame-wise ViT with features temporally shifted with successive frames (at time t+1 and t-1). In contrast, the proposed MSCA replaces MSA in the frame-wise ViT, and some MSA heads attend to successive frames instead of the current frame. The computation cost is the same…
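The idea that "some MSA heads attend to successive frames instead of the current frame" can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that the KV variant named in the findings shifts the keys and values of selected heads in time while the queries stay on the current frame. The function names (`msca_kv`, `shift_time`), the number of shifted heads, the zero-filling at the clip boundaries, and the omission of the output projection are all assumptions made for illustration.

```python
import numpy as np

def shift_time(a, step):
    """Return out with out[t] = a[t + step] along axis 0 (time).
    Frames that fall outside the clip are zero-filled (an assumption)."""
    out = np.zeros_like(a)
    T = a.shape[0]
    if step > 0:
        out[:T - step] = a[step:]
    elif step < 0:
        out[-step:] = a[:T + step]
    else:
        out[:] = a
    return out

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def msca_kv(x, w_q, w_k, w_v, num_heads, fwd_heads=1, bwd_heads=1):
    """Frame-wise multi-head attention where a few heads use K/V from
    the next (t+1) or previous (t-1) frame. Hypothetical sketch.

    x: (T, N, D) = frames, tokens per frame, embedding dim.
    w_q, w_k, w_v: (D, D) projection matrices.
    """
    T, N, D = x.shape
    d = D // num_heads

    def split(a):  # (T, N, D) -> (T, heads, N, d)
        return a.reshape(T, N, num_heads, d).transpose(0, 2, 1, 3)

    q = split(x @ w_q)
    k = split(x @ w_k).copy()
    v = split(x @ w_v).copy()

    f = slice(0, fwd_heads)                      # heads looking at frame t+1
    b = slice(fwd_heads, fwd_heads + bwd_heads)  # heads looking at frame t-1
    k[:, f], v[:, f] = shift_time(k[:, f], +1), shift_time(v[:, f], +1)
    k[:, b], v[:, b] = shift_time(k[:, b], -1), shift_time(v[:, b], -1)

    # Shifted heads compute cross-attention (current-frame queries against
    # neighboring-frame keys/values); the remaining heads are plain self-attention.
    attn = softmax(q @ k.transpose(0, 1, 3, 2) / np.sqrt(d))  # (T, heads, N, N)
    out = attn @ v                                            # (T, heads, N, d)
    return out.transpose(0, 2, 1, 3).reshape(T, N, D)         # concatenate heads
```

In this sketch the shift only re-indexes existing key and value tensors, so the attention matrices keep the same shapes as in the unshifted frame-wise case, which is consistent with the abstract's remark that the computation cost is the same. Setting `fwd_heads=0` and `bwd_heads=0` reduces the function to ordinary frame-wise multi-head self-attention.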
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Video Surveillance and Tracking Methods
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Dropout · Softmax · Layer Normalization · Label Smoothing · Byte Pair Encoding · Position-Wise Feed-Forward Layer
