Vision Transformer with Cross-attention by Temporal Shift for Efficient Action Recognition

Ryota Hashiguchi; Toru Tamaki

arXiv:2204.00452 · cs.CV · November 15, 2022 · 1 citation

Open Access

TL;DR

This paper introduces Multi-head Self/Cross-Attention (MSCA), an attention mechanism for Vision Transformers in which some heads attend to neighboring frames, improving action recognition by exploiting temporal information without increasing computational cost.

Contribution

The paper proposes MSCA, an attention structure that replaces MSA in a frame-wise ViT: some heads attend to successive frames (at time t+1 and t-1) instead of the current frame, so the temporal shift is applied inside the attention itself.

Findings

01. MSCA-KV outperforms TokenShift by 0.1% on Kinetics400.

02. MSCA improves over standard ViT by 1.2%.

03. The method maintains the same computational cost as existing models.

Abstract

Feature shifts have been shown to be useful for action recognition with CNN-based models since Temporal Shift Module (TSM) was proposed. It is based on frame-wise feature extraction with late fusion, and layer features are shifted along the time direction for the temporal interaction. TokenShift, a recent model based on Vision Transformer (ViT), also uses the temporal feature shift mechanism, which, however, does not fully exploit the structure of Multi-head Self-Attention (MSA) in ViT. In this paper, we propose Multi-head Self/Cross-Attention (MSCA), which fully utilizes the attention structure. TokenShift is based on a frame-wise ViT with features temporally shifted with successive frames (at time t+1 and t-1). In contrast, the proposed MSCA replaces MSA in the frame-wise ViT, and some MSA heads attend to successive frames instead of the current frame. The computation cost is the same…
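
The head-wise shift described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `msca_kv`, `shift_time`, `n_back`, and `n_fwd` are hypothetical, "MSCA-KV" is read here as the variant in which the keys and values of some heads are taken from neighboring frames (an interpretation the text above does not spell out), edge frames are handled by replication (boundary handling is not specified in the text), and the output projection is omitted for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def shift_time(x, offset):
    # y[t] = x[t + offset] along axis 0. Edge frames are replicated,
    # which is an assumption made for this sketch.
    T = x.shape[0]
    idx = np.clip(np.arange(T) + offset, 0, T - 1)
    return x[idx]


def msca_kv(x, Wq, Wk, Wv, num_heads, n_back=1, n_fwd=1):
    """Hypothetical MSCA-style attention on frame-wise tokens.

    x: (T, N, D) array of T frames, N tokens per frame, D channels.
    The first n_back heads take keys/values from frame t-1, the next
    n_fwd heads from frame t+1, and the remaining heads from frame t.
    """
    T, N, D = x.shape
    d = D // num_heads

    def split(z):
        # (T, N, D) -> (T, heads, N, d)
        return z.reshape(T, N, num_heads, d).transpose(0, 2, 1, 3)

    q, k, v = split(x @ Wq), split(x @ Wk), split(x @ Wv)

    def mix(z):
        # Replace the per-frame keys/values of selected heads with
        # those of the previous or next frame.
        b, f = n_back, n_back + n_fwd
        return np.concatenate(
            [shift_time(z[:, :b], -1), shift_time(z[:, b:f], +1), z[:, f:]],
            axis=1,
        )

    k, v = mix(k), mix(v)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d)  # (T, heads, N, N)
    out = softmax(scores) @ v                          # (T, heads, N, d)
    return out.transpose(0, 2, 1, 3).reshape(T, N, D)


rng = np.random.default_rng(0)
T, N, D, H = 4, 3, 8, 4
x = rng.normal(size=(T, N, D))
Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))
y = msca_kv(x, Wq, Wk, Wv, num_heads=H)
print(y.shape)  # (4, 3, 8)
```

Setting `n_back=0` and `n_fwd=0` reduces the sketch to ordinary per-frame multi-head self-attention. In either case the attention maps have the same shape, and only the indexing of keys and values changes, which is consistent with the abstract's statement that the computation cost is the same.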

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Video Surveillance and Tracking Methods

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Dropout · Softmax · Layer Normalization · Label Smoothing · Byte Pair Encoding · Position-Wise Feed-Forward Layer