Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition

Wangmeng Xiang; Chao Li; Biao Wang; Xihan Wei; Xian-Sheng Hua; Lei Zhang

arXiv:2207.13259·cs.CV·July 28, 2022·1 cites

TL;DR

This paper introduces Temporal Patch Shift, a plug-and-play module that enables efficient 3D self-attention in transformers for video action recognition, achieving competitive accuracy with reduced computational and memory costs.

Contribution

The paper proposes a novel Temporal Patch Shift method that converts spatial self-attention into spatiotemporal self-attention with minimal additional cost, enhancing efficiency in video transformers.

Findings

01. Achieves competitive accuracy on action recognition benchmarks.
02. Reduces computation and memory costs compared to existing methods.
03. Easily integrable into existing 2D transformer architectures.
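
To make the cost claim in finding 02 concrete, the following back-of-the-envelope sketch counts query-key pairs for joint spatiotemporal attention versus per-frame spatial attention. The frame count and patch grid are illustrative assumptions, not settings taken from the paper, and `attention_pairs` is a hypothetical helper.

```python
def attention_pairs(num_tokens):
    """Number of query-key pairs in one self-attention over num_tokens tokens."""
    return num_tokens * num_tokens

# Illustrative sizes (assumptions): 8 frames, 14 x 14 = 196 patches per frame.
T, N = 8, 14 * 14

joint = attention_pairs(T * N)      # one attention over all T*N video tokens
per_frame = T * attention_pairs(N)  # T independent attentions over N tokens each

print(joint, per_frame, joint // per_frame)  # prints: 2458624 307328 8
```

Under these assumptions joint attention needs T times as many pairwise scores as per-frame attention, which is the gap an approach based on spatial attention aims to avoid.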

Abstract

Transformer-based methods have recently achieved great advances on 2D image-based vision tasks. For 3D video-based tasks such as action recognition, however, directly applying spatiotemporal transformers to video data brings heavy computation and memory burdens, due to the greatly increased number of patches and the quadratic complexity of self-attention computation. How to model the 3D self-attention of video data efficiently and effectively has been a great challenge for transformers. In this paper, we propose a Temporal Patch Shift (TPS) method for efficient 3D self-attention modeling in transformers for video-based action recognition. TPS shifts part of the patches with a specific mosaic pattern in the temporal dimension, thus converting a vanilla spatial self-attention operation into a spatiotemporal one with little additional cost. As a result, we can compute 3D self-attention…
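
The shifting idea described in the abstract can be illustrated with a toy sketch. Everything specific here is an assumption made for illustration: the 2x2 offset pattern, the cyclic wrap-around at the first and last frames, and the function name `temporal_patch_shift` are hypothetical and are not taken from the paper or its repository.

```python
# Hypothetical 2x2 mosaic of temporal offsets (an assumption, not the paper's pattern).
# 0 keeps a patch in its own frame; +1 / -1 borrow it from a neighbouring frame.
PATTERN = [[0, 1],
           [-1, 0]]

def temporal_patch_shift(video, pattern=PATTERN):
    """Return a copy of video[t][h][w] in which each spatial position (h, w)
    takes its patch from frame (t + offset) mod T, where the offset is tiled
    from `pattern` over the H x W patch grid."""
    T = len(video)
    H = len(video[0])
    W = len(video[0][0])
    ph, pw = len(pattern), len(pattern[0])
    shifted = []
    for t in range(T):
        frame = []
        for h in range(H):
            row = []
            for w in range(W):
                offset = pattern[h % ph][w % pw]
                # Cyclic wrap at the clip boundaries is an assumption of this sketch.
                row.append(video[(t + offset) % T][h][w])
            frame.append(row)
        shifted.append(frame)
    return shifted

# Toy "video": 4 frames of 2x2 patches, each patch labelled with its source frame index.
T, H, W = 4, 2, 2
video = [[[t for _ in range(W)] for _ in range(H)] for t in range(T)]
shifted = temporal_patch_shift(video)
print(shifted[1])  # prints: [[1, 2], [0, 1]]
```

After the shift, frame 1 holds patches that originated in frames 0, 1 and 2, so an ordinary per-frame spatial self-attention applied to it would mix information across time without attending over more tokens.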

Code & Models

Repositories

martinxm/tps
PyTorch · Official

Taxonomy

Topics: Human Pose and Action Recognition · Advanced Neural Network Applications · Hand Gesture Recognition Systems