TL;DR
This paper introduces a novel Spatial-Temporal Transformer network for skeleton-based action recognition, leveraging self-attention mechanisms to better model joint dependencies and improve recognition accuracy across multiple large-scale datasets.
Contribution
The paper proposes a new ST-TR model that uses spatial and temporal self-attention modules to enhance the encoding of 3D skeleton data for action recognition.
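As a rough illustration of what spatial and temporal self-attention over skeleton data can look like, here is a minimal NumPy sketch. It is not the authors' implementation: it uses a single head, random projection weights, and no positional encoding or normalization. The tensor sizes and every name in it (`self_attention`, `skeleton`, `w_q`, `w_k`, `w_v`) are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (hypothetical helper).

    x has shape (..., N, C_in); attention is computed among the N items.
    Returns the attended features (..., N, D) and the weights (..., N, N).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)
    return attn @ v, attn

rng = np.random.default_rng(0)
# Illustrative sizes: frames, joints, input channels (x, y, z), hidden size.
T, V, C, D = 4, 25, 3, 8
skeleton = rng.standard_normal((T, V, C))
w_q, w_k, w_v = (rng.standard_normal((C, D)) for _ in range(3))

# Spatial attention: within each frame, every joint attends to all joints.
spatial_out, spatial_attn = self_attention(skeleton, w_q, w_k, w_v)

# Temporal attention: for each joint, every frame attends to all frames.
per_joint = np.swapaxes(skeleton, 0, 1)  # shape (V, T, C)
temporal_out, temporal_attn = self_attention(per_joint, w_q, w_k, w_v)

print(spatial_out.shape, spatial_attn.shape)    # (T, V, D) and (T, V, V)
print(temporal_out.shape, temporal_attn.shape)  # (V, T, D) and (V, T, T)
```

The only difference between the two calls is which axis plays the role of the sequence: joints within a frame for the spatial case, frames of a single joint for the temporal case.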
Findings
Achieves state-of-the-art results on the NTU-RGB+D 60 and NTU-RGB+D 120 datasets.
Performs on par with the state of the art when bone information is incorporated.
Consistently improves backbone models across multiple datasets.
Abstract
Skeleton-based Human Activity Recognition has attracted great interest in recent years, as skeleton data has proven robust to illumination changes, body scales, dynamic camera views, and complex backgrounds. In particular, Spatial-Temporal Graph Convolutional Networks (ST-GCN) have proven effective in learning both spatial and temporal dependencies on non-Euclidean data such as skeleton graphs. Nevertheless, effectively encoding the latent information underlying the 3D skeleton is still an open problem, especially when it comes to extracting useful information from joint motion patterns and their correlations. In this work, we propose a novel Spatial-Temporal Transformer network (ST-TR) which models dependencies between joints using the Transformer self-attention operator. In our ST-TR model, a Spatial Self-Attention module (SSA) is used to understand intra-frame…
Taxonomy
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Graph Convolutional Networks · Multi-Head Attention · Layer Normalization · Attention Is All You Need · Byte Pair Encoding · Dropout · Label Smoothing
