TL;DR
This paper introduces a novel Spatial-Temporal Transformer network (ST-TR) for skeleton-based action recognition, effectively modeling joint dependencies through self-attention mechanisms to improve recognition accuracy.
Contribution
The paper presents a new Transformer-based architecture with spatial and temporal self-attention modules for enhanced skeleton-based action recognition.
Findings
Outperforms state-of-the-art models that use the same input data on both NTU-RGB+D 60 and NTU-RGB+D 120
Effectively models intra-frame and inter-frame dependencies
Uses a two-stream network architecture
Abstract
Skeleton-based human action recognition has attracted great interest in recent years, as skeleton data has been shown to be robust to illumination changes, body scales, dynamic camera views, and complex backgrounds. Nevertheless, effectively encoding the latent information underlying the 3D skeleton is still an open problem. In this work, we propose a novel Spatial-Temporal Transformer network (ST-TR) which models dependencies between joints using the Transformer self-attention operator. In our ST-TR model, a Spatial Self-Attention module (SSA) is used to understand intra-frame interactions between different body parts, and a Temporal Self-Attention module (TSA) is used to model inter-frame correlations. The two are combined in a two-stream network which outperforms state-of-the-art models using the same input data on both NTU-RGB+D 60 and NTU-RGB+D 120.
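To make the distinction between the two modules concrete, here is a minimal sketch of self-attention applied along the joint axis (spatial) and along the frame axis (temporal). It is an illustration under stated assumptions, not the authors' implementation: queries, keys, and values are the raw inputs rather than learned projections, there is a single attention head, and the function names and toy data are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    # Scaled dot-product self-attention over a list of feature vectors.
    # Assumption: Q = K = V = tokens (no learned projections, one head).
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out

def spatial_self_attention(seq):
    # seq[t][j] is the feature vector of joint j in frame t.
    # Each joint attends to the other joints of the SAME frame (intra-frame).
    return [self_attention(frame) for frame in seq]

def temporal_self_attention(seq):
    # Each joint attends to ITSELF across all frames (inter-frame).
    num_frames, num_joints = len(seq), len(seq[0])
    out = [[None] * num_joints for _ in range(num_frames)]
    for j in range(num_joints):
        track = [seq[t][j] for t in range(num_frames)]
        attended = self_attention(track)
        for t in range(num_frames):
            out[t][j] = attended[t]
    return out

# Toy skeleton sequence: 3 frames, 4 joints, 3-D coordinates (made-up values).
seq = [[[float(t + j), float(j), float(t)] for j in range(4)] for t in range(3)]
spatial = spatial_self_attention(seq)
temporal = temporal_self_attention(seq)
```

Both functions preserve the (frames, joints, channels) layout. The only difference is which axis the attention weights are computed over, which is what separates intra-frame from inter-frame modeling in this sketch.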
Taxonomy
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Residual Connection · Dropout · Attention Is All You Need · Byte Pair Encoding · Layer Normalization · Dense Connections
