Spatial Temporal Transformer Network for Skeleton-based Action Recognition

Chiara Plizzari; Marco Cannici; Matteo Matteucci

arXiv:2012.06399 · cs.CV · June 24, 2021

TL;DR

This paper introduces a novel Spatial-Temporal Transformer network (ST-TR) for skeleton-based action recognition, effectively modeling joint dependencies through self-attention mechanisms to improve recognition accuracy.

Contribution

The paper presents a new Transformer-based architecture with spatial and temporal self-attention modules for enhanced skeleton-based action recognition.

Findings

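01 Outperforms state-of-the-art models that use the same input data on both NTU-RGB+D 60 and NTU-RGB+D 120

02 Models intra-frame dependencies with a Spatial Self-Attention module (SSA) and inter-frame dependencies with a Temporal Self-Attention module (TSA)

03 Combines the two modules in a two-stream network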

Abstract

Skeleton-based human action recognition has attracted great interest in recent years, as skeleton data has been demonstrated to be robust to illumination changes, body scales, dynamic camera views, and complex backgrounds. Nevertheless, an effective encoding of the latent information underlying the 3D skeleton is still an open problem. In this work, we propose a novel Spatial-Temporal Transformer network (ST-TR) which models dependencies between joints using the Transformer self-attention operator. In our ST-TR model, a Spatial Self-Attention module (SSA) is used to understand intra-frame interactions between different body parts, and a Temporal Self-Attention module (TSA) is used to model inter-frame correlations. The two are combined in a two-stream network which outperforms state-of-the-art models using the same input data on both NTU-RGB+D 60 and NTU-RGB+D 120.
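
To make the distinction between the two modules concrete, the sketch below applies single-head scaled dot-product self-attention along two different axes of a skeleton tensor: across joints within each frame (the SSA idea) and across frames for each joint (the TSA idea). It is a minimal NumPy illustration, not the authors' implementation. The function names, the random projection matrices, and the sizes (8 frames, 25 joints, 3 coordinates, 16 output channels) are assumptions chosen for the example, and it leaves out multi-head attention, the rest of the network, and the two-stream combination.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x has shape (..., N, C_in); attention is computed among the N elements.
    Returns the attended features (..., N, C_out) and the weights (..., N, N).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


def spatial_self_attention(x, w_q, w_k, w_v):
    # x: (T, V, C). Each frame is handled independently and the V joints
    # of that frame attend to one another (intra-frame interactions).
    return self_attention(x, w_q, w_k, w_v)


def temporal_self_attention(x, w_q, w_k, w_v):
    # x: (T, V, C). Each joint is handled independently and its T time
    # steps attend to one another (inter-frame correlations).
    x_joint_major = np.swapaxes(x, 0, 1)  # (V, T, C)
    out, weights = self_attention(x_joint_major, w_q, w_k, w_v)
    return np.swapaxes(out, 0, 1), weights  # (T, V, C_out), (V, T, T)


# Hypothetical sizes: 8 frames, 25 joints, 3D coordinates, 16 output channels.
rng = np.random.default_rng(0)
T, V, C, D = 8, 25, 3, 16
x = rng.standard_normal((T, V, C))
w_q, w_k, w_v = (rng.standard_normal((C, D)) * 0.1 for _ in range(3))

ssa_out, ssa_w = spatial_self_attention(x, w_q, w_k, w_v)
tsa_out, tsa_w = temporal_self_attention(x, w_q, w_k, w_v)
print(ssa_out.shape, ssa_w.shape)
print(tsa_out.shape, tsa_w.shape)
```

In this sketch the spatial variant yields one joint-by-joint attention map per frame, while the temporal variant yields one frame-by-frame attention map per joint; the only difference between the two is which axis plays the role of the sequence.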

Code & Models

Repositories

Chiaraplizz/ST-TR (PyTorch, official)

Taxonomy

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Residual Connection · Dropout · Attention Is All You Need · Byte Pair Encoding · Layer Normalization · Dense Connections