Skeleton-based Action Recognition via Spatial and Temporal Transformer Networks

Chiara Plizzari; Marco Cannici; Matteo Matteucci

arXiv:2008.07404·cs.CV·June 23, 2021

TL;DR

This paper introduces a novel Spatial-Temporal Transformer network for skeleton-based action recognition, leveraging self-attention mechanisms to better model joint dependencies and improve recognition accuracy across multiple large-scale datasets.

Contribution

The paper proposes a new ST-TR model that uses spatial and temporal self-attention modules to enhance the encoding of 3D skeleton data for action recognition.
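As a rough illustration of the idea, the sketch below applies single-head scaled dot-product self-attention along two different axes of a skeleton sequence: across joints within each frame (spatial) and across frames for each joint (temporal). It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names, the random projection matrices, the tensor layout (frames, joints, channels), and the sizes are all hypothetical, and details of the actual ST-TR model such as multiple heads, normalization, and how attention is combined with graph convolutions are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x:        (..., N, C_in), N elements attend to each other
    w_q, w_k: (C_in, d_k) query/key projections (hypothetical, random here)
    w_v:      (C_in, d_out) value projection
    Returns the attended features (..., N, d_out) and weights (..., N, N).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


def spatial_self_attention(skel, w_q, w_k, w_v):
    # skel: (T, V, C). Joints attend to each other within each frame.
    return self_attention(skel, w_q, w_k, w_v)


def temporal_self_attention(skel, w_q, w_k, w_v):
    # Assumption: each joint is treated independently, and its T frames
    # attend to each other.
    per_joint = np.swapaxes(skel, 0, 1)            # (V, T, C)
    out, weights = self_attention(per_joint, w_q, w_k, w_v)
    return np.swapaxes(out, 0, 1), weights         # (T, V, d_out), (V, T, T)


# Illustrative sizes: 4 frames, 25 joints, 3D coordinates, 8 output channels.
rng = np.random.default_rng(0)
T, V, C, D = 4, 25, 3, 8
skel = rng.standard_normal((T, V, C))
w_q, w_k, w_v = (rng.standard_normal((C, D)) for _ in range(3))

spatial_out, spatial_w = spatial_self_attention(skel, w_q, w_k, w_v)
temporal_out, temporal_w = temporal_self_attention(skel, w_q, w_k, w_v)
```

Here `spatial_w` holds one joint-by-joint attention matrix per frame, while `temporal_w` holds one frame-by-frame attention matrix per joint. Unlike a fixed skeleton adjacency matrix, both are computed from the input itself.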

Findings

01. Achieves state-of-the-art results on the NTU-RGB+D 60 and NTU-RGB+D 120 datasets.

02. Performs on par with the state of the art when bone information is incorporated.

03. Consistently improves backbone models across multiple datasets.

Abstract

Skeleton-based Human Activity Recognition has attracted great interest in recent years, as skeleton data has proven robust to illumination changes, body scales, dynamic camera views, and complex backgrounds. In particular, Spatial-Temporal Graph Convolutional Networks (ST-GCN) have proven effective in learning both spatial and temporal dependencies on non-Euclidean data such as skeleton graphs. Nevertheless, an effective encoding of the latent information underlying the 3D skeleton is still an open problem, especially when it comes to extracting effective information from joint motion patterns and their correlations. In this work, we propose a novel Spatial-Temporal Transformer network (ST-TR) which models dependencies between joints using the Transformer self-attention operator. In our ST-TR model, a Spatial Self-Attention module (SSA) is used to understand intra-frame…

Code & Models

Repositories

Chiaraplizz/ST-TR
PyTorch · Official

Taxonomy

Methods

Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Graph Convolutional Networks · Multi-Head Attention · Layer Normalization · Attention Is All You Need · Byte Pair Encoding · Dropout · Label Smoothing