Focal and Global Spatial-Temporal Transformer for Skeleton-based Action Recognition

Zhimin Gao; Peitao Wang; Pei Lv; Xiaoheng Jiang; Qidong Liu; Pichao Wang; Mingliang Xu; Wanqing Li

arXiv:2210.02693 · cs.CV · October 7, 2022 · 5 cites

TL;DR

This paper introduces FG-STFormer, a transformer model for skeleton-based action recognition that emphasizes discriminative local joints and short-range temporal dynamics, outperforming existing transformer-based methods on the NTU-60, NTU-120, and NW-UCLA benchmarks.

Contribution

The paper proposes a new focal and global spatial-temporal transformer with joint and body part coupling and dilated temporal convolution, enhancing local and short-range temporal modeling.
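To make the "dilated temporal convolution" idea concrete, here is a minimal sketch of a 1-D dilated convolution along the time axis, written in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function name `dilated_temporal_conv`, the zero padding, the single-channel input, and the example kernel are all hypothetical choices made for this example.

```python
def dilated_temporal_conv(sequence, kernel, dilation=1):
    """1-D dilated convolution (cross-correlation) over time with zero padding.

    sequence: list of floats, one value per frame (a single joint channel).
    kernel:   list of floats with odd length.
    dilation: spacing, in frames, between neighbouring kernel taps.
    The output has the same length as the input.
    """
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd")
    half = len(kernel) // 2
    n = len(sequence)
    out = []
    for t in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t + (k - half) * dilation  # frame touched by this tap
            if 0 <= idx < n:                 # frames outside the clip count as zero
                acc += w * sequence[idx]
        out.append(acc)
    return out


# Hypothetical example: one coordinate of one joint over six frames.
frames = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
diff_kernel = [-1.0, 0.0, 1.0]  # illustrative kernel: difference of two frames

short_range = dilated_temporal_conv(frames, diff_kernel, dilation=1)
longer_range = dilated_temporal_conv(frames, diff_kernel, dilation=2)
```

With `dilation=1` each output compares frames one step apart; with `dilation=2` the same three-tap kernel spans five frames, which is how dilation widens the temporal receptive field without adding weights.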

Findings

1. Outperforms existing transformer-based methods on the NTU-60, NTU-120, and NW-UCLA datasets.

2. Effectively models local joints and short-term temporal dynamics.

3. Achieves performance comparable to state-of-the-art GCN-based methods.

Abstract

Despite great progress achieved by transformer in various vision tasks, it is still underexplored for skeleton-based action recognition with only a few attempts. Besides, these methods directly calculate the pair-wise global self-attention equally for all the joints in both the spatial and temporal dimensions, undervaluing the effect of discriminative local joints and the short-range temporal dynamics. In this work, we propose a novel Focal and Global Spatial-Temporal Transformer network (FG-STFormer), that is equipped with two key components: (1) FG-SFormer: focal joints and global parts coupling spatial transformer. It forces the network to focus on modelling correlations for both the learned discriminative spatial joints and human body parts respectively. The selective focal joints eliminate the negative effect of non-informative ones during accumulating the correlations. Meanwhile,…
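The focal-joint idea in the abstract can be illustrated with a small sketch: pick a subset of joints, then compute self-attention only among them so that the remaining joints do not contribute to the accumulated correlations. Everything specific here is an assumption for illustration: the abstract does not say how focal joints are chosen, so the per-joint `scores`, the top-k rule, and the function names are hypothetical, and the learned query/key/value projections of a real transformer are omitted.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def select_focal_joints(scores, k):
    """Indices of the k highest-scoring joints, returned in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def focal_self_attention(features, scores, k):
    """Scaled dot-product self-attention restricted to the k focal joints.

    features: one feature vector (list of floats) per joint.
    scores:   hypothetical per-joint informativeness scores (an assumption).
    Returns a dict mapping each focal joint index to its attended feature.
    """
    focal = select_focal_joints(scores, k)
    dim = len(features[0])
    attended = {}
    for i in focal:
        # Similarity of joint i to every focal joint (including itself).
        logits = [
            sum(a * b for a, b in zip(features[i], features[j])) / math.sqrt(dim)
            for j in focal
        ]
        weights = softmax(logits)
        # Weighted sum of focal-joint features; non-focal joints are ignored.
        attended[i] = [
            sum(w * features[j][d] for w, j in zip(weights, focal))
            for d in range(dim)
        ]
    return attended


# Hypothetical example: four joints with 2-D features and made-up scores.
example_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
example_scores = [0.9, 0.1, 0.8, 0.2]
attended = focal_self_attention(example_features, example_scores, k=2)
```

In this example only joints 0 and 2 are kept, so the attention matrix is 2×2 instead of 4×4; the low-scoring joints neither attend nor are attended to.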

Taxonomy

Topics: Human Pose and Action Recognition · Medical Imaging and Analysis · Gait Recognition and Analysis

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Label Smoothing · Softmax · Convolution · Byte Pair Encoding · Adam · Dense Connections