Learning Coupled Spatial-temporal Attention for Skeleton-based Action Recognition
Jiayun Wang

TL;DR
This paper introduces a coupled spatial-temporal attention model for skeleton-based action recognition, which identifies the most informative joints and frames to improve recognition accuracy, and can be integrated into existing CNN architectures.
Contribution
The paper proposes a novel coupled spatial-temporal attention mechanism that learns to focus on important joints and frames simultaneously, enhancing skeleton-based action recognition.
Findings
Effective on UESTC and NTU datasets
Improves recognition accuracy with attention mechanism
Compatible with existing CNN models
Abstract
In this paper, we propose a coupled spatial-temporal attention (CSTA) model for skeleton-based action recognition, which aims to identify the most discriminative joints and frames in the spatial and temporal domains simultaneously. Conventional approaches usually treat all joints or frames in a skeletal sequence as equally important, which makes them less robust to ambiguous and redundant information. To address this, we first learn two sets of weights for different joints and frames through two subnetworks respectively, which enable the model to "pay attention to" the relatively informative parts of the sequence. Then, we calculate the cross product of the joint weights and frame weights to obtain the coupled spatial-temporal attention. Moreover, our CSTA mechanisms can be easily plugged into existing hierarchical CNN models (CSTA-CNN) to realize their function. Extensive experimental…
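The coupling step described in the abstract can be illustrated with a minimal NumPy sketch. It rests on several assumptions that the abstract does not confirm: the "cross product" of the two weight sets is read here as an outer product of a per-frame weight vector and a per-joint weight vector, the raw scores are normalized with a softmax, and the resulting map is applied by element-wise multiplication. The two subnetworks are replaced by random scores, and all function names are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def coupled_attention(joint_scores, frame_scores):
    """Combine per-joint and per-frame scores into one (T, J) attention map.

    Assumption: the coupling is an outer product of normalized weights;
    the paper's exact formulation may differ.
    """
    a_joint = softmax(joint_scores)    # shape (J,), spatial attention
    a_frame = softmax(frame_scores)    # shape (T,), temporal attention
    return np.outer(a_frame, a_joint)  # shape (T, J)

def apply_attention(skeleton, attention):
    """Reweight a skeleton sequence of shape (T, J, C) by a (T, J) map."""
    return skeleton * attention[:, :, None]

# Stand-ins for the outputs of the two subnetworks (hypothetical values).
rng = np.random.default_rng(0)
T, J, C = 4, 5, 3                      # frames, joints, coordinates per joint
skeleton = rng.normal(size=(T, J, C))
joint_scores = rng.normal(size=J)
frame_scores = rng.normal(size=T)

attention = coupled_attention(joint_scores, frame_scores)
weighted = apply_attention(skeleton, attention)
```

Under these assumptions, each entry `attention[t, j]` is the product of the weight of frame `t` and the weight of joint `j`, so a joint only receives high attention in frames that are themselves considered informative.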
Taxonomy
Topics: Human Pose and Action Recognition · Gait Recognition and Analysis · Hand Gesture Recognition Systems
