Multi-stage Factorized Spatio-Temporal Representation for RGB-D Action and Gesture Recognition
Yujun Ma, Benjia Zhou, Ruili Wang, Pichao Wang

TL;DR
This paper introduces a multi-stage factorized spatio-temporal architecture for RGB-D action and gesture recognition that captures fine-grained motion and hierarchical features to improve recognition accuracy.
Contribution
The paper proposes the Multi-stage Factorized Spatio-Temporal (MFST) model, which combines a CDC-Stem module with hierarchical spatio-temporal stages to better capture motion details and semantic primitives in RGB-D recognition.
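To give a feel for why a CDC-style stem might help with fine-grained motion, here is a minimal sketch of central difference convolution along a single (temporal) axis. It assumes "CDC" refers to central difference convolution, where the center sample, scaled by a factor theta, is subtracted from an ordinary convolution. The function name, the 1D simplification, the weights, and the theta values are illustrative assumptions and are not taken from the paper, whose stem operates on 3D video volumes.

```python
def central_difference_conv1d(signal, weights, theta=0.7):
    """Hypothetical 1D central difference convolution (valid padding).

    output[t] = sum_n w[n] * x[t + n]  -  theta * x[t] * sum_n w[n]

    theta = 0 gives an ordinary (vanilla) convolution; theta = 1 responds
    only to differences between each neighbor and the center sample.
    """
    k = len(weights)
    if k % 2 != 1:
        raise ValueError("kernel length must be odd so a center sample exists")
    half = k // 2
    weight_sum = sum(weights)
    out = []
    for t in range(half, len(signal) - half):
        window = signal[t - half : t + half + 1]
        vanilla = sum(w * x for w, x in zip(weights, window))
        out.append(vanilla - theta * signal[t] * weight_sum)
    return out


# Illustrative inputs: one pixel's intensity over five frames.
weights = [0.25, 0.5, 0.25]
static = [2.0, 2.0, 2.0, 2.0, 2.0]   # no motion
moving = [0.0, 0.0, 1.0, 3.0, 6.0]   # accelerating change

static_vanilla = central_difference_conv1d(static, weights, theta=0.0)
static_cdc = central_difference_conv1d(static, weights, theta=1.0)
moving_cdc = central_difference_conv1d(moving, weights, theta=1.0)
```

With theta set to 1, the static sequence produces an all-zero response while the changing sequence does not, which is the intuition behind using difference-aware kernels to pick up subtle motion between frames. Intermediate theta values blend intensity and difference information.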
Findings
Outperforms state-of-the-art on benchmark datasets
Enhances fine-grained temporal perception
Improves hierarchical spatio-temporal feature extraction
Abstract
RGB-D action and gesture recognition remain an interesting topic in human-centered scene understanding, primarily due to the multiple granularities and large variation in human motion. Although many RGB-D based action and gesture recognition approaches have demonstrated remarkable results by utilizing highly integrated spatio-temporal representations across multiple modalities (i.e., RGB and depth data), they still encounter several challenges. Firstly, vanilla 3D convolution makes it hard to capture fine-grained motion differences between local clips under different modalities. Secondly, the intricate nature of highly integrated spatio-temporal modeling can lead to optimization difficulties. Thirdly, duplicate and unnecessary information can add complexity and complicate entangled spatio-temporal modeling. To address the above issues, we propose an innovative heuristic architecture…
Taxonomy
Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Video Surveillance and Tracking Methods
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Layer Normalization · Dense Connections · Absolute Position Encodings · Residual Connection
