Multi-stage Factorized Spatio-Temporal Representation for RGB-D Action and Gesture Recognition

Yujun Ma; Benjia Zhou; Ruili Wang; Pichao Wang

arXiv:2308.12006·cs.CV·September 12, 2023·2 cites

Open Access · 1 Repo

TL;DR

This paper introduces a novel multi-stage factorized spatio-temporal architecture for RGB-D action and gesture recognition, effectively capturing fine-grained motion and hierarchical features to improve recognition accuracy.

Contribution

The paper proposes the MFST model with CDC-Stem and hierarchical spatio-temporal stages, advancing RGB-D recognition by better capturing motion details and semantic primitives.

Findings

01 Outperforms state-of-the-art on benchmark datasets

02 Enhances fine-grained temporal perception

03 Improves hierarchical spatio-temporal feature extraction

Abstract

RGB-D action and gesture recognition remain an interesting topic in human-centered scene understanding, primarily due to the multiple granularities and large variation in human motion. Although many RGB-D based action and gesture recognition approaches have demonstrated remarkable results by utilizing highly integrated spatio-temporal representations across multiple modalities (i.e., RGB and depth data), they still encounter several challenges. Firstly, vanilla 3D convolution makes it hard to capture fine-grained motion differences between local clips under different modalities. Secondly, the intricate nature of highly integrated spatio-temporal modeling can lead to optimization difficulties. Thirdly, duplicate and unnecessary information can add complexity and complicate entangled spatio-temporal modeling. To address the above issues, we propose an innovative heuristic architecture…
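The contrast the abstract draws between vanilla 3D convolution and a factorized spatio-temporal design can be made concrete with a simple weight count. The sketch below is a generic illustration of splitting one dense 3D kernel into a spatial convolution followed by a temporal one; it is not the MFST architecture itself, and the channel widths, kernel sizes, and function names are hypothetical values chosen only for the arithmetic.

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    """Weight count of a dense 3D convolution with a kt x kh x kw kernel (no bias)."""
    return c_in * c_out * kt * kh * kw


def factorized_params(c_in, c_mid, c_out, kt, kh, kw):
    """Weight count of a spatial (1 x kh x kw) convolution into c_mid channels,
    followed by a temporal (kt x 1 x 1) convolution into c_out channels."""
    spatial = conv3d_params(c_in, c_mid, 1, kh, kw)
    temporal = conv3d_params(c_mid, c_out, kt, 1, 1)
    return spatial + temporal


# Hypothetical layer: 64 input and output channels, 3 x 3 x 3 receptive field.
full = conv3d_params(64, 64, 3, 3, 3)
fact = factorized_params(64, 64, 64, 3, 3, 3)

print(f"vanilla 3D conv weights:  {full}")
print(f"factorized conv weights:  {fact}")
```

Under these assumed sizes the factorized pair covers the same 3 x 3 x 3 receptive field with fewer weights, and the spatial and temporal parts become separate layers rather than one entangled kernel.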

Code & Models

Repositories

damo-cv/motionrgbd
pytorch

Taxonomy

Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Video Surveillance and Tracking Methods

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Layer Normalization · Dense Connections · Absolute Position Encodings · Residual Connection