Masked Motion Predictors are Strong 3D Action Representation Learners
Yunyao Mao, Jiajun Deng, Wengang Zhou, Yao Fang, Wanli Ouyang, Houqiang Li

TL;DR
This paper introduces MAMP, a self-supervised pre-training framework for 3D human action recognition that predicts motion in masked skeleton sequences, significantly enhancing transformer performance on benchmark datasets.
Contribution
The paper proposes a masked motion prediction framework that replaces the prevalent masked self-component reconstruction of human joints with explicit contextual motion modeling, with the aim of learning better 3D action representations.
Findings
MAMP improves transformer-based models on NTU-60, NTU-120, and PKU-MMD datasets.
It achieves state-of-the-art results without additional bells and whistles.
Motion prediction as a pretext task enhances semantic focus in skeleton sequences.
Abstract
In 3D human action recognition, limited supervised data makes it challenging to fully tap into the modeling potential of powerful networks such as transformers. As a result, researchers have been actively investigating effective self-supervised pre-training strategies. In this work, we show that instead of following the prevalent pretext task to perform masked self-component reconstruction in human joints, explicit contextual motion modeling is key to the success of learning effective feature representation for 3D action recognition. Formally, we propose the Masked Motion Prediction (MAMP) framework. To be specific, the proposed MAMP takes as input the masked spatio-temporal skeleton sequence and predicts the corresponding temporal motion of the masked human joints. Considering the high temporal redundancy of the skeleton sequence, in our MAMP, the motion information also acts as an…
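The pretext task described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the helper names (`temporal_motion`, `random_joint_mask`, `masked_motion_loss`), the first-order frame difference used as the motion target, the uniform random masking, the zero-filling of masked joints, and the mean-squared-error loss are all assumptions chosen for illustration. The sketch also does not attempt to reproduce the role of motion information mentioned in the truncated final sentence of the abstract.

```python
import numpy as np

def temporal_motion(seq, stride=1):
    """Frame-to-frame displacement of every joint (assumed motion target).

    seq has shape (T, V, C): frames, joints, coordinates.
    The first `stride` frames have no predecessor and get zero motion.
    """
    motion = np.zeros_like(seq)
    motion[stride:] = seq[stride:] - seq[:-stride]
    return motion

def random_joint_mask(T, V, mask_ratio, rng):
    """Boolean (T, V) mask, True = masked. Uniform sampling is an assumption."""
    n = T * V
    n_masked = int(round(mask_ratio * n))
    flat = np.zeros(n, dtype=bool)
    flat[rng.choice(n, size=n_masked, replace=False)] = True
    return flat.reshape(T, V)

def masked_motion_loss(pred, target, mask):
    """Mean squared error computed only over the masked joints."""
    diff = (pred - target)[mask]  # shape (n_masked, C)
    return float(np.mean(diff ** 2))

# Toy skeleton sequence: 8 frames, 25 joints, 3D coordinates.
rng = np.random.default_rng(0)
T, V, C = 8, 25, 3
seq = rng.normal(size=(T, V, C))

target = temporal_motion(seq)             # what the model should predict
mask = random_joint_mask(T, V, 0.9, rng)  # hypothetical 90% mask ratio

# Masked joints are zeroed out before being fed to the (omitted) encoder.
visible_input = np.where(mask[..., None], 0.0, seq)

# Stand-in for a transformer's output; a real model would map
# visible_input to a motion prediction of the same shape as target.
pred = np.zeros_like(target)
loss = masked_motion_loss(pred, target, mask)
```

In this sketch the loss is evaluated only where `mask` is True, which mirrors the abstract's statement that the model predicts the temporal motion of the masked joints rather than reconstructing their coordinates.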
Taxonomy
Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Multimodal Machine Learning Applications
