P-STMO: Pre-Trained Spatial Temporal Many-to-One Model for 3D Human Pose Estimation
Wenkang Shan, Zhenhua Liu, Xinfeng Zhang, Shanshe Wang, Siwei Ma, Wen Gao

TL;DR
This paper proposes P-STMO, a model for 3D human pose estimation trained in two stages: self-supervised pre-training with masked pose modeling, followed by fine-tuning with a many-to-one frame aggregator. It reports state-of-the-art accuracy at reduced computational cost.
Contribution
The paper introduces a pre-training strategy based on masked pose modeling, together with a many-to-one frame aggregator, for efficient 3D human pose estimation.
Findings
Outperforms state-of-the-art methods on benchmarks
Achieves 42.1 mm MPJPE on the Human3.6M dataset
Reduces computational overhead and model parameters
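For readers unfamiliar with the metric above: MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth 3D joint positions. The following is a minimal sketch of that definition only. The function name `mpjpe` and the list-of-tuples input layout are assumptions made for illustration, and benchmark protocols typically apply extra steps (such as root-joint alignment) that are not shown here.

```python
import math


def mpjpe(pred, target):
    """Mean per-joint position error for a single pose.

    pred, target: sequences of (x, y, z) joint coordinates in the same
    unit (e.g. millimetres) and the same joint order.
    NOTE: illustrative helper, not code from the paper.
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equally long")
    # Euclidean distance per joint, averaged over all joints.
    total = sum(math.dist(p, t) for p, t in zip(pred, target))
    return total / len(pred)
```

With two joints, one predicted exactly and one off by a 3-4-5 offset, `mpjpe([(0, 0, 0), (3, 4, 0)], [(0, 0, 0), (0, 0, 0)])` averages the distances 0 and 5.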
Abstract
This paper introduces a novel Pre-trained Spatial Temporal Many-to-One (P-STMO) model for the 2D-to-3D human pose estimation task. To reduce the difficulty of capturing spatial and temporal information, we divide this task into two stages: pre-training (Stage I) and fine-tuning (Stage II). In Stage I, a self-supervised pre-training sub-task, termed masked pose modeling, is proposed. The human joints in the input sequence are randomly masked in both spatial and temporal domains. A general form of denoising auto-encoder is exploited to recover the original 2D poses, and in this way the encoder learns to capture spatial and temporal dependencies. In Stage II, the pre-trained encoder is loaded into the STMO model and fine-tuned. The encoder is followed by a many-to-one frame aggregator to predict the 3D pose in the current frame. Especially, an MLP block is utilized as the spatial feature…
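The masking step described for Stage I can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `mask_pose_sequence`, the parameters `temporal_ratio` and `spatial_count`, and the choice to overwrite masked entries with a fixed placeholder value are all hypothetical. The abstract only states that joints are randomly masked in both the spatial and temporal domains and that a denoising auto-encoder recovers the original 2D poses.

```python
import random


def mask_pose_sequence(seq, temporal_ratio, spatial_count,
                       mask_value=(0.0, 0.0), rng=None):
    """Randomly mask a 2D pose sequence in time and space.

    seq: list of frames; each frame is a list of (x, y) joints.
    temporal_ratio: fraction of frames masked entirely (assumed parameter).
    spatial_count: joints masked in each remaining frame (assumed parameter).
    Returns (masked_seq, masked_frames, joint_masks); the input is not modified.
    """
    rng = rng or random.Random()
    num_frames = len(seq)
    num_joints = len(seq[0])

    # Temporal masking: pick whole frames to hide.
    n_masked = int(temporal_ratio * num_frames)
    masked_frames = set(rng.sample(range(num_frames), n_masked))

    masked_seq, joint_masks = [], []
    for t, frame in enumerate(seq):
        if t in masked_frames:
            hidden = set(range(num_joints))
        else:
            # Spatial masking: hide a few joints in each visible frame.
            hidden = set(rng.sample(range(num_joints), spatial_count))
        masked_seq.append(
            [mask_value if j in hidden else xy for j, xy in enumerate(frame)]
        )
        joint_masks.append(hidden)
    return masked_seq, masked_frames, joint_masks
```

In a pre-training loop, `masked_seq` would be fed to the encoder and the reconstruction loss computed against the untouched `seq`; the returned masks make it possible to restrict the loss to the hidden entries if desired.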
Taxonomy
Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Video Surveillance and Tracking Methods
Methods: Non Maximum Suppression · Convolution · Contour Proposal Network
