P-STMO: Pre-Trained Spatial Temporal Many-to-One Model for 3D Human Pose Estimation

Wenkang Shan; Zhenhua Liu; Xinfeng Zhang; Shanshe Wang; Siwei Ma; Wen Gao

arXiv:2203.07628 · cs.CV · August 1, 2022 · 6 cites

TL;DR

This paper proposes P-STMO, a two-stage model for 3D human pose estimation that combines self-supervised pre-training via masked pose modeling with a many-to-one frame aggregator, achieving state-of-the-art accuracy with reduced computational cost.

Contribution

The paper introduces a novel pre-training strategy with masked pose modeling and a many-to-one aggregation for efficient 3D human pose estimation.

Findings

01. Outperforms state-of-the-art methods on benchmarks

02. Achieves 42.1 mm MPJPE on the Human3.6M dataset

03. Reduces computational overhead and model parameters
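
For readers unfamiliar with the metric in the second finding, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth 3D joint positions. The sketch below is a minimal illustration on a single made-up pose; the function name `mpjpe` and the toy coordinates are assumptions for this example, and benchmark protocols add steps such as root-joint alignment and averaging over frames that are omitted here.

```python
import math


def mpjpe(predicted, target):
    """Mean Euclidean distance between matching 3D joints (hypothetical helper)."""
    if len(predicted) != len(target):
        raise ValueError("poses must have the same number of joints")
    distances = [math.dist(p, t) for p, t in zip(predicted, target)]
    return sum(distances) / len(distances)


# Toy 3-joint pose in millimetres; values are made up for illustration.
target = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0), (100.0, 200.0, 0.0)]
predicted = [(3.0, 4.0, 0.0), (100.0, 0.0, 0.0), (100.0, 200.0, 12.0)]

# Per-joint errors are 5, 0, and 12 mm, so the mean is 17 / 3 mm.
error_mm = mpjpe(predicted, target)
```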

Abstract

This paper introduces a novel Pre-trained Spatial Temporal Many-to-One (P-STMO) model for the 2D-to-3D human pose estimation task. To reduce the difficulty of capturing spatial and temporal information, we divide this task into two stages: pre-training (Stage I) and fine-tuning (Stage II). In Stage I, a self-supervised pre-training sub-task, termed masked pose modeling, is proposed. The human joints in the input sequence are randomly masked in both the spatial and temporal domains. A general form of denoising auto-encoder is used to recover the original 2D poses, which lets the encoder capture spatial and temporal dependencies. In Stage II, the pre-trained encoder is loaded into the STMO model and fine-tuned. The encoder is followed by a many-to-one frame aggregator to predict the 3D pose in the current frame. In particular, an MLP block is utilized as the spatial feature…
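
The masking step of Stage I can be pictured with a small sketch. The abstract only states that joints are randomly masked in the spatial and temporal domains; everything else below is an assumption made for illustration, including the function name `mask_pose_sequence`, the choice to hide whole frames for temporal masking, the fixed number of hidden joints per visible frame, and the use of `None` as the mask marker. It is not the authors' implementation.

```python
import random


def mask_pose_sequence(sequence, temporal_ratio, spatial_count, rng):
    """Hide whole frames and individual joints in a 2D pose sequence.

    sequence: list of frames, each a list of (x, y) joint tuples.
    Returns (masked, hidden_frames, hidden_joints); hidden entries are None.
    All parameters are illustrative assumptions, not values from the paper.
    """
    num_frames = len(sequence)
    num_joints = len(sequence[0])

    # Temporal masking: pick a random subset of frames to hide entirely.
    hidden_frames = set(rng.sample(range(num_frames), int(num_frames * temporal_ratio)))

    masked = []
    hidden_joints = {}
    for t, frame in enumerate(sequence):
        if t in hidden_frames:
            masked.append(None)
            continue
        # Spatial masking: hide a random subset of joints in each visible frame.
        hidden = set(rng.sample(range(num_joints), spatial_count))
        hidden_joints[t] = hidden
        masked.append([None if j in hidden else joint for j, joint in enumerate(frame)])
    return masked, hidden_frames, hidden_joints


# Toy input: 10 frames of 17 joints with made-up coordinates.
sequence = [[(float(t), float(j)) for j in range(17)] for t in range(10)]
masked, hidden_frames, hidden_joints = mask_pose_sequence(
    sequence, temporal_ratio=0.5, spatial_count=2, rng=random.Random(0)
)
```

In a denoising auto-encoder setup, the unmasked `sequence` would serve as the reconstruction target, so no labels beyond the 2D poses themselves are needed.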

Code & Models

Repositories

patrick-swk/p-stmo
PyTorch · Official

Taxonomy

Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Video Surveillance and Tracking Methods

Methods: Non Maximum Suppression · Convolution · Contour Proposal Network