SpatioTemporal Learning for Human Pose Estimation in Sparsely-Labeled Videos
Yingying Jiao, Zhigang Wang, Sifan Wu, Shaojing Fan, Zhenguang Liu, Zhuoyue Xu, Zheqi Wu

TL;DR
STDPose is a new framework that improves human pose estimation in sparsely-labeled videos by capturing long-range motion and spatiotemporal dynamics, achieving high accuracy with limited labeled data.
Contribution
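STDPose introduces a Dynamic-Aware Mask for capturing long-range motion context and a system for encoding and aggregating spatiotemporal representations, setting new benchmarks for pose propagation and estimation with minimal labeled data.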
Findings
Outperforms existing methods on large-scale datasets
Achieves competitive results with only 26.7% labeled data
Establishes new benchmarks for pose propagation and estimation
Abstract
Human pose estimation in videos remains a challenge, largely due to the reliance on extensive manual annotation of large datasets, which is expensive and labor-intensive. Furthermore, existing approaches often struggle to capture long-range temporal dependencies and overlook the complementary relationship between temporal pose heatmaps and visual features. To address these limitations, we introduce STDPose, a novel framework that enhances human pose estimation by learning spatiotemporal dynamics in sparsely-labeled videos. STDPose incorporates two key innovations: 1) A novel Dynamic-Aware Mask to capture long-range motion context, allowing for a nuanced understanding of pose changes. 2) A system for encoding and aggregating spatiotemporal representations and motion dynamics to effectively model spatiotemporal relationships, improving the accuracy and robustness of pose estimation.…
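The abstract does not say how the Dynamic-Aware Mask is computed, so the following is only a rough illustration of the general idea of a motion-driven mask, not the authors' method. It assumes a short window of single-channel frames (or feature maps), accumulates each position's absolute difference from a chosen key frame, and keeps the positions that changed the most. The function name `motion_mask` and the `keep_ratio` parameter are hypothetical and do not come from the paper.

```python
import numpy as np


def motion_mask(frames: np.ndarray, key_index: int, keep_ratio: float = 0.25) -> np.ndarray:
    """Return a boolean (H, W) mask marking the most dynamic positions.

    frames:     array of shape (T, H, W), one single-channel frame or
                feature map per time step.
    key_index:  index of the frame whose pose is being estimated.
    keep_ratio: hypothetical hyperparameter, the fraction of positions to keep.
    """
    if frames.ndim != 3:
        raise ValueError("frames must have shape (T, H, W)")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    key = frames[key_index]
    # Accumulate the absolute difference between the key frame and every
    # frame in the window, so motion far from the key frame still counts.
    motion = np.abs(frames - key).sum(axis=0)

    # Keep the k positions with the largest accumulated motion.
    k = max(1, int(round(keep_ratio * motion.size)))
    threshold = np.partition(motion.ravel(), -k)[-k]
    # Ties at the threshold are all kept, so the mask can hold more than
    # k positions (for a static window, every position is kept).
    return motion >= threshold


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    window = rng.random((5, 8, 8))  # toy window: 5 frames of 8x8 values
    mask = motion_mask(window, key_index=2, keep_ratio=0.25)
    print(mask.shape, int(mask.sum()))
```

A mask like this could be used to weight or select features before temporal aggregation. How STDPose actually builds and applies its mask is described in the full paper, not in this summary.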
Taxonomy
Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Advanced Vision and Imaging
