Self-supervised Learning of Pose Embeddings from Spatiotemporal Relations in Videos
\"Omer S\"umer, Tobias Dencker, Bj\"orn Ommer

TL;DR
This paper introduces a self-supervised approach for learning human pose embeddings from videos by leveraging spatiotemporal relations, reducing reliance on manual annotations and improving pose analysis tasks.
Contribution
It proposes a novel self-supervised learning method using spatiotemporal cues and curriculum learning to train pose embeddings without manual labels.
Findings
Embeddings improve human pose estimation accuracy.
Method outperforms some supervised approaches on benchmark datasets.
Repetitive pose mining enhances training reliability.
Abstract
Human pose analysis is presently dominated by deep convolutional networks trained with extensive manual annotations of joint locations and beyond. To avoid the need for expensive labeling, we exploit spatiotemporal relations in training videos for self-supervised learning of pose embeddings. The key idea is to combine temporal ordering and spatial placement estimation as auxiliary tasks for learning pose similarities in a Siamese convolutional network. Since the self-supervised sampling of both tasks from natural videos can result in ambiguous and incorrect training labels, our method employs a curriculum learning idea that starts training with the most reliable data samples and gradually increases the difficulty. To further refine the training process we mine repetitive poses in individual videos which provide reliable labels while removing inconsistencies. Our pose embeddings capture…
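The temporal-ordering task and the curriculum idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`temporal_pairs`, `curriculum_subset`), the frame-gap thresholds `near` and `far`, the linear schedule, and the `start_frac` parameter are all assumptions made for illustration, since the abstract does not specify them. The sketch labels frame pairs from one video as similar when they are temporally close and dissimilar when they are far apart, then exposes only the easiest fraction of samples in early epochs.

```python
import math


def temporal_pairs(num_frames, near=2, far=10):
    """Build (i, j, label) pairs from the frame indices of one video.

    label 1: frames at most `near` apart (assumed to show similar poses).
    label 0: frames at least `far` apart (assumed to show different poses).
    Pairs with a gap in between are skipped as ambiguous.
    The thresholds are hypothetical, not taken from the paper.
    """
    pairs = []
    for i in range(num_frames):
        for j in range(i + 1, num_frames):
            gap = j - i
            if gap <= near:
                pairs.append((i, j, 1))
            elif gap >= far:
                pairs.append((i, j, 0))
    return pairs


def curriculum_subset(samples, difficulty, epoch, num_epochs, start_frac=0.25):
    """Return the easiest fraction of `samples` for the given epoch.

    The fraction grows linearly from `start_frac` at epoch 0 to 1.0 at the
    last epoch. `difficulty` is a hypothetical scoring function where a
    lower score means a more reliable sample.
    """
    ranked = sorted(samples, key=difficulty)
    progress = epoch / max(num_epochs - 1, 1)
    frac = start_frac + (1.0 - start_frac) * progress
    keep = max(1, math.ceil(frac * len(ranked)))
    return ranked[:keep]


if __name__ == "__main__":
    pairs = temporal_pairs(num_frames=12)

    # Assumed difficulty: a positive pair gets harder as its gap grows,
    # a negative pair gets harder as its gap shrinks.
    def pair_difficulty(pair):
        i, j, label = pair
        gap = j - i
        return gap if label == 1 else -gap

    for epoch in range(4):
        subset = curriculum_subset(pairs, pair_difficulty, epoch, num_epochs=4)
        print(epoch, len(subset))
```

In a real pipeline the difficulty score would come from whatever reliability signal is available for each sample. The gap-based score above is only a stand-in so the sketch runs on its own.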
