Self-Supervised Multi-View Synchronization Learning for 3D Pose Estimation
Simon Jenni, Paolo Favaro

TL;DR
This paper introduces a self-supervised multi-view synchronization learning approach that leverages unlabeled video data to improve 3D human pose estimation, especially when only small annotated datasets are available.
Contribution
The paper proposes a novel self-supervised pre-training task, based on multi-view synchronization, that improves 3D pose estimation models and reduces their reliance on large labeled datasets.
Findings
Achieves state-of-the-art 3D pose estimation results on Human3.6M.
Effectively leverages unlabeled multi-view video data for pre-training.
Improves 3D pose estimation performance when only limited annotated data is available.
Abstract
Current state-of-the-art methods cast monocular 3D human pose estimation as a learning problem by training neural networks on large data sets of images and corresponding skeleton poses. In contrast, we propose an approach that can exploit small annotated data sets by fine-tuning networks pre-trained via self-supervised learning on (large) unlabeled data sets. To drive such networks towards supporting 3D pose estimation during the pre-training step, we introduce a novel self-supervised feature learning task designed to focus on the 3D structure in an image. We exploit images extracted from videos captured with a multi-view camera system. The task is to classify whether two images depict two views of the same scene up to a rigid transformation. In a multi-view data set, where objects deform in a non-rigid manner, a rigid transformation occurs only between two views taken at the exact same…
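The pretext task described in the abstract can be illustrated with a minimal sketch of how training pairs might be drawn from a multi-view recording. The abstract does not say how pairs are sampled, so everything below is an assumption for illustration: the function name `sample_sync_pair`, the parameter `p_sync`, and the uniform choice of cameras and frames are all hypothetical. The only part taken from the text is the labeling rule: two views are related by a rigid transformation only if they were captured at the same time.

```python
import random


def sample_sync_pair(num_views, num_frames, p_sync=0.5, rng=random):
    """Draw one hypothetical training pair for the synchronization task.

    Returns ((view_a, t_a), (view_b, t_b), label), where label is 1 if the
    two frames share a timestamp (synchronized) and 0 otherwise.
    Assumes num_views >= 2 and num_frames >= 2.
    """
    # Two distinct cameras, so the pair always shows different viewpoints.
    view_a, view_b = rng.sample(range(num_views), 2)
    t_a = rng.randrange(num_frames)

    if rng.random() < p_sync:
        # Positive pair: same time instant seen from two cameras.
        t_b, label = t_a, 1
    else:
        # Negative pair: a nonzero cyclic offset guarantees t_b != t_a.
        offset = rng.randrange(1, num_frames)
        t_b, label = (t_a + offset) % num_frames, 0

    return (view_a, t_a), (view_b, t_b), label


# Usage: draw a small batch of index pairs with a fixed seed.
rng = random.Random(0)
batch = [sample_sync_pair(num_views=4, num_frames=100, rng=rng) for _ in range(8)]
```

A binary classifier trained on the corresponding image pairs would then predict `label`. In practice one might prefer negatives that are close in time, since those are harder to tell apart, but that choice is not specified in the text above.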
