TL;DR
CanonPose introduces a self-supervised method for monocular 3D human pose estimation that leverages multi-view consistency without requiring camera calibration, enabling learning from unlabeled, moving-camera data in diverse environments.
Contribution
It presents a novel self-supervised framework that disentangles 3D pose and camera rotation from unlabeled multi-view images without calibration, including an extension for static camera setups.
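The disentanglement described above can be illustrated with a small numerical sketch. This is not the authors' implementation: it assumes an orthographic projection, a scale-normalized 2D comparison, and hypothetical helper names (`project`, `normalize`, `cross_view_error`). It shows one plausible form of a multi-view consistency check: a canonical 3D pose estimated from one view, combined with the camera rotation estimated for another view, should reproject onto the 2D pose observed in that other view.

```python
import numpy as np

def rotation_y(angle):
    """Rotation matrix about the vertical (y) axis; stands in for a camera rotation."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def project(pose_3d, rotation):
    """Rotate a canonical pose (J x 3) into the camera frame and drop depth.
    Orthographic projection is an assumption made for this sketch."""
    cam = pose_3d @ rotation.T
    return cam[:, :2]

def normalize(pose_2d):
    """Center the 2D pose and scale it to unit Frobenius norm."""
    centered = pose_2d - pose_2d.mean(axis=0)
    return centered / np.linalg.norm(centered)

def reprojection_error(pose_3d, rotation, observed_2d):
    """Mean absolute difference between normalized reprojection and observation."""
    diff = normalize(project(pose_3d, rotation)) - normalize(observed_2d)
    return float(np.abs(diff).mean())

def cross_view_error(canon_poses, rotations, observed_2d):
    """Pair the canonical pose from every view a with the rotation of every view b
    and compare against the 2D pose observed in view b."""
    n = len(rotations)
    errors = [
        reprojection_error(canon_poses[a], rotations[b], observed_2d[b])
        for a in range(n)
        for b in range(n)
    ]
    return float(np.mean(errors))

# Synthetic example: one true pose seen from three uncalibrated viewpoints.
rng = np.random.default_rng(0)
true_pose = rng.normal(size=(17, 3))
rotations = [rotation_y(a) for a in (0.0, 0.7, -1.2)]
observed = [project(true_pose, R) for R in rotations]

# Per-view estimates that agree on the canonical pose satisfy the constraint.
consistent = cross_view_error([true_pose] * 3, rotations, observed)

# Per-view estimates that disagree are penalized.
noisy_poses = [true_pose + rng.normal(scale=0.3, size=true_pose.shape) for _ in rotations]
inconsistent = cross_view_error(noisy_poses, rotations, observed)
```

In a learning setup, an error of this kind would act as a training signal on the per-view pose and rotation predictions, so no 3D labels or camera calibration enter the computation. The synthetic rotations here only generate test data.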
Findings
Achieves competitive results on the Human3.6M and MPI-INF-3DHP datasets.
Applies successfully to the in-the-wild SkiPose dataset.
Does not require camera calibration or labeled 3D data.
Abstract
Human pose estimation from single images is a challenging problem in computer vision that requires large amounts of labeled training data to be solved accurately. Unfortunately, for many human activities (e.g., outdoor sports) such training data does not exist and is hard or even impossible to acquire with traditional motion capture systems. We propose a self-supervised approach that learns a single image 3D pose estimator from unlabeled multi-view data. To this end, we exploit multi-view consistency constraints to disentangle the observed 2D pose into the underlying 3D pose and camera rotation. In contrast to most existing methods, we do not require calibrated cameras and can therefore learn from moving cameras. Nevertheless, in the case of a static camera setup, we present an optional extension to include constant relative camera rotations over multiple views into our framework. Key to…
