TL;DR
This paper introduces a method for estimating 3D human pose from a single view without 3D supervision. It leverages pre-trained 2D motion diffusion models and multi-view ancestral sampling, and reports better cross-domain performance on the Yoga dataset than state-of-the-art supervised and unsupervised methods.
Contribution
It proposes conditional multi-view ancestral sampling (cMAS), a technique that estimates 3D human pose without 3D supervision by combining 2D diffusion priors with multi-view consistency, while conditioning the 3D pose on the given 2D poses and human anatomical constraints.
Findings
Outperforms state-of-the-art supervised and unsupervised methods in cross-domain evaluation on the Yoga dataset.
Effective in estimating extreme human poses without 3D supervision.
Demonstrates cross-domain generalization capabilities.
Abstract
We propose a method of estimating a 3D human pose from a single view without 3D supervision. The key to our method is to leverage the 2D diffusion priors of motion diffusion models (MDMs) pre-trained on large 2D human pose datasets. Specifically, we extend multi-view ancestral sampling of diffusion models to the task of 2D-3D lifting of human pose. To this end, we newly propose a conditional multi-view ancestral sampling (cMAS) that optimizes the 3D pose such that its multi-view projections follow the manifold in 2D MDM noise space, while conditioning the 3D pose to match the given 2D poses and anatomical constraints of humans. Experiments on the Yoga dataset demonstrate that our method achieves better cross-domain performance compared to state-of-the-art supervised and unsupervised 3D pose estimation methods, including extreme human poses where 3D supervision is unavailable. Code is…
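The conditioning terms described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it omits the diffusion prior entirely and only shows how a candidate 3D pose might be projected into several views and scored against a given 2D pose and a simple anatomical constraint. The orthographic camera, the cameras evenly spaced around the vertical axis, the left/right bone-length symmetry penalty, and all function names are assumptions made for illustration.

```python
import math

def rotate_y(pose, angle):
    """Rotate a list of (x, y, z) joints about the vertical (y) axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x + s * z, y, -s * x + c * z) for x, y, z in pose]

def project(pose):
    """Orthographic projection (an assumption): drop depth, keep (x, y)."""
    return [(x, y) for x, y, _ in pose]

def multi_view_projections(pose, num_views):
    """Project the pose from num_views cameras evenly spaced around the y axis."""
    return [
        project(rotate_y(pose, 2 * math.pi * k / num_views))
        for k in range(num_views)
    ]

def reprojection_loss(pose, observed_2d):
    """Mean squared distance between the view-0 projection and the given 2D pose."""
    proj = project(pose)
    total = sum(
        (px - ox) ** 2 + (py - oy) ** 2
        for (px, py), (ox, oy) in zip(proj, observed_2d)
    )
    return total / len(pose)

def bone_length(pose, bone):
    """Euclidean length of the bone joining joint indices (a, b)."""
    a, b = bone
    return math.dist(pose[a], pose[b])

def symmetry_loss(pose, bone_pairs):
    """Hypothetical anatomical term: penalise unequal left/right bone lengths."""
    return sum(
        (bone_length(pose, left) - bone_length(pose, right)) ** 2
        for left, right in bone_pairs
    )

# Toy 5-joint skeleton: pelvis, left hip, left knee, right hip, right knee.
pose = [
    (0.0, 0.0, 0.0),
    (-0.1, 0.0, 0.0),
    (-0.1, -0.4, 0.1),
    (0.1, 0.0, 0.0),
    (0.1, -0.4, -0.1),
]
bone_pairs = [((1, 2), (3, 4))]  # left thigh vs. right thigh

views = multi_view_projections(pose, 4)
cond = reprojection_loss(pose, project(pose)) + symmetry_loss(pose, bone_pairs)
```

In a full pipeline, terms like these would be combined with a prior that scores each of the `views` using the pre-trained 2D motion diffusion model; that part is deliberately left out here because the abstract does not give enough detail to sketch it faithfully.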
