Unsupervised Monocular 3D Keypoint Discovery from Multi-View Diffusion Priors
Subin Jeon, In Cho, Junyoung Hong, Seon Joo Kim

TL;DR
KeyDiff3D is an unsupervised framework that estimates 3D keypoints from a single image by leveraging multi-view diffusion priors, eliminating the need for manual annotations or multi-view data.
Contribution
It introduces a novel method that uses a pretrained multi-view diffusion model to generate multi-view cues from a single image for 3D keypoint estimation.
Findings
Estimates 3D keypoints accurately across diverse datasets.
Demonstrates strong generalization to in-the-wild images.
Enables manipulation of 3D objects from a single image.
Abstract
This paper introduces KeyDiff3D, a framework for unsupervised monocular 3D keypoint estimation that accurately predicts 3D keypoints from a single image. While previous methods rely on manual annotations or calibrated multi-view images, both of which are expensive to collect, our method enables monocular 3D keypoint estimation using only a collection of single-view images. To achieve this, we leverage powerful geometric priors embedded in a pretrained multi-view diffusion model. In our framework, this model generates multi-view images from a single image, which serve as a supervision signal that provides 3D geometric cues to our model. We also use the diffusion model as a powerful 2D multi-view feature extractor and construct 3D feature volumes from its intermediate representations. This transforms implicit 3D priors learned by the diffusion model into explicit 3D features. Beyond accurate…
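The abstract's step of turning 2D multi-view features into a 3D feature volume can be illustrated with a generic unprojection sketch. This is not the authors' implementation: the function name `build_feature_volume`, the pinhole projection matrices, nearest-pixel sampling, and averaging across views are all assumptions chosen for illustration, and the abstract does not say how KeyDiff3D aggregates features.

```python
import numpy as np

def build_feature_volume(feats, projs, grid_size=8, extent=1.0):
    """Hypothetical sketch: lift per-view 2D features into a voxel grid.

    feats: (V, C, H, W) feature maps, one per view.
    projs: (V, 3, 4) projection matrices mapping homogeneous world
           points to homogeneous pixel coordinates (assumed known).
    Returns a (C, D, D, D) volume with D = grid_size.
    """
    V, C, H, W = feats.shape
    axis = np.linspace(-extent, extent, grid_size)
    xs, ys, zs = np.meshgrid(axis, axis, axis, indexing="ij")
    # Homogeneous voxel centers, shape (N, 4).
    pts = np.stack([xs, ys, zs, np.ones_like(xs)], axis=-1).reshape(-1, 4)

    volume = np.zeros((C, pts.shape[0]))
    counts = np.zeros(pts.shape[0])
    for v in range(V):
        uvw = pts @ projs[v].T                      # (N, 3)
        depth = uvw[:, 2]
        valid = depth > 1e-6                        # in front of the camera
        safe_depth = np.where(valid, depth, 1.0)    # avoid division by zero
        col = np.round(uvw[:, 0] / safe_depth).astype(int)
        row = np.round(uvw[:, 1] / safe_depth).astype(int)
        valid &= (col >= 0) & (col < W) & (row >= 0) & (row < H)
        # Nearest-pixel sampling of this view's features at each voxel.
        volume[:, valid] += feats[v][:, row[valid], col[valid]]
        counts[valid] += 1

    # Average over the views that actually see each voxel.
    volume /= np.maximum(counts, 1)
    return volume.reshape(C, grid_size, grid_size, grid_size)
```

In this sketch each voxel center is projected into every view and receives the mean of the features it lands on. A learned aggregation or bilinear sampling could replace the averaging and rounding steps.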
Taxonomy
Topics: Image Processing and 3D Reconstruction · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
