Multi-view Consistency as Supervisory Signal for Learning Shape and Pose Prediction
Shubham Tulsiani, Alexei A. Efros, Jitendra Malik

TL;DR
This paper presents a framework for learning to predict 3D shape and pose from a single image without direct supervision for either, using multi-view observations from unknown poses as the training signal. It is evaluated on ShapeNet and on a dataset of online product images.
Contribution
The authors propose a training setup that enforces geometric consistency between the shape and pose predicted independently from two views of the same instance, so that neither shape labels nor pose labels are required.
Findings
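Achieves performance on ShapeNet that is competitive with previous techniques relying on stronger forms of supervision.
Applies to a training set of online product images, a setting the authors describe as beyond the scope of existing techniques.
Learns to predict shape in an emergent canonical (view-agnostic) frame, together with a corresponding pose predictor, without direct supervision.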
Abstract
We present a framework for learning single-view shape and pose prediction without using direct supervision for either. Our approach allows leveraging multi-view observations from unknown poses as supervisory signal during training. Our proposed training setup enforces geometric consistency between the independently predicted shape and pose from two views of the same instance. We consequently learn to predict shape in an emergent canonical (view-agnostic) frame along with a corresponding pose predictor. We show empirical and qualitative results using the ShapeNet dataset and observe encouragingly competitive performance to previous techniques which rely on stronger forms of supervision. We also demonstrate the applicability of our framework in a realistic setting which is beyond the scope of existing techniques: using a training dataset comprised of online product images where the…
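The consistency idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the shape is a point set, the pose is a rotation matrix, and the observation is a binary silhouette rendered orthographically, and it uses a plain mean-squared error as the loss. The names `rotation_about_y`, `project_to_mask`, and `consistency_loss` are hypothetical and chosen only for illustration.

```python
import numpy as np


def rotation_about_y(angle_rad):
    """Rotation matrix for a turn of angle_rad about the y axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])


def project_to_mask(points, rotation, resolution=16):
    """Orthographic silhouette of an (N, 3) point set after rotation.

    Assumption: points lie inside the unit ball, so the rotated x and y
    coordinates stay within [-1, 1] and map onto the pixel grid.
    """
    cam = points @ rotation.T
    ij = np.floor((cam[:, :2] + 1.0) * 0.5 * resolution).astype(int)
    ij = np.clip(ij, 0, resolution - 1)
    mask = np.zeros((resolution, resolution), dtype=float)
    mask[ij[:, 1], ij[:, 0]] = 1.0  # row = y pixel, column = x pixel
    return mask


def consistency_loss(shape_from_a, pose_from_b, observed_mask_b, resolution=16):
    """Compare view B's observed mask with the shape predicted from view A,
    rendered under the pose predicted from view B."""
    rendered = project_to_mask(shape_from_a, pose_from_b, resolution)
    return float(np.mean((rendered - observed_mask_b) ** 2))


# Toy instance: a bar along the x axis, observed in view B after a 60 degree turn.
bar = np.stack([np.linspace(-0.8, 0.8, 50), np.zeros(50), np.zeros(50)], axis=1)
true_pose_b = rotation_about_y(np.pi / 3)
mask_b = project_to_mask(bar, true_pose_b)

loss_good = consistency_loss(bar, true_pose_b, mask_b)  # consistent shape and pose
loss_bad = consistency_loss(bar, np.eye(3), mask_b)     # wrong pose for view B
```

In this toy setting the loss is lowest when the shape predicted from one view and the pose predicted from the other jointly explain the second view's observation. That is the kind of signal the abstract describes for training both predictors without shape or pose labels. In a real training setup the rendering step would need to be differentiable, which this hard-threshold projection is not.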
