TL;DR
This paper learns a generic 3D representation by training a multi-task ConvNet on two foundational proxy tasks: object-centric camera pose estimation and wide baseline feature matching. The representation generalizes to new 3D tasks without fine-tuning and achieves state-of-the-art wide baseline feature matching.
Contribution
It presents a novel multi-task learning approach for 3D representation that generalizes to multiple tasks without fine-tuning and provides a large-scale dataset for further research.
Findings
Representation generalizes to novel 3D tasks without fine-tuning
Achieves state-of-the-art wide baseline feature matching
Estimates camera pose with accuracy comparable to humans
Abstract
Though a large body of computer vision research has investigated developing generic semantic representations, efforts towards developing a similar representation for 3D have been limited. In this paper, we learn a generic 3D representation through solving a set of foundational proxy 3D tasks: object-centric camera pose estimation and wide baseline feature matching. Our method is based upon the premise that by providing supervision over a set of carefully selected foundational tasks, generalization to novel tasks and abstraction capabilities can be achieved. We empirically show that the internal representation of a multi-task ConvNet trained to solve the above core problems generalizes to novel 3D tasks (e.g., scene layout estimation, object pose estimation, surface normal estimation) without the need for fine-tuning and shows traits of abstraction abilities (e.g., cross-modality pose…
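To make the multi-task idea in the abstract concrete, the following is a minimal sketch of how a joint objective over the two proxy tasks could be written. It is not the authors' formulation: the squared-error pose term, the contrastive matching term, the `margin`, the two weights, and the choice to apply the pose term only to matching pairs are all assumptions made for illustration, and every function name is hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pose_loss(pred_pose, true_pose):
    """Squared-error regression loss on a relative camera pose vector.

    Assumption: the pose is encoded as a flat vector (e.g. 6 numbers for 6DOF).
    """
    return sum((p - t) ** 2 for p, t in zip(pred_pose, true_pose))


def matching_loss(emb_a, emb_b, is_match, margin=1.0):
    """Contrastive loss on two patch embeddings (an assumed choice).

    Matching pairs are pulled together; non-matching pairs are pushed
    apart until their distance reaches `margin`.
    """
    d = l2_distance(emb_a, emb_b)
    if is_match:
        return d ** 2
    return max(0.0, margin - d) ** 2


def joint_loss(pred_pose, true_pose, emb_a, emb_b, is_match,
               pose_weight=1.0, match_weight=1.0, margin=1.0):
    """Weighted sum of the two proxy-task losses for one patch pair.

    Assumption: relative pose is only defined for matching pairs,
    so the pose term is skipped for non-matches.
    """
    total = match_weight * matching_loss(emb_a, emb_b, is_match, margin)
    if is_match:
        total += pose_weight * pose_loss(pred_pose, true_pose)
    return total
```

In a setup like this, both terms would backpropagate into a shared encoder, which is what would let a single internal representation serve both tasks; how the actual network shares weights and balances the tasks is specified in the paper, not here.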
