TL;DR
This paper introduces the 3D INterpreter Network (3D-INN), an end-to-end model that estimates 3D object structure from a single image. It is trained on both real and synthetic data, which is made possible by a novel projection layer and by using keypoint heatmaps as an intermediate representation.
Contribution
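The paper presents a framework that combines real 2D-annotated images with synthetic 3D data for 3D structure estimation. A projection layer lets 2D annotations supervise the predicted 3D structure, and keypoint heatmaps serve as the intermediate representation between the two stages.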
Findings
Achieves state-of-the-art results in 2D keypoint estimation.
Demonstrates accurate 3D structure recovery from single images.
Enables applications like 3D rendering and image retrieval.
Abstract
Understanding 3D object structure from a single image is an important but difficult task in computer vision, mostly due to the lack of 3D object annotations in real images. Previous work tackles this problem by either solving an optimization task given 2D keypoint positions, or training on synthetic data with ground truth 3D information. In this work, we propose 3D INterpreter Network (3D-INN), an end-to-end framework which sequentially estimates 2D keypoint heatmaps and 3D object structure, trained on both real 2D-annotated images and synthetic 3D data. This is made possible mainly by two technical innovations. First, we propose a Projection Layer, which projects estimated 3D structure to 2D space, so that 3D-INN can be trained to predict 3D structural parameters supervised by 2D annotations on real images. Second, heatmaps of keypoints serve as an intermediate representation…
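To make the Projection Layer idea concrete, here is a minimal NumPy sketch of how projecting an estimated 3D structure into the image plane lets 2D keypoint annotations act as supervision. It assumes a simple pinhole camera with a rotation, a translation, and a focal length; the function names `project_keypoints` and `reprojection_loss` are hypothetical, and the paper's exact parameterization and loss may differ.

```python
import numpy as np


def project_keypoints(points_3d, rotation, translation, focal_length):
    """Project N x 3 keypoints to N x 2 image coordinates (assumed pinhole camera)."""
    # Rotate the object, then translate it into the camera frame.
    cam = points_3d @ rotation.T + translation
    # Perspective divide by depth, scaled by the focal length.
    return focal_length * cam[:, :2] / cam[:, 2:3]


def reprojection_loss(points_3d, rotation, translation, focal_length, keypoints_2d):
    """Mean squared distance between projected 3D keypoints and 2D annotations."""
    projected = project_keypoints(points_3d, rotation, translation, focal_length)
    return float(np.mean(np.sum((projected - keypoints_2d) ** 2, axis=1)))


# Toy example: three keypoints placed two units in front of the camera.
points_3d = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
rotation = np.eye(3)
translation = np.array([0.0, 0.0, 2.0])
keypoints_2d = project_keypoints(points_3d, rotation, translation, focal_length=2.0)
loss = reprojection_loss(points_3d, rotation, translation, 2.0, keypoints_2d)
```

Because the loss is computed only in 2D, a real image with 2D keypoint labels can supervise the 3D parameters without any 3D annotation. In a trained network the same computation would be written in a differentiable framework so that gradients flow back to the 3D estimates.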
