TL;DR
This paper introduces a semantic keypoint-based method for estimating the 6-DoF pose of an object from a single RGB image. It handles both textured and textureless objects, and its training data can be generated semi-automatically with minimal manual labeling.
Contribution
The paper combines semantic keypoints predicted by a convnet with a deformable shape model, and adds a semi-automatic technique for generating training data from unlabeled videos with minimal manual effort.
Findings
Accurately recovers 6-DoF pose in cluttered scenes
Performs well on multiple large-scale datasets
Achieves state-of-the-art or comparable results
Abstract
This paper presents an approach to estimating the continuous 6-DoF pose of an object from a single RGB image. The approach combines semantic keypoints predicted by a convolutional network (convnet) with a deformable shape model. Unlike prior investigators, we are agnostic to whether the object is textured or textureless, as the convnet learns the optimal representation from the available training-image data. Furthermore, the approach can be applied to instance- and class-based pose recovery. Additionally, we accompany our main pipeline with a technique for semi-automatic data generation from unlabeled videos. This procedure allows us to train the learnable components of our method with minimal manual intervention in the labeling process. Empirically, we show that our approach can accurately recover the 6-DoF object pose for both instance- and class-based scenarios even against a…
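To make the keypoints-plus-shape-model idea concrete, here is a minimal sketch of fitting a pose to detected 2D keypoints. It is not the paper's implementation. It assumes a single rigid 3D keypoint model (the instance-based case, with no deformation coefficients), a weak-perspective camera, and per-keypoint confidences used as least-squares weights. The function name `fit_weak_perspective` and all parameter names are hypothetical. Under weak perspective the recovered scale stands in for depth, so this yields a rotation, a scale, and a 2D translation rather than a full metric 6-DoF pose.

```python
import numpy as np


def fit_weak_perspective(kp2d, conf, model3d):
    """Fit scale, rotation and 2D translation so that
    kp2d[i] ~= scale * R[:2] @ model3d[i] + t  (weak perspective).

    kp2d:    (N, 2) detected keypoint locations (e.g. heatmap maxima)
    conf:    (N,)   positive detection confidences, used as weights
    model3d: (N, 3) matching 3D keypoints on the object model
    """
    w = conf / conf.sum()
    mu2 = w @ kp2d                      # weighted 2D centroid
    mu3 = w @ model3d                   # weighted 3D centroid
    X = model3d - mu3
    Y = kp2d - mu2

    # Weighted least squares for an unconstrained 2x3 affine camera M.
    A = X.T @ (w[:, None] * X)          # (3, 3), symmetric
    B = Y.T @ (w[:, None] * X)          # (2, 3)
    M = np.linalg.solve(A, B.T).T       # M = B @ inv(A)

    # Project M onto the nearest scaled matrix with orthonormal rows.
    U, sig, Vt = np.linalg.svd(M, full_matrices=False)
    R2 = U @ Vt                         # first two rows of the rotation
    scale = sig.mean()
    R = np.vstack([R2, np.cross(R2[0], R2[1])])  # complete to det = +1

    t = mu2 - scale * (R2 @ mu3)
    return scale, R, t


def demo():
    """Synthetic, noise-free check: project a random model and recover the pose."""
    rng = np.random.default_rng(0)
    model = rng.normal(size=(8, 3))
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(Q) < 0:            # make Q a proper rotation
        Q[:, 0] *= -1
    s_true, t_true = 2.5, np.array([10.0, -4.0])
    kp = s_true * model @ Q[:2].T + t_true
    conf = rng.uniform(0.5, 1.0, size=8)
    return fit_weak_perspective(kp, conf, model), (s_true, Q, t_true)


if __name__ == "__main__":
    (s, R, t), (s_true, Q, t_true) = demo()
    print("scale error:", abs(s - s_true))
    print("rotation error:", np.abs(R - Q).max())
    print("translation error:", np.abs(t - t_true).max())
```

Weighting by confidence lets unreliable keypoints (occluded or ambiguous ones) contribute less to the fit. A deformable model would typically add shape coefficients over a set of basis shapes and optimize them jointly with the pose; that extension is omitted here.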
