TL;DR
SilhoNet is a novel CNN-based method that estimates 6D object pose from monocular images by predicting silhouettes and translation, outperforming existing monocular approaches on the YCB-Video dataset.
Contribution
Introduces SilhoNet, a new monocular 6D pose estimation method using silhouette prediction and CNNs, eliminating the need for RGB-D sensors.
Findings
Outperforms two state-of-the-art monocular pose estimation networks on the YCB-Video dataset
Effectively predicts 6D pose using only monocular RGB images
Abstract
Autonomous robot manipulation involves estimating the translation and orientation of the object to be manipulated as a 6-degree-of-freedom (6D) pose. Methods using RGB-D data have shown great success in solving this problem. However, there are situations where cost constraints or the working environment may limit the use of RGB-D sensors. When limited to monocular camera data only, the problem of object pose estimation is very challenging. In this work, we introduce a novel method called SilhoNet that predicts 6D object pose from monocular images. We use a Convolutional Neural Network (CNN) pipeline that takes in Region of Interest (ROI) proposals to simultaneously predict an intermediate silhouette representation for objects with an associated occlusion mask and a 3D translation vector. The 3D orientation is then regressed from the predicted silhouettes. We show that our method…
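To make the term "6D pose" concrete, the following is a minimal sketch of how a 3D translation vector and a 3D orientation combine into one rigid transform. It is not SilhoNet code. The unit-quaternion representation and the names `quat_to_rot`, `pose_matrix`, and `transform_point` are assumptions chosen for illustration, since this summary does not say how the network parameterizes orientation.

```python
import math

def quat_to_rot(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix.

    Assumption: orientation is given as a quaternion; it is normalized
    here so that slightly non-unit inputs still yield a valid rotation.
    """
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def pose_matrix(q, t):
    """Build a 4x4 homogeneous transform from orientation q and translation t."""
    R = quat_to_rot(q)
    return [
        R[0] + [t[0]],
        R[1] + [t[1]],
        R[2] + [t[2]],
        [0.0, 0.0, 0.0, 1.0],
    ]

def transform_point(T, p):
    """Map a point from the object frame into the camera frame."""
    x, y, z = p
    return tuple(T[i][0] * x + T[i][1] * y + T[i][2] * z + T[i][3] for i in range(3))

# Hypothetical pose: 90 degree rotation about the z axis, then a shift of
# (0.1, 0.2, 0.5) in camera coordinates.
half = math.radians(90) / 2
T = pose_matrix((math.cos(half), 0.0, 0.0, math.sin(half)), (0.1, 0.2, 0.5))
point_in_camera = transform_point(T, (1.0, 0.0, 0.0))
```

Three of the six degrees of freedom come from the translation and three from the rotation; the 4x4 matrix is just a convenient way to apply both to model points at once.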
