Monocular One-Shot Metric-Depth Alignment for RGB-Based Robot Grasping
Teng Guo, Baichuan Huang, Jingjin Yu

TL;DR
This paper introduces MOMA, a method that recovers metric depth from a single RGB image by aligning the output of a monocular depth estimation model with sparse ground-truth depth points, with the goal of supporting RGB-based robotic grasping.
Contribution
MOMA is the first framework to perform one-shot metric depth alignment for RGB images, enabling accurate depth estimation without retraining models for specific setups.
Findings
MOMA achieves high success rates in robotic grasping tasks.
MOMA supports fine-tuning for transparent objects.
MOMA demonstrates strong generalization in real-world experiments.
Abstract
Accurate 6D object pose estimation is a prerequisite for successfully completing robotic prehensile and non-prehensile manipulation tasks. At present, 6D pose estimation for robotic manipulation generally relies on depth sensors based on, e.g., structured light, time-of-flight, and stereo-vision, which can be expensive, produce noisy output (as compared with RGB cameras), and fail to handle transparent objects. On the other hand, state-of-the-art monocular depth estimation models (MDEMs) provide only affine-invariant depths up to an unknown scale and shift. Metric MDEMs achieve some successful zero-shot results on public datasets, but fail to generalize. We propose a novel framework, Monocular One-shot Metric-depth Alignment (MOMA), to recover metric depth from a single RGB image, through a one-shot adaptation building on MDEM techniques. MOMA performs scale-rotation-shift alignments…
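The abstract notes that affine-invariant models predict depth only up to an unknown scale and shift. As a minimal sketch of what resolving those two unknowns from sparse ground-truth points can look like, the snippet below fits metric ≈ scale × relative + shift by ordinary least squares. It is an illustration under stated assumptions, not MOMA's procedure: the rotation component of the scale-rotation-shift alignment mentioned in the abstract is not modeled, the model output is assumed to be affine in depth (some models predict inverse depth instead), and the function names and sample values are hypothetical.

```python
from statistics import fmean


def fit_scale_shift(relative, metric):
    """Least-squares fit of metric ~= scale * relative + shift.

    relative: model depths sampled at the reference pixels (unitless).
    metric:   known metric depths at the same pixels (e.g. metres).
    Both names are hypothetical and used only for this sketch.
    """
    if len(relative) != len(metric) or len(relative) < 2:
        raise ValueError("need at least two paired samples")
    mean_r = fmean(relative)
    mean_m = fmean(metric)
    # Closed-form simple linear regression: scale = cov(r, m) / var(r).
    var_r = sum((r - mean_r) ** 2 for r in relative)
    if var_r == 0.0:
        raise ValueError("relative depths must not all be equal")
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, metric))
    scale = cov / var_r
    shift = mean_m - scale * mean_r
    return scale, shift


def to_metric(relative, scale, shift):
    """Apply the fitted affine map to any list of relative depths."""
    return [scale * r + shift for r in relative]


# Hypothetical sample: four reference pixels whose metric depths were
# generated with scale = 2.0 and shift = 0.3 for illustration.
relative_at_refs = [0.10, 0.35, 0.60, 0.90]
metric_at_refs = [0.50, 1.00, 1.50, 2.10]

scale, shift = fit_scale_shift(relative_at_refs, metric_at_refs)
aligned = to_metric([0.50], scale, shift)
```

With noise-free samples like these, two points would already determine the fit; using more reference points averages out measurement noise, and a robust estimator could be swapped in if some reference depths are outliers.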
Taxonomy
Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Advanced Vision and Imaging
