Monocular One-Shot Metric-Depth Alignment for RGB-Based Robot Grasping

Teng Guo; Baichuan Huang; Jingjin Yu

arXiv:2506.17110 · cs.RO · June 23, 2025

TL;DR

This paper introduces MOMA, a novel method for accurately estimating metric depth from a single RGB image by aligning monocular depth models with sparse ground-truth points, enhancing robotic grasping tasks.

Contribution

MOMA is the first framework to perform one-shot metric depth alignment for RGB images, enabling accurate depth estimation without retraining models for specific setups.

Findings

01

MOMA achieves high success rates in robotic grasping tasks.

02

Supports fine-tuning for transparent objects.

03

Demonstrates strong generalization in real-world experiments.

Abstract

Accurate 6D object pose estimation is a prerequisite for successfully completing robotic prehensile and non-prehensile manipulation tasks. At present, 6D pose estimation for robotic manipulation generally relies on depth sensors based on, e.g., structured light, time-of-flight, and stereo-vision, which can be expensive, produce noisy output (as compared with RGB cameras), and fail to handle transparent objects. On the other hand, state-of-the-art monocular depth estimation models (MDEMs) provide only affine-invariant depths up to an unknown scale and shift. Metric MDEMs achieve some successful zero-shot results on public datasets, but fail to generalize. We propose a novel framework, Monocular One-shot Metric-depth Alignment (MOMA), to recover metric depth from a single RGB image, through a one-shot adaptation building on MDEM techniques. MOMA performs scale-rotation-shift alignments…
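
To make the "unknown scale and shift" in the abstract concrete, here is a minimal sketch of the simpler scale-shift case: fitting metric ≈ scale × predicted + shift by ordinary least squares from a few pixels whose metric depth is known. This is not MOMA's method, which the abstract describes as a scale-rotation-shift alignment. The function names and sample values are hypothetical, and the sketch assumes the alignment is applied directly in depth space rather than inverse-depth space.

```python
from typing import List, Sequence, Tuple


def fit_scale_shift(pred: Sequence[float], metric: Sequence[float]) -> Tuple[float, float]:
    """Closed-form least-squares fit of metric ~= scale * pred + shift."""
    if len(pred) != len(metric) or len(pred) < 2:
        raise ValueError("need at least two paired samples")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_m = sum(metric) / n
    # Variance and covariance terms (unnormalized; the 1/n factors cancel).
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0.0:
        raise ValueError("predicted depths are constant; scale is unidentifiable")
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(pred, metric))
    scale = cov / var_p
    shift = mean_m - scale * mean_p
    return scale, shift


def apply_scale_shift(pred: Sequence[float], scale: float, shift: float) -> List[float]:
    """Map affine-invariant depth values to metric depth with the fitted parameters."""
    return [scale * p + shift for p in pred]


# Hypothetical sparse correspondences: model output vs. known metric depth (meters).
# The metric values are constructed as 2.0 * pred + 0.3 for illustration only.
sparse_pred = [0.10, 0.35, 0.60, 0.90]
sparse_metric = [0.50, 1.00, 1.50, 2.10]

scale, shift = fit_scale_shift(sparse_pred, sparse_metric)
aligned = apply_scale_shift([0.20, 0.50], scale, shift)
```

With the parameters fitted once from the sparse points, the same `scale` and `shift` can be applied to every pixel of the predicted depth map. In practice a robust variant (for example, fitting inside a RANSAC loop) would be a reasonable choice when some of the sparse measurements are noisy.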

Taxonomy

Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Advanced Vision and Imaging