Robo3R: Enhancing Robotic Manipulation with Accurate Feed-Forward 3D Reconstruction
Sizhe Yang, Linning Xu, Hao Li, Juncheng Mu, Jia Zeng, Dahua Lin, Jiangmiao Pang

TL;DR
Robo3R is a real-time, feed-forward 3D reconstruction model that predicts accurate, metric-scale scene geometry from RGB images and robot states, with the aim of improving robotic manipulation.
Contribution
Introduces Robo3R, a novel 3D reconstruction approach that combines local geometry inference and camera pose refinement for manipulation-ready scene understanding.
Findings
Outperforms state-of-the-art reconstruction methods and depth sensors.
Enhances downstream tasks like grasp synthesis and motion planning.
Trained on a large synthetic dataset with 4 million frames.
Abstract
3D spatial perception is fundamental to generalizable robotic manipulation, yet obtaining reliable, high-quality 3D geometry remains challenging. Depth sensors suffer from noise and material sensitivity, while existing reconstruction models lack the precision and metric consistency required for physical interaction. We introduce Robo3R, a feed-forward, manipulation-ready 3D reconstruction model that predicts accurate, metric-scale scene geometry directly from RGB images and robot states in real time. Robo3R jointly infers scale-invariant local geometry and relative camera poses, which are unified into the scene representation in the canonical robot frame via a learned global similarity transformation. To meet the precision demands of manipulation, Robo3R employs a masked point head for sharp, fine-grained point clouds, and a keypoint-based Perspective-n-Point (PnP) formulation to refine…
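To make the "global similarity transformation" step concrete, here is a minimal sketch of what applying such a transform to a point cloud involves. It is not Robo3R's implementation: the function name `apply_similarity`, its parameters, and the numeric values are hypothetical and chosen only for illustration. In the paper the transform is learned; here it is simply given as a scale, a rotation, and a translation.

```python
import numpy as np

def apply_similarity(points, scale, rotation, translation):
    """Map Nx3 points into a target frame via x' = scale * R @ x + t.

    Hypothetical helper for illustration only; the parameters stand in for
    whatever similarity transform a model would predict.
    """
    points = np.asarray(points, dtype=float)
    rotation = np.asarray(rotation, dtype=float)
    translation = np.asarray(translation, dtype=float)
    # Row-vector convention: (R @ x) for each row x equals points @ R.T
    return scale * (points @ rotation.T) + translation

# Illustrative values (assumed, not taken from the paper):
# a 90-degree rotation about the z-axis, scale 2, and a small offset.
theta = np.pi / 2
R = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta),  np.cos(theta), 0.0],
    [0.0,            0.0,           1.0],
])
t = np.array([0.5, 0.0, 0.1])

# Two points expressed in a scale-invariant local frame.
local_points = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
])

robot_frame_points = apply_similarity(local_points, 2.0, R, t)
print(robot_frame_points)
```

A similarity transform differs from a rigid transform only by the scalar `scale`, which is what lets scale-invariant local geometry be placed in a metric frame.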
