Action-Geometry Prediction with 3D Geometric Prior for Bimanual Manipulation
Chongyang Xu, Haipeng Li, Shen Cheng, Jingyu Hu, Haoqiang Fan, Ziliang Feng, Shuaicheng Liu

TL;DR
This paper introduces a bimanual manipulation framework that leverages a pre-trained 3D geometric foundation model and diffusion-based prediction to improve spatial understanding and coordination from RGB images, outperforming 2D and point-cloud baselines.
Contribution
It proposes a new approach combining 3D geometric priors, semantic features, and diffusion models for predictive bimanual manipulation from RGB data.
Findings
Outperforms 2D and point-cloud baselines in simulation and real-world tasks.
Achieves state-of-the-art success rates in manipulation and coordination.
Demonstrates accurate 3D scene evolution prediction from RGB images.
Abstract
Bimanual manipulation requires policies that can reason about 3D geometry, anticipate how it evolves under action, and generate smooth, coordinated motions. However, existing methods typically rely on 2D features with limited spatial awareness, or require explicit point clouds that are difficult to obtain reliably in real-world settings. At the same time, recent 3D geometric foundation models show that accurate and diverse 3D structure can be reconstructed directly from RGB images in a fast and robust manner. We leverage this opportunity and propose a framework that builds bimanual manipulation directly on a pre-trained 3D geometric foundation model. Our policy fuses geometry-aware latents, 2D semantic features, and proprioception into a unified state representation, and uses a diffusion model to jointly predict a future action chunk and a future 3D latent that decodes into a dense…
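The fusion and joint prediction described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the concatenation-based fusion, the vector sizes, and the 14-dimensional action (7 values per arm) are all assumptions made for illustration, and the noising step is the generic DDPM forward process rather than anything specified in the paper.

```python
import math
import random

def fuse_state(geom_latent, sem_feat, proprio):
    """Hypothetical fusion: concatenate the three modalities into one state vector."""
    return list(geom_latent) + list(sem_feat) + list(proprio)

def joint_target(action_chunk, future_latent):
    """Flatten an action chunk (horizon x action_dim) and append the future 3D latent,
    so a single diffusion model can denoise both together."""
    flat_actions = [a for step in action_chunk for a in step]
    return flat_actions + list(future_latent)

def add_noise(x0, alpha_bar, rng):
    """Generic DDPM forward step: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
          for x, e in zip(x0, eps)]
    return xt, eps

def split_target(x, horizon, action_dim):
    """Inverse of joint_target: recover the action chunk and the future latent."""
    n = horizon * action_dim
    actions = [x[i * action_dim:(i + 1) * action_dim] for i in range(horizon)]
    return actions, x[n:]

# Assumed toy sizes, chosen only for illustration.
HORIZON, ACTION_DIM, LATENT_DIM = 4, 14, 8
rng = random.Random(0)

state = fuse_state([0.1] * 8, [0.2] * 6, [0.3] * 14)   # conditioning vector
chunk = [[0.5] * ACTION_DIM for _ in range(HORIZON)]    # future action chunk
latent = [1.0] * LATENT_DIM                             # future 3D latent
x0 = joint_target(chunk, latent)
xt, eps = add_noise(x0, alpha_bar=0.5, rng=rng)         # noised training sample
```

In a trained policy, a denoising network conditioned on `state` would be asked to recover `eps` (or `x0`) from `xt`; after sampling, `split_target` separates the predicted action chunk from the predicted future latent.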
Taxonomy
Topics: Robot Manipulation and Learning · 3D Shape Modeling and Analysis · Human Pose and Action Recognition
