Action-Geometry Prediction with 3D Geometric Prior for Bimanual Manipulation

Chongyang Xu; Haipeng Li; Shen Cheng; Jingyu Hu; Haoqiang Fan; Ziliang Feng; Shuaicheng Liu

arXiv:2602.23814 · cs.CV · March 2, 2026

TL;DR

This paper introduces a bimanual manipulation framework built on a pre-trained 3D geometric foundation model. It uses diffusion-based prediction to improve spatial understanding and coordination from RGB images, and it outperforms 2D and point-cloud baselines in simulation and real-world tasks.

Contribution

The paper proposes a bimanual manipulation policy that combines 3D geometric priors, 2D semantic features, and proprioception, and uses a diffusion model to jointly predict future actions and future 3D geometry from RGB data.

Findings

01. Outperforms 2D and point-cloud baselines in simulation and real-world tasks.

02. Achieves state-of-the-art success rates in manipulation and coordination.

03. Demonstrates accurate 3D scene evolution prediction from RGB images.

Abstract

Bimanual manipulation requires policies that can reason about 3D geometry, anticipate how it evolves under action, and generate smooth, coordinated motions. However, existing methods typically rely on 2D features with limited spatial awareness, or require explicit point clouds that are difficult to obtain reliably in real-world settings. At the same time, recent 3D geometric foundation models show that accurate and diverse 3D structure can be reconstructed directly from RGB images in a fast and robust manner. We leverage this opportunity and propose a framework that builds bimanual manipulation directly on a pre-trained 3D geometric foundation model. Our policy fuses geometry-aware latents, 2D semantic features, and proprioception into a unified state representation, and uses a diffusion model to jointly predict a future action chunk and a future 3D latent that decodes into a dense…
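The joint prediction described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not give the fusion mechanism, network, or noise schedule, so everything here is an assumption. The names `fuse_state`, `toy_denoiser`, and `sample_joint`, the dimensions, the plain concatenation, and the linear DDPM-style schedule are all hypothetical. The sketch only shows the general idea of conditioning on one fused state and denoising a single vector that holds both an action chunk and a future latent.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration (not taken from the paper).
HORIZON, ACTION_DIM, LATENT_DIM = 8, 14, 32
NUM_STEPS = 50  # assumed number of diffusion steps


def fuse_state(geo_latent, sem_feat, proprio):
    """Concatenate the three input streams into one state vector (assumed fusion)."""
    return np.concatenate([geo_latent.ravel(), sem_feat.ravel(), proprio.ravel()])


def toy_denoiser(x_t, state, t):
    """Placeholder for a learned noise predictor eps(x_t, state, t).

    It ignores its inputs and returns zeros, so samples are not meaningful;
    a trained network would go here.
    """
    return np.zeros_like(x_t)


def sample_joint(state, denoiser=toy_denoiser, rng=None):
    """DDPM-style ancestral sampling over a joint [action chunk | future latent] vector."""
    rng = np.random.default_rng(0) if rng is None else rng
    betas = np.linspace(1e-4, 2e-2, NUM_STEPS)  # assumed linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    joint_dim = HORIZON * ACTION_DIM + LATENT_DIM
    x = rng.standard_normal(joint_dim)  # start from pure Gaussian noise
    for t in reversed(range(NUM_STEPS)):
        eps = denoiser(x, state, t)
        # Posterior mean of the reverse step given the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # No noise is added at the final step.
        noise = rng.standard_normal(joint_dim) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise

    # Split the denoised vector back into its two parts.
    actions = x[: HORIZON * ACTION_DIM].reshape(HORIZON, ACTION_DIM)
    future_latent = x[HORIZON * ACTION_DIM:]
    return actions, future_latent
```

Calling `sample_joint(fuse_state(geo, sem, proprio))` returns an action chunk of shape `(HORIZON, ACTION_DIM)` and a latent of shape `(LATENT_DIM,)`. In the framework the abstract describes, the future latent would then be decoded into 3D geometry, a step this sketch leaves out.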

Taxonomy

Topics: Robot Manipulation and Learning · 3D Shape Modeling and Analysis · Human Pose and Action Recognition