R3DP: Real-Time 3D-Aware Policy for Embodied Manipulation

Yuhao Zhang; Wanxi Dong; Yue Shi; Yi Liang; Jingnan Gao; Qiaochu Yang; Yaxing Lyu; Zhixuan Liang; Yibin Liu; Congsheng Xu; Xianda Guo; Wei Sui; Yaohui Jin; Xiaokang Yang; Yanyan Xu; Yao Mu

arXiv:2603.14498 · cs.RO · March 31, 2026


TL;DR

R3DP introduces a real-time 3D-aware manipulation policy that efficiently integrates large-scale 3D priors using asynchronous modules, significantly improving success rates and reducing inference time in embodied manipulation tasks.

Contribution

The paper presents R3DP, a system that combines an asynchronous fast-slow collaboration module with multi-view fusion to bring large-scale 3D priors into manipulation policies without sacrificing real-time performance.

Findings

01. R3DP outperforms baselines with 32.9% and 51.4% higher success rates in different configurations.

02. It reduces inference time by 44.8% compared to naive integration methods.

03. The system effectively leverages temporal and multi-view data for improved manipulation performance.

Abstract

Embodied manipulation requires accurate 3D understanding of objects and their spatial relations to plan and execute contact-rich actions. While large-scale 3D vision models provide strong priors, their computational cost incurs prohibitive latency for real-time control. We propose Real-time 3D-aware Policy (R3DP), which integrates powerful 3D priors into manipulation policies without sacrificing real-time performance. A core innovation of R3DP is the asynchronous fast-slow collaboration module, which seamlessly integrates large-scale 3D priors into the policy without compromising real-time performance. The system maintains real-time efficiency by querying the pre-trained slow system (VGGT) only on sparse key frames, while simultaneously employing a lightweight Temporal Feature Prediction Network (TFPNet) to predict features for all intermediate frames. By leveraging historical data to…
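
The key-frame idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The class `KeyFrameScheduler`, the fixed `key_interval`, and both toy models are hypothetical stand-ins for illustration. The sketch is also synchronous for simplicity, whereas the paper describes an asynchronous module whose details are not given in this excerpt. It shows only the scheduling pattern: an expensive model runs on sparse key frames, and a cheap predictor supplies features for the frames in between.

```python
from typing import Callable, List, Optional, Sequence

Feature = List[float]


class KeyFrameScheduler:
    """Hypothetical scheduler: slow model on key frames, fast predictor otherwise."""

    def __init__(
        self,
        slow_model: Callable[[Sequence[float]], Feature],
        fast_predictor: Callable[[Feature, Sequence[float]], Feature],
        key_interval: int = 4,  # assumed fixed spacing; the excerpt does not state one
    ) -> None:
        if key_interval < 1:
            raise ValueError("key_interval must be >= 1")
        self.slow_model = slow_model
        self.fast_predictor = fast_predictor
        self.key_interval = key_interval
        self.last_key_feature: Optional[Feature] = None
        self.slow_calls = 0
        self.fast_calls = 0

    def step(self, frame_idx: int, frame: Sequence[float]) -> Feature:
        """Return a feature for this frame, calling the slow model only on key frames."""
        is_key = frame_idx % self.key_interval == 0
        if is_key or self.last_key_feature is None:
            self.last_key_feature = self.slow_model(frame)
            self.slow_calls += 1
            return self.last_key_feature
        self.fast_calls += 1
        return self.fast_predictor(self.last_key_feature, frame)


def toy_slow_model(frame: Sequence[float]) -> Feature:
    # Stand-in for an expensive pre-trained 3D model (not a real model).
    return [2.0 * x for x in frame]


def toy_fast_predictor(key_feature: Feature, frame: Sequence[float]) -> Feature:
    # Stand-in for a lightweight predictor conditioned on the last key-frame feature.
    return [0.5 * k + x for k, x in zip(key_feature, frame)]


def run_demo(num_frames: int = 8, key_interval: int = 4):
    """Run the scheduler over synthetic frames and report call counts."""
    scheduler = KeyFrameScheduler(toy_slow_model, toy_fast_predictor, key_interval)
    frames = [[float(i), float(i) + 1.0] for i in range(num_frames)]
    features = [scheduler.step(i, f) for i, f in enumerate(frames)]
    return scheduler.slow_calls, scheduler.fast_calls, features
```

With 8 frames and a key frame every 4 frames, the slow model runs twice and the fast predictor six times. In this simplified synchronous picture, the cost of the slow model is amortized over each interval, which is the intuition behind running it only on sparse key frames.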
