R3D: Revisiting 3D Policy Learning
Zhengdong Hong, Shenrui Wu, Haozhe Cui, Boyi Zhao, Ran Ji, Yiyang He, Hangxing Zhang, Zundong Ke, Jun Wang, Guofeng Zhang, Jiayuan Gu

TL;DR
This paper introduces R3D, a scalable transformer-based 3D policy learning architecture. It improves training stability and manipulation performance by addressing two diagnosed failure causes (missing 3D data augmentation and adverse Batch Normalization effects) and by leveraging large-scale pre-training.
Contribution
The work proposes a novel architecture combining a transformer-based 3D encoder with a diffusion decoder, specifically designed for stability and large-scale pre-training in 3D policy learning.
Findings
Outperforms state-of-the-art 3D baselines on manipulation benchmarks
Identifies the omission of 3D data augmentation and the adverse effects of Batch Normalization as primary causes of training instability and overfitting
Establishes a new robust foundation for scalable 3D imitation learning
Abstract
3D policy learning promises superior generalization and cross-embodiment transfer, but progress has been hindered by training instabilities and severe overfitting, precluding the adoption of powerful 3D perception models. In this work, we systematically diagnose these failures, identifying the omission of 3D data augmentation and the adverse effects of Batch Normalization as primary causes. We propose a new architecture coupling a scalable transformer-based 3D encoder with a diffusion decoder, engineered specifically for stability at scale and designed to leverage large-scale pre-training. Our approach significantly outperforms state-of-the-art 3D baselines on challenging manipulation benchmarks, establishing a new and robust foundation for scalable 3D imitation learning. Project Page: https://r3d-policy.github.io/
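The abstract names 3D data augmentation as a missing ingredient but does not say which augmentations are used. As a minimal sketch only, assuming a point-cloud observation and a Cartesian action target, the following shows one common form: a random yaw rotation and uniform scale applied identically to the observation and the action, plus per-point jitter on the observation. The function name `augment_scene`, its parameters, and the default ranges are hypothetical and are not taken from the paper.

```python
import math
import random


def augment_scene(points, action_xyz, rng,
                  max_yaw=math.pi / 12,
                  scale_range=(0.9, 1.1),
                  jitter_std=0.005):
    """Hypothetical 3D augmentation for imitation learning (illustrative only).

    points:     list of (x, y, z) tuples, the observed point cloud
    action_xyz: (x, y, z) target position, in the same frame as the points
    rng:        a random.Random instance, so results are reproducible
    """
    # Sample one global transform per training sample.
    yaw = rng.uniform(-max_yaw, max_yaw)
    scale = rng.uniform(*scale_range)
    c, s = math.cos(yaw), math.sin(yaw)

    def transform(p):
        # Rotate about the z-axis, then scale uniformly.
        x, y, z = p
        return (scale * (c * x - s * y),
                scale * (s * x + c * y),
                scale * z)

    # Observation: global transform plus independent Gaussian jitter per coordinate.
    new_points = [
        tuple(v + rng.gauss(0.0, jitter_std) for v in transform(p))
        for p in points
    ]
    # Action: same global transform, no jitter, so the label stays
    # geometrically consistent with the transformed observation.
    new_action = transform(action_xyz)
    return new_points, new_action


if __name__ == "__main__":
    cloud = [(0.4, 0.0, 0.1), (0.5, 0.1, 0.1), (0.45, -0.05, 0.2)]
    target = (0.45, 0.0, 0.15)
    aug_cloud, aug_target = augment_scene(cloud, target, random.Random(0))
    print(len(aug_cloud), len(aug_target))
```

The design point this sketch illustrates is that a geometric augmentation of the observation has to be mirrored in the action label; otherwise the policy is trained on inconsistent pairs. Whether R3D uses these particular transforms is not stated in the abstract.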
