RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, Jun Zhu

TL;DR
This paper introduces RDT-1B, a large diffusion-based foundation model for bimanual robotic manipulation, capable of zero-shot generalization, language understanding, and learning from few demonstrations, addressing data scarcity and multi-modality challenges.
Contribution
The paper presents the first large-scale diffusion foundation model for bimanual manipulation, with a novel scalable Transformer design and a unified action space that supports transferability and interpretability.
Findings
RDT-1B scales up to 1.2 billion parameters, making it the largest diffusion-based foundation model for robotic manipulation.
RDT-1B outperforms existing methods in real robot experiments.
RDT-1B demonstrates zero-shot generalization and few-shot learning capabilities.
Abstract
Bimanual manipulation is essential in robotics, yet developing foundation models is extremely challenging due to the inherent complexity of coordinating two robot arms (leading to multi-modal action distributions) and the scarcity of training data. In this paper, we present the Robotics Diffusion Transformer (RDT), a pioneering diffusion foundation model for bimanual manipulation. RDT builds on diffusion models to effectively represent multi-modality, with innovative designs of a scalable Transformer to deal with the heterogeneity of multi-modal inputs and to capture the nonlinearity and high frequency of robotic data. To address data scarcity, we further introduce a Physically Interpretable Unified Action Space, which can unify the action representations of various robots while preserving the physical meanings of original actions, facilitating the learning of transferable physical knowledge.…
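The idea of a unified action space that keeps physical meaning can be illustrated with a minimal sketch. This is not the authors' implementation: the slot names, index ranges, and the 18-dimensional size below are hypothetical, chosen only to show how robots with different action dimensions could share one vector, with unused entries padded and flagged by a mask.

```python
from typing import Dict, List, Tuple

# Hypothetical slot layout: each physical quantity owns a fixed index range.
# Names and sizes are illustrative assumptions, not the paper's actual layout.
UNIFIED_SLOTS: Dict[str, Tuple[int, int]] = {
    "left_arm_joint_pos": (0, 7),
    "left_gripper_open": (7, 8),
    "right_arm_joint_pos": (8, 15),
    "right_gripper_open": (15, 16),
    "base_linear_vel": (16, 18),
}
UNIFIED_DIM = 18


def to_unified(action: Dict[str, List[float]]) -> Tuple[List[float], List[int]]:
    """Place a robot-specific action into the shared vector.

    Slots the robot does not have stay at 0.0 (padding); the returned
    0/1 mask marks which entries carry real values.
    """
    vec = [0.0] * UNIFIED_DIM
    mask = [0] * UNIFIED_DIM
    for name, values in action.items():
        start, end = UNIFIED_SLOTS[name]
        if len(values) != end - start:
            raise ValueError(f"{name} expects {end - start} values, got {len(values)}")
        vec[start:end] = values
        mask[start:end] = [1] * (end - start)
    return vec, mask


def from_unified(vec: List[float], names: List[str]) -> Dict[str, List[float]]:
    """Read back only the slots that a given robot actually uses."""
    return {n: vec[UNIFIED_SLOTS[n][0]:UNIFIED_SLOTS[n][1]] for n in names}


# A single-arm robot fills only the right-arm slots; the rest is padding.
single_arm = {"right_arm_joint_pos": [0.1] * 7, "right_gripper_open": [1.0]}
vec, mask = to_unified(single_arm)
recovered = from_unified(vec, list(single_arm))
```

Because every index always refers to the same physical quantity, a single-arm dataset and a bimanual dataset can be mixed in one training batch, and the mask lets a loss or a decoder ignore the padded entries.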
Peer Reviews
Decision: ICLR 2025 Poster
The paper presents a complete and remarkable research work that pushes forward the boundary of large-scale robot learning.
- The model is developed on top of the diffusion transformer with a unified action space, which allows large-scale pretraining on heterogeneous robot data to boost the performance
- The authors collect the largest robot dataset for bimanual manipulation with comprehensive task coverage for fine-tuning the model
- The experiments show that the advantage of the model from a f
- While the paper demonstrates that the foundation model allows zero-shot and few-shot generalization and can achieve dexterous manipulation, each of these characteristics is validated on only about one task, which may be insufficient. Evaluations on more tasks and on existing benchmark tasks would complete the results.
- It seems that the baselines are not trained on the complete fine-tuning dataset, so this is not an apples-to-apples comparison.
- The writing of the paper has room for improvement. So
This paper demonstrates strong performance in scaling up robotics models. It presents several interesting components that improve training stability and performance. The unified action space, and especially the padding technique, is interesting. The paper shows capabilities on several challenging real-world bimanual manipulation tasks.
Several claims are not very precise and not very clear. For example, the authors mention the nonlinearity and high frequency of robotic data. While it is true that the data is nonlinear, how does the proposed method tackle this challenge? The authors argue that changing the last linear layer to an MLP block solves this problem and brings significant performance improvements. While the performance is impressive, I think this requires more careful ablation experiments. Firstly, the entire diffusio
- The paper introduces a novel application of diffusion models to bimanual manipulation, addressing the high-dimensional, multi-modal action space through a Physically Interpretable Unified Action Space. This approach is a creative extension of diffusion models in robotics, particularly for dual-arm coordination, a challenging domain with limited prior work.
- The model is rigorously tested, with comprehensive experiments demonstrating superior performance over existing baselines. The use of th
- The paper introduces a Physically Interpretable Unified Action Space for handling data heterogeneity, but additional details on potential limitations or failure cases during training with highly diverse data would be beneficial. This could include examples where action standardization might lead to loss of unique features across robots.
- Although the experiments show impressive results, expanding the evaluation to more varied and complex real-world tasks (beyond the 6,000-episode dataset) an
Taxonomy
Topics: Mechanics and Biomechanics Studies · Robotic Mechanisms and Dynamics · Muscle activation and electromyography studies
Methods: Attention Is All You Need · Dense Connections · Residual Connection · Position-Wise Feed-Forward Layer · Adam · Linear Layer · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings
