Diffusion Transformer Policy
Zhi Hou, Tianyi Zhang, Yuwen Xiong, Hengjun Pu, Chengyang Zhao, Ronglei Tong, Yu Qiao, Jifeng Dai, Yuntao Chen

TL;DR
This paper introduces Diffusion Transformer Policy, a novel approach that models continuous robot actions with a large transformer to improve generalization across diverse datasets and environments.
Contribution
It proposes a large multi-modal diffusion transformer for continuous action modeling, surpassing small action heads in handling diverse action spaces and improving generalization.
Findings
Achieves state-of-the-art performance on multiple benchmarks.
Demonstrates effective generalization to real-world robot tasks.
Improves task completion rates and success sequence lengths.
Abstract
Recent large vision-language-action models pretrained on diverse robot datasets have demonstrated the potential to generalize to new environments with a small amount of in-domain data. However, those approaches usually predict individual discretized or continuous actions with a small action head, which limits their ability to handle diverse action spaces. In contrast, we model the continuous action sequence with a large multi-modal diffusion transformer, dubbed Diffusion Transformer Policy, in which we directly denoise action chunks with a large transformer model rather than using a small action head on an action embedding. By leveraging the scaling capability of transformers, the proposed approach can effectively model continuous end-effector actions across large, diverse robot datasets and achieve better generalization performance. Extensive experiments demonstrate the effectiveness and generalization…
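The idea of denoising a whole action chunk, rather than predicting one action from a small head, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the standard DDPM reverse update, the linear noise schedule, the 8-step by 7-dimensional chunk shape, and the `oracle_noise_predictor` function, which is a hypothetical stand-in for the large transformer that would be conditioned on image and language tokens.

```python
import numpy as np


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule (an assumed choice, not from the paper)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def oracle_noise_predictor(noisy_chunk, step, target_chunk, alpha_bars):
    """Hypothetical stand-in for the learned transformer denoiser.

    It returns the exact noise that would map `target_chunk` to
    `noisy_chunk` at this step. A real policy would predict this noise
    from observations and the instruction instead of knowing the target.
    """
    a_bar = alpha_bars[step]
    return (noisy_chunk - np.sqrt(a_bar) * target_chunk) / np.sqrt(1.0 - a_bar)


def denoise_action_chunk(predict_noise, chunk_shape, num_steps=50, seed=0):
    """Run the DDPM reverse process over a whole action chunk at once."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    chunk = rng.standard_normal(chunk_shape)  # start from pure Gaussian noise
    for step in reversed(range(num_steps)):
        eps_hat = predict_noise(chunk, step, alpha_bars)
        # Posterior mean of the previous (less noisy) chunk.
        mean = (chunk - betas[step] / np.sqrt(1.0 - alpha_bars[step]) * eps_hat)
        mean = mean / np.sqrt(alphas[step])
        if step > 0:
            chunk = mean + np.sqrt(betas[step]) * rng.standard_normal(chunk_shape)
        else:
            chunk = mean  # no noise is added at the final step
    return chunk


# Toy target: 8 future timesteps of a 7-dimensional end-effector action
# (the chunk length and action dimension are illustrative assumptions).
target = np.tile(np.linspace(-1.0, 1.0, 7), (8, 1))


def predictor(noisy_chunk, step, alpha_bars):
    return oracle_noise_predictor(noisy_chunk, step, target, alpha_bars)


actions = denoise_action_chunk(predictor, target.shape)
```

Because the stand-in predictor is an oracle, the loop recovers `target` up to floating-point error; with a learned model the same loop would instead sample an action chunk conditioned on the current observation and instruction.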
Peer Reviews
Decision·ICLR 2025 Conference Withdrawn Submission
- Extensive experimental results provided in several simulation environments and real-world tasks
- Reasonable set of baselines for comparison
- Reports state-of-the-art performance on the Calvin benchmark
- Clearly states training hyperparameters and experimental design needed to replicate the experiments
- Several typos and sentence style issues throughout the paper
- Motivation for the architecture modification is not clearly stated and is not clear from reading the introduction.
- Figure 2 is easy to follow, but visually poorly constructed.
- Action tokenization procedure is unclear. Are the continuous actions being converted into discrete bins, as in some of the prior works? If not, then what is action tokenization in Section 3.1?
- Some architectural design choices are not motivated, i.e
1. The paper utilizes DiT in an in-context conditioning style for continuous action chunk denoising, incorporating the strong scaling ability of Transformers for generalist robot policy learning. While several concurrent works also use DiT to improve the manipulation policy, the idea has its novelty. The policy architecture is also well explained and illustrated.
2. Thorough ablations on trajectory length, observation length and execution steps.
1. **Real-World Experiments**.
   - **Task Setting**. The task settings are too easy, with only “picking” operations. Although the authors claim that it is a challenging setup with small objects (L370-371), I believe that only picking small objects would not be so difficult. I would suggest adding real-world manipulation experiments besides picking (and pick-and-place), *e.g.*, open drawer/door, pouring (with rotation actions), and long-horizon tasks, etc.
   - **Few-Shot Setting**. Collecti
The writing and organization of this paper are good. The simulation experiments on Calvin and Maniskill are solid.
Pre-training on the Open-X dataset is a significant advantage of this paper. However, the real-world evaluation is too simple, covering only the picking skill. This is not sufficient to prove the effectiveness of the method compared to OpenVLA and Octo. It would be better if the evaluation scenarios included additional skills. The paper also lacks parameter-scaling experiments for the causal Transformer component. For example, the impact of different amounts of parameters and
The paper shows remarkable generalization capabilities across different simulated and real-world environments. Using a large-scale Transformer model for action denoising demonstrates superior scalability compared to previous small-scale MLP-based denoising models and discretized actions. It shows significant advantages over existing methods, even when facing changes in camera perspectives or environmental variations.
I greatly respect the experimental design and the impressive generalization results presented in this paper. This work offers meaningful contributions on the engineering front. However, I have some questions regarding the academic novelty of the research. Regarding the entire pipeline, employing a causal Transformer to accept both text tokens and image tokens as multimodal inputs is an approach that has been extensively used in prior research. Moreover, this paper's encoding of natural langu
Taxonomy
Topics: Renewable energy and sustainable power systems · EU Law and Policy Analysis · Electric Power System Optimization
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Residual Connection · Position-Wise Feed-Forward Layer · Dense Connections · Softmax · Multi-Head Attention · Adam · Dropout
