Autoregressive Action Sequence Learning for Robotic Manipulation
Xinyu Zhang, Yuhan Liu, Haonan Chang, Liam Schramm, Abdeslam Boularias

TL;DR
This paper introduces a novel autoregressive sequence modeling approach for robotic manipulation, enabling a universal, efficient, and flexible policy architecture that outperforms existing methods across diverse tasks.
Contribution
We propose the Chunking Causal Transformer (CCT) and the Autoregressive Policy (ARP), which together support hybrid action sequences and improve performance across various robotic manipulation environments.
Findings
ARP matches or outperforms environment-specific state-of-the-art methods
ARP is more computationally efficient and has fewer parameters than those methods
Effective across diverse tasks including Push-T, ALOHA, and RLBench
Abstract
Designing a universal policy architecture that performs well across diverse robots and task configurations remains a key challenge. In this work, we address this by representing robot actions as sequential data and generating actions through autoregressive sequence modeling. Existing autoregressive architectures generate end-effector waypoints sequentially as word tokens in language modeling, which are limited to low-frequency control tasks. Unlike language, robot actions are heterogeneous and often include continuous values -- such as joint positions, 2D pixel coordinates, and end-effector poses -- which are not easily suited for language-based modeling. Based on this insight, we introduce a straightforward enhancement: we extend causal transformers' single-token prediction to support predicting a variable number of tokens in a single step through our Chunking Causal Transformer (CCT).…
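The chunked decoding idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: `generate_chunked`, `toy_predictor`, and the `chunk_sizes` schedule are hypothetical names introduced here, and the learned transformer is replaced by a toy function so that the control flow is visible. The only point it shows is that each autoregressive step may emit a variable number of tokens, conditioned on everything generated so far.

```python
from typing import Callable, List, Sequence

Token = float  # assumption: one scalar action value per token


def generate_chunked(
    predict_chunk: Callable[[Sequence[Token], int], List[Token]],
    prompt: Sequence[Token],
    chunk_sizes: Sequence[int],
) -> List[Token]:
    """Extend `prompt` autoregressively, emitting chunk_sizes[i] tokens at step i.

    `predict_chunk(prefix, k)` stands in for one forward pass of a model
    that returns the next k tokens given the current prefix.
    """
    seq = list(prompt)
    for k in chunk_sizes:
        new_tokens = predict_chunk(seq, k)
        if len(new_tokens) != k:
            raise ValueError(f"expected {k} tokens, got {len(new_tokens)}")
        # The whole chunk is appended before the next step conditions on it.
        seq.extend(new_tokens)
    return seq[len(prompt):]


def toy_predictor(prefix: Sequence[Token], k: int) -> List[Token]:
    """Hypothetical stand-in for a learned model: continues a ramp."""
    last = prefix[-1] if prefix else 0.0
    return [last + i + 1 for i in range(k)]


# Six action tokens produced in three forward passes (chunks of 1, 2, 3).
actions = generate_chunked(toy_predictor, [0.0], chunk_sizes=[1, 2, 3])

# Single-token prediction is the special case where every chunk has size 1.
single_step = generate_chunked(toy_predictor, [0.0], chunk_sizes=[1] * 6)
```

In this sketch a chunk size of 1 everywhere recovers ordinary next-token decoding, while larger chunks reduce the number of forward passes needed for the same sequence length.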
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
- Developing better architectures for policy learning can have a large impact in practice
- The paper describes the proposed architecture clearly and with detailed diagrams
- The proposed approach outperforms the baselines in tested settings

- The novelty of the proposed architecture is limited. The overall autoregressive architecture is fairly standard. The main architectural change is the use of empty tokens for multi-step prediction.
- The empirical results are limited to relatively simple settings in simulation. It is a bit hard to draw conclusions about robotic manipulation performance based on these experiments alone.
- Related work can be improved. It would be good to focus more on architectural changes which are the main foc

- The proposed approach to action sequence prediction looks new and interesting. While recent studies have pursued multi-token prediction or autoregressive generation separately, the potential benefits of their combination are underexplored and worth studying.
- The analysis of design choices, such as chunk size, autoregression, and hierarchical structure, is extensive and clarifies the core contributions in general.

- The technical insights are somewhat limited. In particular, Table 1 shows autoregression outperforms multi-token prediction; why is multi-token prediction still used? Further insight into their combined benefit could strengthen the contribution.
- The chunk size analysis in Section 4.2 is difficult to interpret. The results in scenarios 2,3,4 are contrary to recent literature [1,2], where longer chunks are often favored over no chunks for temporal dependencies. Additional details, e.g., defaul

1. Multi-token prediction improves inference speed and accuracy.
2. Attention interleaving allows the model to be trained with teacher forcing.
3. A real-world environment is also used for evaluating the model.

1. Multi-token prediction for action chunks has been explored in many previous works, such as their cited Diffusion Policy and MDT (Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals). I acknowledge that they are diffusion-based policies, but it shows that the multi-token prediction scheme has been explored before and thus is not very novel.
2. Teacher-forcing is a common training method for sequence-to-sequence models. The fact that ARP needs special attention i
The paper is well-written and easy to follow. The article includes ample details of experiments and methods to aid in understanding and reproducibility. The paper conducts numerous analytical experiments, qualitative experiments, and discussions to attempt to analyze the performance changes brought about by each module in the method, as well as to explain the working mechanisms of the autoregressive pattern under specific conditions (e.g., prediction with human guidance). From the experimental
The main weaknesses of this paper lie in its limited novelty and the unclear motivation and contributions compared to related work. Firstly, transforming the output from a single token to multiple tokens cannot be considered a significant contribution, as many existing models are also capable of multi-step output. And the effectiveness of action chunking has been explored in many cases such as Diffusion Policy and ACT. Secondly, while the paper proposes an auto-regressive design, it does not
Taxonomy
Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Hand Gesture Recognition Systems
Methods: Attention Is All You Need · Dense Connections · Adam · Linear Layer · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings
