Physical Autoregressive Model for Robotic Manipulation without Action Pretraining
Zijian Song, Sihan Qin, Tianshui Chen, Liang Lin, Guangrun Wang

TL;DR
This paper introduces a Physical Autoregressive Model (PAR) that uses video pretraining to model physical dynamics in robotic manipulation without action pretraining. It reaches 100% success on PushCube in the ManiSkill benchmark and predicts future videos with aligned action trajectories.
Contribution
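The paper presents PAR, a model that combines frames and actions into physical tokens and relies on video pretraining, not action pretraining, for manipulation tasks. It also introduces a DiT-based de-tokenizer that models frames and actions as continuous tokens, and improves performance and efficiency with a causal mask with inverse kinematics, parallel training, and a KV-cache.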
Findings
Achieves 100% success on PushCube in the ManiSkill benchmark.
Matches action-pretrained baselines on other tasks.
Accurately predicts future videos with aligned action trajectories.
Abstract
The scarcity of manipulation data has motivated the use of pretrained large models from other modalities in robotics. In this work, we build upon autoregressive video generation models to propose a Physical Autoregressive Model (PAR), where physical tokens combine frames and actions to represent the joint evolution of the robot and its environment. PAR leverages the world knowledge embedded in video pretraining to understand physical dynamics without requiring action pretraining, enabling accurate video prediction and consistent action trajectories. It also adopts a DiT-based de-tokenizer to model frames and actions as continuous tokens, mitigating quantization errors and facilitating mutual enhancement. Furthermore, we incorporate a causal mask with inverse kinematics, parallel training, and the KV-cache mechanism to further improve performance and efficiency. Experiments on the…
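The idea of physical tokens rolled out autoregressively can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the tuple layout of a token, and the toy predictor are hypothetical, and the mask is a plain lower-triangular causal mask, not the paper's causal mask with inverse kinematics.

```python
# Illustrative sketch only: names and data layout are hypothetical,
# not taken from the paper's implementation.

def make_physical_tokens(frames, actions):
    """Pair each frame with its action, giving one token per timestep."""
    if len(frames) != len(actions):
        raise ValueError("frames and actions must have the same length")
    return [(frame, action) for frame, action in zip(frames, actions)]


def causal_mask(num_steps):
    """mask[i][j] is True when step i may attend to step j (j <= i).

    This is an ordinary causal mask; the paper's variant with
    inverse kinematics is not reproduced here.
    """
    return [[j <= i for j in range(num_steps)] for i in range(num_steps)]


def rollout(predict_next, first_token, horizon):
    """Autoregressive rollout: each new token depends on all earlier ones."""
    tokens = [first_token]
    for _ in range(horizon):
        tokens.append(predict_next(tokens))
    return tokens


if __name__ == "__main__":
    tokens = make_physical_tokens(["frame0", "frame1"], [[0.0], [0.1]])
    print(tokens)
    print(causal_mask(3))
    # Toy predictor standing in for the model: returns the history length.
    print(rollout(lambda history: len(history), 0, 3))
```

In a real model, `predict_next` would be the transformer plus de-tokenizer, and a KV-cache would avoid recomputing attention over earlier tokens at each step of the rollout.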
