TL;DR
AR-VLA introduces an autoregressive action model for vision-language-action (VLA) tasks that keeps its own action history, enabling context-aware, smooth, and scalable robotic policy generation.
Contribution
The paper presents a standalone autoregressive Action Expert that maintains its own action history, addresses the frequency mismatch between fast control and slow reasoning, and integrates modularly with heavy perception backbones.
Findings
Outperforms reactive VLAs in smoothness and history awareness.
Effectively replaces chunk-based action heads in manipulation tasks.
Maintains or exceeds state-of-the-art task success rates.
Abstract
We propose a standalone autoregressive (AR) Action Expert that generates actions as a continuous causal sequence while conditioning on refreshable vision-language prefixes. In contrast to existing Vision-Language-Action (VLA) models and diffusion policies that reset temporal context with each new observation and predict actions reactively, our Action Expert maintains its own history through a long-lived memory and is inherently context-aware. This structure addresses the frequency mismatch between fast control and slow reasoning. It enables efficient independent pretraining of kinematic syntax and modular integration with heavy perception backbones, and it naturally ensures spatio-temporally consistent action generation across frames. To synchronize these asynchronous hybrid V-L-A modalities, we utilize a re-anchoring mechanism that mathematically accounts for perception staleness during both…
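The frequency mismatch described in the abstract can be illustrated with a toy control loop. The sketch below is not the paper's model: `ToyActionExpert`, its scalar `prefix`, the `smoothing` rule, and the `run` helper are all hypothetical stand-ins chosen for illustration. It shows only the structural idea that the action history persists across perception refreshes while the prefix is swapped at a slower rate. The `staleness` counter merely records how many control steps have elapsed since the last refresh; the re-anchoring mechanism itself is not specified in the excerpt and is not implemented here.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyActionExpert:
    """Hypothetical stand-in for an autoregressive action head (not the paper's model)."""
    history_len: int = 8
    smoothing: float = 0.5
    prefix: float = 0.0   # stand-in for a vision-language prefix (here: a scalar target)
    staleness: int = 0    # control steps since the prefix was last refreshed
    history: deque = field(init=False)

    def __post_init__(self):
        # Long-lived action memory; it is never cleared when perception refreshes.
        self.history = deque(maxlen=self.history_len)

    def refresh_prefix(self, prefix: float) -> None:
        # New perception arrives: swap the prefix, keep the action history.
        self.prefix = prefix
        self.staleness = 0

    def step(self) -> float:
        # The next action depends on the previous action and the current prefix.
        prev = self.history[-1] if self.history else 0.0
        action = prev + self.smoothing * (self.prefix - prev)
        self.history.append(action)
        self.staleness += 1
        return action


def run(targets, steps_per_refresh=4):
    """Run several fast control steps per slow perception refresh."""
    expert = ToyActionExpert()
    actions = []
    for target in targets:
        expert.refresh_prefix(target)
        for _ in range(steps_per_refresh):
            actions.append(expert.step())
    return expert, actions


if __name__ == "__main__":
    expert, actions = run([1.0, 0.0], steps_per_refresh=2)
    print(actions, expert.staleness)
```

Because the history survives each refresh, the first action after a new prefix continues from the last emitted action instead of restarting from zero, which is the behavior a reactive policy that resets its temporal context would lack.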
