AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models

Yutong Hu; Jan-Nico Zaech; Nikolay Nikolov; Yuanqi Yao; Sombit Dey; Giuliano Albanese; Renaud Detry; Luc Van Gool; Danda Paudel

arXiv:2603.10126·cs.RO·May 12, 2026

TL;DR

AR-VLA introduces an autoregressive action model for vision-language-action tasks that keeps its own action history, enabling context-aware, smooth, and scalable robotic policy generation.

Contribution

It presents a standalone autoregressive Action Expert that maintains its own action history, addresses the frequency mismatch between fast control and slow reasoning, and integrates modularly with perception backbones.

Findings

01 Outperforms reactive VLAs in smoothness and history awareness.

02 Effectively replaces chunk-based action heads in manipulation tasks.

03 Maintains or exceeds state-of-the-art task success rates.

Abstract

We propose a standalone autoregressive (AR) Action Expert that generates actions as a continuous causal sequence while conditioning on refreshable vision-language prefixes. In contrast to existing Vision-Language-Action (VLA) models and diffusion policies that reset temporal context with each new observation and predict actions reactively, our Action Expert maintains its own history through a long-lived memory and is inherently context-aware. This structure addresses the frequency mismatch between fast control and slow reasoning, enabling efficient independent pretraining of kinematic syntax and modular integration with heavy perception backbones, naturally ensuring spatio-temporally consistent action generation across frames. To synchronize these asynchronous hybrid V-L-A modalities, we utilize a re-anchoring mechanism that mathematically accounts for perception staleness during both…
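The interplay between a fast action loop and a slowly refreshed perception prefix can be illustrated with a toy sketch. This is not the paper's model: `ToyActionExpert`, its scalar `prefix`, the blending rule in `step`, and the `run` helper are all hypothetical stand-ins chosen for illustration. The sketch only shows the general pattern the abstract describes, where the action history survives each perception refresh, and it tracks how many control steps old the prefix is. It does not implement the re-anchoring mechanism, whose details are not given in the text above.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyActionExpert:
    """Toy stand-in for an autoregressive action head (hypothetical, not the paper's model)."""
    max_history: int = 8
    history: deque = field(default_factory=deque)
    prefix: float = 0.0   # hypothetical scalar standing in for a vision-language prefix
    staleness: int = 0    # control steps since the prefix was last refreshed

    def refresh_prefix(self, new_prefix: float) -> None:
        # New perception arrives: swap the prefix but keep the action history.
        self.prefix = new_prefix
        self.staleness = 0

    def step(self) -> float:
        # The next action depends on the previous action (causal history)
        # and on the current, possibly stale, prefix.
        prev = self.history[-1] if self.history else 0.0
        action = prev + 0.5 * (self.prefix - prev)
        self.history.append(action)
        if len(self.history) > self.max_history:
            self.history.popleft()
        self.staleness += 1
        return action


def run(control_steps, perception_period, targets):
    """Run the fast control loop; refresh the prefix every `perception_period` steps."""
    expert = ToyActionExpert()
    actions, staleness_log = [], []
    for t in range(control_steps):
        if t % perception_period == 0:
            expert.refresh_prefix(targets[(t // perception_period) % len(targets)])
        staleness_log.append(expert.staleness)
        actions.append(expert.step())
    return actions, staleness_log


if __name__ == "__main__":
    actions, staleness_log = run(control_steps=6, perception_period=3, targets=[1.0, 0.0])
    print(actions)
    print(staleness_log)
```

In this toy, the action emitted right after a prefix refresh continues from the previous action rather than restarting from zero, which is the contrast with a policy that resets its temporal context at every new observation.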

Code & Models

Repositories

https://arvla.insait.ai
