BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning

Ruiheng Wang; Shuanghao Bai; Haoran Zhang; Badong Chen; Xiangyu Xu

arXiv:2605.13382·cs.RO·May 14, 2026


TL;DR

BlockVLA introduces a block diffusion framework that accelerates autoregressive vision-language-action models. By combining global causal coherence across blocks with parallel generation within each block, it speeds up inference and makes training more efficient on robotic tasks.

Contribution

BlockVLA adapts pretrained autoregressive backbones into an efficient discrete diffusion policy using block diffusion, reducing inference latency and speeding up training in robotic applications.

Findings

01. Achieves 3.3× inference acceleration over standard diffusion baselines.

02. Converges faster in training, especially in complex, long-horizon tasks.

03. Demonstrates superior performance on LIBERO and SimplerEnv benchmarks.

Abstract

While autoregressive (AR) Vision-Language-Action (VLA) models have demonstrated formidable reasoning capabilities in robotic tasks, their sequential decoding process often incurs high inference latency and may amplify error accumulation during long-horizon execution. Discrete Diffusion Language Models (dLLMs) provide a promising alternative through parallel token refinement, but their practical deployment in robotics remains limited by repeated denoising function evaluations (NFEs) and the difficulty of directly applying standard KV caching to bidirectional iterative decoding. To bridge these paradigms, we propose BlockVLA, a framework that adapts pretrained AR backbones into an efficient discrete diffusion policy through a block diffusion paradigm. BlockVLA maintains autoregressive dependencies at the block level while enabling parallel denoising within each block, thereby combining…
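
The attention pattern implied by "autoregressive dependencies at the block level" with "parallel denoising within each block" can be illustrated with a block-causal mask. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `block_causal_mask`, the fixed block size, and the use of NumPy are all illustrative choices, and the paper's actual masking scheme (in particular during training) may differ.

```python
import numpy as np


def block_causal_mask(num_tokens: int, block_size: int) -> np.ndarray:
    """Build a boolean attention mask for block-wise generation.

    mask[i, j] is True when token i is allowed to attend to token j.
    Hypothetical helper for illustration only; not taken from the paper.
    """
    if num_tokens <= 0 or block_size <= 0:
        raise ValueError("num_tokens and block_size must be positive")

    # Assign each token position to a block: 0, 0, ..., 1, 1, ...
    block_ids = np.arange(num_tokens) // block_size

    # A token sees every token in its own block (bidirectional, so the
    # block can be denoised in parallel) and every token in earlier
    # blocks (causal across blocks, so earlier blocks can be cached).
    return block_ids[:, None] >= block_ids[None, :]


# Example: 6 action tokens grouped into blocks of 2.
mask = block_causal_mask(num_tokens=6, block_size=2)
print(mask.astype(int))
```

Two limiting cases help place this between the two paradigms the abstract contrasts: with `block_size=1` the mask reduces to the ordinary lower-triangular causal mask of AR decoding, and with `block_size=num_tokens` it becomes fully bidirectional, as in a standard diffusion language model. Because no token attends to a later block, the keys and values of completed blocks do not change and could in principle be reused across denoising steps.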
