BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning
Ruiheng Wang, Shuanghao Bai, Haoran Zhang, Badong Chen, Xiangyu Xu

TL;DR
BlockVLA introduces a block diffusion framework that accelerates autoregressive vision-language-action (VLA) models for robotic tasks. It combines global causal coherence across blocks with parallel generation within each block, which reduces inference latency and speeds up training.
Contribution
BlockVLA adapts pretrained autoregressive backbones into an efficient discrete diffusion policy through block diffusion, reducing inference latency and speeding up training in robotic applications.
Findings
Achieves 3.3× inference acceleration over standard diffusion baselines.
Converges faster in training, especially in complex, long-horizon tasks.
Demonstrates superior performance on LIBERO and SimplerEnv benchmarks.
Abstract
While autoregressive (AR) Vision-Language-Action (VLA) models have demonstrated formidable reasoning capabilities in robotic tasks, their sequential decoding process often incurs high inference latency and may amplify error accumulation during long-horizon execution. Discrete Diffusion Language Models (dLLMs) provide a promising alternative through parallel token refinement, but their practical deployment in robotics remains limited by repeated denoising function evaluations (NFEs) and the difficulty of directly applying standard KV caching to bidirectional iterative decoding. To bridge these paradigms, we propose BlockVLA, a framework that adapts pretrained AR backbones into an efficient discrete diffusion policy through a block diffusion paradigm. BlockVLA maintains autoregressive dependencies at the block level while enabling parallel denoising within each block, thereby combining…
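To make the block-level idea concrete, here is a minimal sketch of a block-causal attention mask: tokens attend to every token in their own block (so a block can be denoised in parallel) and to all earlier blocks (so dependencies stay autoregressive across blocks). The function name `block_causal_mask` and the parameters `seq_len` and `block_size` are illustrative assumptions, not identifiers from the paper, and the masking scheme BlockVLA actually uses in training and inference may differ.

```python
def block_causal_mask(seq_len, block_size):
    """Return a seq_len x seq_len boolean mask (hypothetical helper).

    mask[q][k] is True when query position q may attend to key position k.
    Positions in the same block see each other in both directions;
    positions in later blocks also see every earlier block.
    """
    if seq_len <= 0 or block_size <= 0:
        raise ValueError("seq_len and block_size must be positive")
    # Index of the block each position belongs to.
    block_ids = [i // block_size for i in range(seq_len)]
    return [
        [block_ids[q] >= block_ids[k] for k in range(seq_len)]
        for q in range(seq_len)
    ]


if __name__ == "__main__":
    # Six tokens in blocks of two: rows are queries, columns are keys.
    for row in block_causal_mask(6, 2):
        print(" ".join("1" if allowed else "0" for allowed in row))
```

With `block_size=1` this reduces to the ordinary causal mask of an AR decoder, and with `block_size=seq_len` it becomes fully bidirectional, which is one way to see block diffusion as sitting between the two regimes. Because earlier blocks are never revised once a later block starts, their keys and values could in principle be cached at the block level.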
