Fast-dVLM: Efficient Block-Diffusion VLM via Direct Conversion from Autoregressive VLM

Chengyue Wu; Shiyi Lan; Yonggan Fu; Sensen Gao; Jin Wang; Jincheng Yu; Jose M. Alvarez; Pavlo Molchanov; Ping Luo; Song Han; Ligeng Zhu; Enze Xie

arXiv:2604.06832 · cs.CL · April 13, 2026

TL;DR

Fast-dVLM introduces a block-diffusion approach for vision-language models, enabling parallel decoding and significant inference speedup while maintaining quality, especially suited for edge deployment in robotics and autonomous systems.

Contribution

The paper proposes a direct conversion method for efficient block diffusion in VLMs, enabling faster inference without sacrificing multimodal performance.

Findings

01 Fast-dVLM matches autoregressive (AR) models in generation quality across 11 benchmarks.

02 It achieves over 6x inference speedup with FP8 quantization and SGLang.

03 It introduces novel multimodal diffusion adaptations and techniques for effective block diffusion.

Abstract

Vision-language models (VLMs) predominantly rely on autoregressive decoding, which generates tokens one at a time and fundamentally limits inference throughput. This limitation is especially acute in physical AI scenarios such as robotics and autonomous driving, where VLMs are deployed on edge devices at batch size one, making AR decoding memory-bandwidth-bound and leaving hardware parallelism underutilized. While block-wise discrete diffusion has shown promise for parallel text generation, extending it to VLMs remains challenging due to the need to jointly handle continuous visual representations and discrete text tokens while preserving pretrained multimodal capabilities. We present Fast-dVLM, a block-diffusion-based VLM that enables KV-cache-compatible parallel decoding and speculative block decoding for inference acceleration. We systematically compare two AR-to-diffusion conversion…
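
The abstract's claim that batch-size-one AR decoding is memory-bandwidth-bound can be illustrated with a back-of-the-envelope estimate. The sketch below is not from the paper: it assumes a simplified model in which every forward pass streams all weights from memory once, and it ignores KV-cache traffic, activations, and compute time. The function names, parameter count, bandwidth figure, and tokens-per-pass value are all hypothetical and chosen only for illustration.

```python
# Roofline-style estimate of batch-1 decode throughput.
# Assumption: each forward pass reads every weight once, and memory
# bandwidth is the only bottleneck. All numbers below are hypothetical.


def ar_tokens_per_second(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed when one forward pass emits one token."""
    if weight_bytes <= 0 or bandwidth_bytes_per_s <= 0:
        raise ValueError("weight_bytes and bandwidth_bytes_per_s must be positive")
    return bandwidth_bytes_per_s / weight_bytes


def parallel_tokens_per_second(
    weight_bytes: float, bandwidth_bytes_per_s: float, tokens_per_pass: float
) -> float:
    """Same bound when one forward pass finalizes several tokens on average."""
    if tokens_per_pass < 1:
        raise ValueError("tokens_per_pass must be at least 1")
    return tokens_per_pass * ar_tokens_per_second(weight_bytes, bandwidth_bytes_per_s)


if __name__ == "__main__":
    params = 3e9        # hypothetical parameter count
    bandwidth = 100e9   # hypothetical edge-device bandwidth, bytes per second

    fp16_ar = ar_tokens_per_second(params * 2, bandwidth)  # 2 bytes per weight
    fp8_ar = ar_tokens_per_second(params * 1, bandwidth)   # 1 byte per weight
    fp8_block = parallel_tokens_per_second(params * 1, bandwidth, tokens_per_pass=3.0)

    print(f"FP16, one token per pass:    {fp16_ar:.1f} tokens/s")
    print(f"FP8, one token per pass:     {fp8_ar:.1f} tokens/s")
    print(f"FP8, three tokens per pass:  {fp8_block:.1f} tokens/s")
```

Under these assumptions the bound scales linearly with both the number of tokens finalized per pass and the reduction in bytes per weight. This suggests why parallel decoding and lower-precision weights might compound, although real speedups depend on factors this estimate leaves out.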
