TL;DR
Fast-dVLM introduces a block-diffusion approach for vision-language models (VLMs) that enables parallel decoding and a significant inference speedup while maintaining generation quality. It targets edge deployment in robotics and autonomous systems.
Contribution
The paper proposes a direct conversion method for efficient block diffusion in VLMs, enabling faster inference without sacrificing multimodal performance.
Findings
Fast-dVLM matches autoregressive (AR) models in generation quality across 11 benchmarks.
It achieves over 6x inference speedup when combined with FP8 quantization and SGLang.
It introduces multimodal diffusion adaptations and techniques for effective block diffusion.
Abstract
Vision-language models (VLMs) predominantly rely on autoregressive decoding, which generates tokens one at a time and fundamentally limits inference throughput. This limitation is especially acute in physical AI scenarios such as robotics and autonomous driving, where VLMs are deployed on edge devices at batch size one, making AR decoding memory-bandwidth-bound and leaving hardware parallelism underutilized. While block-wise discrete diffusion has shown promise for parallel text generation, extending it to VLMs remains challenging due to the need to jointly handle continuous visual representations and discrete text tokens while preserving pretrained multimodal capabilities. We present Fast-dVLM, a block-diffusion-based VLM that enables KV-cache-compatible parallel decoding and speculative block decoding for inference acceleration. We systematically compare two AR-to-diffusion conversion…
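To make the contrast between token-by-token AR decoding and block-wise parallel decoding concrete, here is a minimal sketch that only counts forward passes. It is not the paper's algorithm: `toy_denoiser` is a hypothetical stand-in for a model call, the random confidences are made up, and the confidence-threshold commit rule is an assumption chosen for illustration.

```python
import random

MASK = None  # marks a position in the current block that is not decoded yet


def toy_denoiser(prefix, block, rng):
    """Hypothetical stand-in for one model forward pass (assumption, not a real API).

    Proposes a (token, confidence) pair for every masked position in the block.
    """
    proposals = {}
    for i, tok in enumerate(block):
        if tok is MASK:
            proposals[i] = (rng.randrange(1000), rng.random())
    return proposals


def decode_block_diffusion(num_tokens, block_size, threshold, seed=0):
    """Decode block by block, committing several tokens per forward pass."""
    rng = random.Random(seed)
    output = []
    forward_passes = 0
    while len(output) < num_tokens:
        size = min(block_size, num_tokens - len(output))
        block = [MASK] * size
        while any(tok is MASK for tok in block):
            proposals = toy_denoiser(output, block, rng)
            forward_passes += 1
            # Commit every masked position whose confidence clears the threshold.
            accepted = [i for i, (_, conf) in proposals.items() if conf >= threshold]
            if not accepted:
                # Guarantee progress: commit the single most confident proposal.
                accepted = [max(proposals, key=lambda i: proposals[i][1])]
            for i in accepted:
                block[i] = proposals[i][0]
        # A finished block joins the prefix, which is what a KV cache would hold.
        output.extend(block)
    return output, forward_passes


def autoregressive_passes(num_tokens):
    """AR decoding needs one forward pass per generated token."""
    return num_tokens


if __name__ == "__main__":
    tokens, passes = decode_block_diffusion(num_tokens=64, block_size=8, threshold=0.5)
    print(f"AR passes: {autoregressive_passes(64)}, block-diffusion passes: {passes}")
```

In this toy setting the number of passes falls between one per block (every proposal accepted) and one per token (nothing clears the threshold), which is the range in which a parallel decoder can save work at batch size one. Real speedups depend on the model, hardware, and acceptance behavior, none of which this sketch models.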
