SDAR-VL: Stable and Efficient Block-wise Diffusion for Vision-Language Understanding

Shuang Cheng; Yuhua Jiang; Zineng Zhou; Dawei Liu; Wang Tao; Linfeng Zhang; Biqing Qi; and Bowen Zhou

arXiv:2512.14068·cs.CV·December 17, 2025

TL;DR

SDAR-VL is a block-wise discrete diffusion framework for vision-language understanding that improves training stability, efficiency, and performance over conventional block diffusion, and sets new state-of-the-art results among diffusion-based vision-language models.

Contribution

SDAR-VL is the first systematic application of block-wise discrete diffusion to large-scale vision-language understanding, paired with an integrated framework for stable and efficient training.

Findings

01 Outperforms conventional block diffusion in efficiency and stability

02 Sets new state-of-the-art among diffusion-based VLU models

03 Matches or surpasses strong autoregressive baselines

Abstract

Block-wise discrete diffusion offers an attractive balance between parallel generation and causal dependency modeling, making it a promising backbone for vision-language modeling. However, its practical adoption has been limited by high training cost, slow convergence, and instability, which have so far kept it behind strong autoregressive (AR) baselines. We present SDAR-VL, the first systematic application of block-wise discrete diffusion to large-scale vision-language understanding (VLU), together with an integrated framework for efficient and stable training. This framework unifies three components: (1) Asynchronous Block-wise Noise Scheduling to diversify supervision within each batch; (2) Effective Mask Ratio Scaling for unbiased loss normalization under stochastic masking; and (3) a Progressive Beta Noise Curriculum that increases…
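The abstract names the first two components but gives no formulas, so the following is only a rough sketch of one plausible reading, not the authors' implementation. It assumes that each block draws its own mask ratio independently (here from a Beta distribution) and that the loss for a block is scaled by the realized fraction of masked tokens rather than the sampled ratio. The names `MASK`, `mask_blocks`, `normalized_loss`, and the parameters `alpha` and `beta` are hypothetical; the truncated curriculum description is left out.

```python
import random

MASK = -1  # hypothetical mask token id (assumption, not from the paper)


def mask_blocks(tokens, block_size, rng, alpha=1.0, beta=1.0):
    """Mask each block with its own independently sampled ratio.

    Assumption: the per-block noise level t is drawn from Beta(alpha, beta);
    alpha = beta = 1 reduces to a uniform draw on [0, 1].
    """
    noisy, is_masked, ratios = [], [], []
    for start in range(0, len(tokens), block_size):
        block = tokens[start:start + block_size]
        t = rng.betavariate(alpha, beta)  # sampled noise level for this block
        flags = [rng.random() < t for _ in block]  # stochastic per-token masking
        noisy.extend(MASK if f else tok for tok, f in zip(block, flags))
        is_masked.extend(flags)
        ratios.append(t)
    return noisy, is_masked, ratios


def normalized_loss(token_losses, is_masked, block_size):
    """Average block losses, each scaled by its realized mask ratio.

    Assumption: "effective" ratio = masked tokens / block length, used in
    place of the sampled t. Blocks with no masked tokens are skipped.
    """
    total, n_blocks = 0.0, 0
    for start in range(0, len(token_losses), block_size):
        flags = is_masked[start:start + block_size]
        losses = token_losses[start:start + block_size]
        n_masked = sum(flags)
        if n_masked == 0:
            continue
        effective_ratio = n_masked / len(flags)
        masked_sum = sum(l for l, f in zip(losses, flags) if f)
        total += masked_sum / (effective_ratio * len(flags))
        n_blocks += 1
    return total / n_blocks if n_blocks else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    noisy, flags, ratios = mask_blocks(list(range(8)), block_size=4, rng=rng)
    print(noisy, ratios)
    print(normalized_loss([1.0] * 8, flags, block_size=4))
```

Under this reading, dividing by the realized ratio makes each block's term the mean loss over the tokens that were actually masked, so a block that happens to mask fewer tokens than its sampled ratio suggests is not under-weighted.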

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis