SDAR-VL: Stable and Efficient Block-wise Diffusion for Vision-Language Understanding
Shuang Cheng, Yuhua Jiang, Zineng Zhou, Dawei Liu, Wang Tao, Linfeng Zhang, Biqing Qi, and Bowen Zhou

TL;DR
SDAR-VL is a block-wise discrete diffusion framework for vision-language understanding that improves training stability, efficiency, and performance over conventional block diffusion and sets new state-of-the-art results among diffusion-based vision-language understanding models.
Contribution
SDAR-VL is the first systematic application of block-wise discrete diffusion to large-scale vision-language understanding, paired with an integrated framework for stable and efficient training.
Findings
Outperforms conventional block diffusion in efficiency and stability
Sets new state-of-the-art among diffusion-based VLU models
Matches or surpasses strong autoregressive baselines
Abstract
Block-wise discrete diffusion offers an attractive balance between parallel generation and causal dependency modeling, making it a promising backbone for vision-language modeling. However, its practical adoption has been limited by high training cost, slow convergence, and instability, which have so far kept it behind strong autoregressive (AR) baselines. We present SDAR-VL, the first systematic application of block-wise discrete diffusion to large-scale vision-language understanding (VLU), together with an integrated framework for efficient and stable training. This framework unifies three components: (1) Asynchronous Block-wise Noise Scheduling to diversify supervision within each batch; (2) Effective Mask Ratio Scaling for unbiased loss normalization under stochastic masking; and (3) a Progressive Beta Noise Curriculum that increases…
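The first two components can be illustrated with a minimal sketch. It is one plausible reading of the abstract, not the authors' implementation: each block draws its own mask ratio independently (here from a Beta distribution with fixed, hypothetical parameters), tokens are masked stochastically, and each block's loss is normalized by the fraction of tokens actually masked rather than by the nominal ratio. All function names, parameters, and the choice of a per-token Bernoulli mask are assumptions made for illustration.

```python
import random


def sample_block_ratios(num_blocks, alpha, beta, rng):
    """Draw one mask ratio per block, independently (hypothetical helper)."""
    return [rng.betavariate(alpha, beta) for _ in range(num_blocks)]


def mask_blocks(num_tokens, block_size, ratios, rng):
    """Mask each token independently with the ratio of the block it belongs to."""
    return [rng.random() < ratios[i // block_size] for i in range(num_tokens)]


def block_losses(token_losses, mask, block_size):
    """Normalize each block's masked-token loss by its realized mask ratio.

    Assumption: "effective mask ratio" means masked_count / block_length,
    so the denominator equals the number of tokens actually masked.
    """
    out = []
    for start in range(0, len(token_losses), block_size):
        block_mask = mask[start:start + block_size]
        block_loss = token_losses[start:start + block_size]
        n_masked = sum(block_mask)
        if n_masked == 0:
            out.append(0.0)  # nothing was masked, so there is no supervision
            continue
        effective_ratio = n_masked / len(block_mask)
        total = sum(l for l, m in zip(block_loss, block_mask) if m)
        out.append(total / (effective_ratio * len(block_mask)))
    return out


rng = random.Random(0)
ratios = sample_block_ratios(num_blocks=3, alpha=2.0, beta=2.0, rng=rng)
mask = mask_blocks(num_tokens=12, block_size=4, ratios=ratios, rng=rng)
losses = block_losses([1.0] * 12, mask, block_size=4)
```

Because every block gets its own ratio, a single sequence contributes supervision at several noise levels at once, and dividing by the realized ratio keeps the per-block loss on the same scale however many tokens happened to be masked.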
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
