DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models

Lunbin Zeng; Jingfeng Yao; Bencheng Liao; Hongyuan Tao; Wenyu Liu; Xinggang Wang

arXiv:2512.15713 · cs.CV · April 1, 2026

TL;DR

DiffusionVL introduces a method for converting pretrained autoregressive (AR) vision language models into diffusion vision language models. The converted models reach strong performance with less training data and faster inference, narrowing the gap between AR and diffusion approaches.

Contribution

The paper presents an efficient diffusion finetuning procedure that turns existing AR models into diffusion vision language models without changing the backbone architecture.

Findings

01. DiffusionVL achieves a 34.4% improvement on the MMMU-Pro benchmark.

02. It attains a 37.5% gain on the MME benchmark.

03. The method doubles inference speed compared to prior approaches.

Abstract

Diffusion-based decoding has recently emerged as an appealing alternative to autoregressive (AR) generation, offering the potential to update multiple tokens in parallel and reduce latency. However, diffusion vision language models (dVLMs) still lag significantly behind mainstream autoregressive vision language models. This is due to the scarcity and weaker performance of base diffusion language models (dLLMs) compared with their autoregressive counterparts. This raises a natural question: Can we build high-performing dVLMs directly from existing powerful AR models, without relying on dLLMs? We propose DiffusionVL, a family of dVLMs obtained by translating pretrained AR models into the diffusion paradigm via an efficient diffusion finetuning procedure that changes the training objective and decoding process while keeping the backbone architecture intact. Through an efficient diffusion…
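The abstract says the finetuning changes the training objective and decoding process while leaving the backbone intact. As a rough illustration of what a change of objective can look like, the sketch below contrasts next-token targets with a generic masked-token corruption of the kind used by many discrete diffusion language models. It is an assumption-laden toy, not the paper's procedure: `MASK_ID`, `ar_targets`, `diffusion_corrupt`, and the independent per-token masking scheme are all hypothetical choices made for illustration.

```python
import random

MASK_ID = -1  # hypothetical mask-token id, chosen only for this illustration


def ar_targets(tokens):
    """Autoregressive objective: each prefix predicts the next token."""
    return [(tokens[: i + 1], tokens[i + 1]) for i in range(len(tokens) - 1)]


def diffusion_corrupt(tokens, mask_ratio, rng):
    """Generic masked-diffusion corruption (assumed, not taken from the paper).

    Each token is replaced by MASK_ID independently with probability
    mask_ratio. The model would then be trained to recover the original
    tokens at the masked positions, all in parallel.
    """
    corrupted, masked_positions = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_ratio:
            corrupted.append(MASK_ID)
            masked_positions.append(i)
        else:
            corrupted.append(tok)
    return corrupted, masked_positions


rng = random.Random(0)  # fixed seed so the toy run is repeatable
tokens = [11, 12, 13, 14, 15, 16]
corrupted, masked_positions = diffusion_corrupt(tokens, mask_ratio=0.5, rng=rng)
```

In this toy setup the same token sequence serves both objectives. Only the way inputs are corrupted and targets are selected differs, which is consistent with the claim that the architecture can stay unchanged.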

Code & Models

Repositories

hustvl/DiffusionVL (GitHub)
