Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone

Jiacheng Ye; Shansan Gong; Jiahui Gao; Junming Fan; Shuang Wu; Wei Bi; Haoli Bai; Lifeng Shang; Lingpeng Kong

arXiv:2512.22615 · cs.CV · January 6, 2026

TL;DR

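This paper introduces Dream-VL, a diffusion-based vision-language model, and Dream-VLA, a vision-language-action model built on it. Dream-VL is comparable to top-tier autoregressive VLMs trained on open data and shows superior potential on visual planning tasks; Dream-VLA extends it to robotic control, where the diffusion backbone enables faster fine-tuning.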

Contribution

The paper presents open vision-language and vision-language-action models built on a diffusion language model backbone, and argues that this backbone offers advantages over autoregressive models in visual planning and robotic control.

Findings

01. Dream-VL achieves state-of-the-art performance among previous diffusion-based VLMs (dVLMs).

02. Dream-VLA surpasses leading models on multiple robotic benchmarks.

03. The diffusion backbone enables faster fine-tuning and action chunking.

Abstract

While autoregressive Large Vision-Language Models (VLMs) have achieved remarkable success, their sequential generation often limits their efficacy in complex visual planning and dynamic robotic control. In this work, we investigate the potential of constructing Vision-Language Models upon diffusion-based large language models (dLLMs) to overcome these limitations. We introduce Dream-VL, an open diffusion-based VLM (dVLM) that achieves state-of-the-art performance among previous dVLMs. Dream-VL is comparable to top-tier AR-based VLMs trained on open data on various benchmarks but exhibits superior potential when applied to visual planning tasks. Building upon Dream-VL, we introduce Dream-VLA, a dLLM-based Vision-Language-Action model (dVLA) developed through continuous pre-training on open robotic datasets. We demonstrate that the natively bidirectional nature of this diffusion backbone…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis