Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process

Jiayi Chen; Wenxuan Song; Pengxiang Ding; Ziyang Zhou; Han Zhao; Feilong Tang; Donglin Wang; Haoang Li

arXiv:2511.01718 · cs.RO · March 26, 2026

TL;DR

Unified Diffusion VLA introduces a joint diffusion process that integrates vision, language, and actions into a single model, enabling more efficient and synergistic understanding, generation, and action execution in embodied agents.

Contribution

The paper proposes a novel joint diffusion process and a unified tokenized space for multimodal integration, improving synergy and efficiency in vision-language-action models.

Findings

01. Achieves state-of-the-art results on CALVIN, LIBERO, and SimplerEnv benchmarks.

02. Offers 4× faster inference compared to autoregressive methods.

03. Demonstrates effective real-world application and in-depth analysis.

Abstract

Vision-language-action (VLA) models aim to understand natural language instructions and visual observations and to execute corresponding actions as an embodied agent. Recent work integrates future images into the understanding-acting loop, yielding unified VLAs that jointly understand, generate, and act -- reading text and images and producing future images and actions. However, these models either rely on external experts for modality unification or treat image generation and action prediction as separate processes, limiting the benefits of direct synergy between these tasks. Our core philosophy is to optimize generation and action jointly through a synchronous denoising process, where the iterative refinement enables actions to evolve from initialization, under constant and sufficient visual guidance. We ground this philosophy in our proposed Unified Diffusion VLA and Joint Discrete…
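
The abstract above is truncated, so the exact denoising schedule and network are not stated here. As a rough illustration only, the idea of refining image and action tokens together in one synchronous loop can be sketched with a toy mask-and-reveal decoder. Everything in this sketch is an assumption made for illustration and is not the authors' implementation: the `MASK` sentinel, the `toy_predictor` stand-in (which returns random tokens and confidences instead of running a network), the even reveal schedule, and all function and parameter names.

```python
import math
import random

MASK = -1  # hypothetical sentinel for a not-yet-decoded token


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for a real network: propose (token, confidence) per masked slot."""
    proposals = {}
    for i, tok in enumerate(tokens):
        if tok == MASK:
            proposals[i] = (rng.randrange(vocab_size), rng.random())
    return proposals


def joint_denoise(num_image_tokens, num_action_tokens, num_steps,
                  vocab_size=16, seed=0):
    """Decode image and action tokens together in one shared sequence."""
    rng = random.Random(seed)
    total = num_image_tokens + num_action_tokens
    tokens = [MASK] * total  # both modalities start fully masked
    for step in range(num_steps):
        proposals = toy_predictor(tokens, vocab_size, rng)
        if not proposals:
            break
        # Assumed schedule: reveal an even share of the remaining masked slots.
        remaining_steps = num_steps - step
        k = math.ceil(len(proposals) / remaining_steps)
        # Keep the k most confident proposals, regardless of modality.
        ranked = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)
        for idx, (tok, _conf) in ranked[:k]:
            tokens[idx] = tok
    image_tokens = tokens[:num_image_tokens]
    action_tokens = tokens[num_image_tokens:]
    return image_tokens, action_tokens


image_tokens, action_tokens = joint_denoise(
    num_image_tokens=8, num_action_tokens=4, num_steps=4)
```

In this toy version the loop runs for a fixed number of steps and fills several positions per step, whereas token-by-token autoregressive decoding needs one step per token. That is a generic property of parallel iterative decoding and is not a reproduction of the paper's reported speedup.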

Code & Models

Models

chenpyyy/UD-VLA_CALVIN_ABCD_D (Hugging Face model, 94 downloads, 1 like)

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning