Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process
Jiayi Chen, Wenxuan Song, Pengxiang Ding, Ziyang Zhou, Han Zhao, Feilong Tang, Donglin Wang, Haoang Li

TL;DR
Unified Diffusion VLA introduces a joint diffusion process that integrates vision, language, and actions into a single model, enabling more efficient and synergistic understanding, generation, and action execution in embodied agents.
Contribution
The paper proposes a novel joint diffusion process and a unified tokenized space for multimodal integration, improving synergy and efficiency in vision-language-action models.
Findings
Achieves state-of-the-art results on CALVIN, LIBERO, and SimplerEnv benchmarks.
Offers 4× faster inference compared to autoregressive methods.
Demonstrates effective real-world application and in-depth analysis.
Abstract
Vision-language-action (VLA) models aim to understand natural language instructions and visual observations and to execute corresponding actions as an embodied agent. Recent work integrates future images into the understanding-acting loop, yielding unified VLAs that jointly understand, generate, and act -- reading text and images and producing future images and actions. However, these models either rely on external experts for modality unification or treat image generation and action prediction as separate processes, limiting the benefits of direct synergy between these tasks. Our core philosophy is to optimize generation and action jointly through a synchronous denoising process, where the iterative refinement enables actions to evolve from initialization, under constant and sufficient visual guidance. We ground this philosophy in our proposed Unified Diffusion VLA and Joint Discrete…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
