MMaDA-VLA: Large Diffusion Vision-Language-Action Model with Unified Multi-Modal Instruction and Generation

Yang Liu; Pengxiang Ding; Tengyue Jiang; Xudong Wang; Wenxuan Song; Minghui Lin; Han Zhao; Hongyin Zhang; Zifeng Zhuang; Wei Zhao; Siteng Huang; Jinkui Shi; Donglin Wang

arXiv:2603.25406·cs.RO·March 30, 2026

TL;DR

MMaDA-VLA introduces a unified diffusion-based model for vision-language-action tasks, enabling consistent long-horizon robot control from visual and language inputs without extra modules.

Contribution

It proposes a native discrete diffusion framework that jointly models multi-modal understanding and generation in a single, unified architecture for robot manipulation.

Findings

01 Achieves 98.0% success on the LIBERO benchmark.

02 Demonstrates state-of-the-art performance in real-world tasks.

03 Improves long-horizon consistency through iterative denoising.
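The iterative denoising mentioned in the findings can be illustrated with a small confidence-based unmasking loop. This is only a sketch under stated assumptions, not the paper's sampler: `toy_logits` is a random stand-in for the denoising backbone, and the `MASK` sentinel, the even reveal schedule, and the function names are all hypothetical. The point is that every position is predicted in parallel at each step and the most confident masked positions are committed first, in no fixed left-to-right order.

```python
import numpy as np

MASK = -1  # hypothetical sentinel marking a still-masked position


def toy_logits(tokens, vocab_size, rng):
    """Random stand-in for the denoising backbone (assumption, not the real model)."""
    return rng.normal(size=(len(tokens), vocab_size))


def iterative_unmask(seq_len, vocab_size, steps, seed=0):
    """Start fully masked and commit the most confident predictions at each step."""
    rng = np.random.default_rng(seed)
    tokens = np.full(seq_len, MASK, dtype=np.int64)
    for step in range(steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        logits = toy_logits(tokens, vocab_size, rng)
        # Softmax over the vocabulary for every position in parallel.
        z = logits - logits.max(axis=1, keepdims=True)
        probs = np.exp(z)
        probs /= probs.sum(axis=1, keepdims=True)
        pred = probs.argmax(axis=1)
        conf = probs.max(axis=1)
        # Spread the remaining masked positions evenly over the remaining steps.
        k = int(np.ceil(masked.size / (steps - step)))
        # Reveal the k masked positions with the highest confidence.
        chosen = masked[np.argsort(-conf[masked])[:k]]
        tokens[chosen] = pred[chosen]
    return tokens


if __name__ == "__main__":
    print(iterative_unmask(seq_len=16, vocab_size=32, steps=4))
```

With a real backbone, the logits would depend on the tokens already revealed, so later steps can condition on earlier commitments across the whole sequence.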

Abstract

Vision-Language-Action (VLA) models aim to control robots for manipulation from visual observations and natural-language instructions. However, existing hierarchical and autoregressive paradigms often introduce architectural overhead, suffer from temporal inconsistency and long-horizon error accumulation, and lack a mechanism to capture environment dynamics without extra modules. To this end, we present MMaDA-VLA, a fully native pre-trained large diffusion VLA model that unifies multi-modal understanding and generation in a single framework. Our key idea is a native discrete diffusion formulation that embeds language, images, and continuous robot controls into one discrete token space and trains a single backbone with masked token denoising to jointly generate a future goal observation and an action chunk in parallel. Iterative denoising enables global, order-free refinement, improving…
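One way to picture placing continuous robot controls in a discrete token space is uniform binning. The sketch below is an assumption for illustration only: the abstract does not say how actions are tokenized, and the bin count, the normalized range `[-1, 1]`, and the `ACTION_OFFSET` used to place action tokens in a shared vocabulary are all hypothetical values.

```python
import numpy as np

NUM_BINS = 256          # hypothetical number of bins per action dimension
ACTION_OFFSET = 50_000  # hypothetical start of the action range in a shared vocabulary


def actions_to_tokens(actions, low=-1.0, high=1.0):
    """Map continuous actions in [low, high] to integer token IDs by uniform binning."""
    a = np.clip(np.asarray(actions, dtype=np.float64), low, high)
    bins = np.floor((a - low) / (high - low) * NUM_BINS).astype(np.int64)
    bins = np.minimum(bins, NUM_BINS - 1)  # the upper edge falls into the last bin
    return bins + ACTION_OFFSET


def tokens_to_actions(tokens, low=-1.0, high=1.0):
    """Decode token IDs back to the center value of each bin."""
    bins = np.asarray(tokens, dtype=np.int64) - ACTION_OFFSET
    return low + (bins + 0.5) / NUM_BINS * (high - low)


if __name__ == "__main__":
    # A toy action chunk: 2 time steps, 3 action dimensions.
    chunk = np.array([[-1.0, 0.0, 1.0], [0.5, -0.5, 0.25]])
    tokens = actions_to_tokens(chunk)
    print(tokens)
    print(tokens_to_actions(tokens))
```

Under this scheme the round-trip error is bounded by half a bin width, so the bin count trades vocabulary size against control precision.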

Code & Models

Repositories

yliu-cs/MMaDA-VLA (GitHub)

Models

yliu-cs/MMaDA-VLA (Hugging Face model, 116 downloads)
