MMaDA: Multimodal Large Diffusion Language Models

Ling Yang; Ye Tian; Bowen Li; Xinchen Zhang; Ke Shen; Yunhai Tong; Mengdi Wang

arXiv:2505.15809·cs.CV·September 26, 2025

TL;DR

MMaDA is a unified multimodal diffusion model trained with a mixed long chain-of-thought fine-tuning strategy and a new RL algorithm, UniGRPO. The paper reports that it outperforms comparable baselines in textual reasoning, multimodal understanding, and text-to-image generation.

Contribution

The paper presents a novel unified diffusion architecture, a mixed chain-of-thought fine-tuning strategy, and a new RL algorithm, UniGRPO, for improved multimodal foundation modeling.

Findings

01 Outperforms LLaMA-3-7B and Qwen2-7B in textual reasoning
02 Surpasses Show-o and SEED-X in multimodal understanding
03 Exceeds SDXL and Janus in text-to-image generation

Abstract

We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations: (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types. (ii) We implement a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities. By aligning reasoning processes between textual and visual domains, this strategy facilitates cold-start training for the final reinforcement learning (RL) stage, thereby enhancing the model's ability to handle complex tasks…
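To make the "shared probabilistic formulation" in item (i) more concrete, here is a minimal sketch of a generic masked-token discrete diffusion loss. It assumes that text and image tokens live in one shared vocabulary and that a single predictor fills in masked positions. The names `MASK_ID`, `mask_tokens`, `masked_diffusion_loss`, and `uniform_predictor` are hypothetical, and the 1/t weighting is a common choice for this family of objectives, not something confirmed by the truncated abstract above.

```python
import math
import random

MASK_ID = 0  # hypothetical id reserved for the mask token; never a clean token


def mask_tokens(tokens, t, rng):
    """Independently replace each token with MASK_ID with probability t."""
    return [MASK_ID if rng.random() < t else tok for tok in tokens]


def log_softmax(logits):
    """Numerically stable log-softmax over one row of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def masked_diffusion_loss(tokens, noisy, logits, t):
    """Cross-entropy on masked positions only, scaled by 1 / (t * length).

    tokens: clean token ids (text and image ids in one assumed shared vocabulary)
    noisy:  the same sequence after mask_tokens
    logits: per-position scores over the vocabulary, from any predictor
    """
    total = 0.0
    for clean, seen, row in zip(tokens, noisy, logits):
        if seen == MASK_ID:  # unmasked positions contribute nothing
            total -= log_softmax(row)[clean]
    return total / (t * len(tokens))


def uniform_predictor(noisy, vocab_size):
    """Stand-in for the model: equal score for every vocabulary entry."""
    return [[0.0] * vocab_size for _ in noisy]


# Toy sequence: ids 1..4 stand for text tokens, 5..7 for image codes (assumed split).
rng = random.Random(0)
tokens = [3, 5, 1, 7, 2, 6]
noisy = mask_tokens(tokens, t=0.5, rng=rng)
loss = masked_diffusion_loss(tokens, noisy, uniform_predictor(noisy, 8), t=0.5)
print(noisy, loss)
```

The point of the sketch is that the loss never inspects which modality a token belongs to: the same masking and the same cross-entropy apply to every position, which is one way to read "modality-agnostic" here.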

Code & Models

Repositories

gen-verse/mmada (jax, official)

Taxonomy

Methods: Diffusion