Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

Lijiang Li; Zuwei Long; Yunhang Shen; Heting Gao; Haoyu Cao; Xing Sun; Caifeng Shan; Ran He; Chaoyou Fu

arXiv:2603.06577·cs.CV·March 9, 2026

TL;DR

Omni-Diffusion introduces a unified mask-based discrete diffusion model for multimodal understanding and generation across text, speech, and images, matching or outperforming existing multimodal systems on a range of benchmarks.

Contribution

It is the first multimodal language model built entirely on mask-based discrete diffusion, unifying understanding and generation across multiple modalities.

Findings

01. Outperforms or matches existing multimodal systems on benchmarks

02. Supports complex multimodal tasks involving multiple modalities

03. Demonstrates the potential of diffusion models for multimodal foundation modeling

Abstract

While recent multimodal large language models (MLLMs) have made impressive strides, they predominantly employ a conventional autoregressive architecture as their backbone, leaving significant room to explore effective and efficient alternatives in architectural design. Concurrently, recent studies have successfully applied discrete diffusion models to various domains, such as visual understanding and image generation, revealing their considerable potential as a promising backbone for multimodal systems. Drawing inspiration from this pioneering research, we introduce Omni-Diffusion, the first any-to-any multimodal language model built entirely on mask-based discrete diffusion models, which unifies understanding and generation across text, speech, and images. Omni-Diffusion employs a unified mask-based discrete diffusion model to directly capture the joint distribution over discrete…
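For readers unfamiliar with the term, the following is a minimal, generic sketch of how mask-based discrete diffusion is commonly set up: a forward process that replaces tokens with a mask symbol, and a sampler that starts fully masked and commits predictions over several steps. It is not the authors' implementation. The names `MASK`, `forward_mask`, `toy_denoiser`, and `sample` are hypothetical, the denoiser is a random stand-in for a trained network, and the confidence-ordered unmasking schedule is one common choice assumed here for illustration.

```python
import random

MASK = -1  # hypothetical id for the mask token (assumption, not from the paper)

def forward_mask(tokens, t, rng):
    """Forward process: mask each token independently with probability t in [0, 1]."""
    return [MASK if rng.random() < t else tok for tok in tokens]

def toy_denoiser(noisy, vocab_size, rng):
    """Random stand-in for a trained model: one (token, confidence) pair per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in noisy]

def sample(length, vocab_size, steps, rng):
    """Reverse process: start fully masked, commit the most confident predictions each step."""
    seq = [MASK] * length
    for step in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked:
            break
        preds = toy_denoiser(seq, vocab_size, rng)
        # Spread the remaining masked positions evenly over the remaining steps.
        remaining_steps = steps - step
        k = -(-len(masked) // remaining_steps)  # ceiling division
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:k]:
            seq[i] = preds[i][0]
    return seq

if __name__ == "__main__":
    rng = random.Random(0)
    print(forward_mask([3, 1, 4, 1, 5, 9], 0.5, rng))
    print(sample(length=8, vocab_size=16, steps=4, rng=rng))
```

In this sketch the sequence is just a list of integer token ids, so the same loop would apply to any modality that has been tokenized into discrete ids. How Omni-Diffusion tokenizes text, speech, and images, and which unmasking schedule it uses, is not stated in the excerpt above.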

Code & Models

Models

lijiang/Omni-Diffusion (Hugging Face model · 293 downloads · 12 likes)

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling