Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion
Lijiang Li, Zuwei Long, Yunhang Shen, Heting Gao, Haoyu Cao, Xing Sun, Caifeng Shan, Ran He, Chaoyou Fu

TL;DR
Omni-Diffusion introduces a unified mask-based discrete diffusion model for multimodal understanding and generation across text, speech, and images, matching or outperforming existing multimodal systems on benchmarks.
Contribution
Omni-Diffusion is the first any-to-any multimodal language model built entirely on mask-based discrete diffusion, unifying understanding and generation across text, speech, and images.
Findings
Outperforms or matches existing multimodal systems on benchmarks
Supports complex tasks that involve multiple modalities at once
Demonstrates the potential of diffusion models for multimodal foundation modeling
Abstract
While recent multimodal large language models (MLLMs) have made impressive strides, they predominantly employ a conventional autoregressive architecture as their backbone, leaving significant room to explore effective and efficient alternatives in architectural design. Concurrently, recent studies have successfully applied discrete diffusion models to various domains, such as visual understanding and image generation, revealing their considerable potential as a promising backbone for multimodal systems. Drawing inspiration from this pioneering research, we introduce Omni-Diffusion, the first any-to-any multimodal language model built entirely on mask-based discrete diffusion models, which unifies understanding and generation across text, speech, and images. Omni-Diffusion employs a unified mask-based discrete diffusion model to directly capture the joint distribution over discrete…
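Illustration
The abstract does not spell out the training or sampling procedure, so the following is only a generic sketch of how mask-based discrete diffusion is commonly described, not the authors' implementation. It assumes a sequence of discrete token ids (which could come from text, speech, or image tokenizers sharing one vocabulary), a hypothetical `MASK` id, and a hypothetical `toy_predict` function standing in for a trained denoising network. The forward process masks tokens at random; generation starts from an all-mask sequence and reveals the most confident predictions over a fixed number of steps.

```python
import random

MASK = -1  # hypothetical id for the special mask token


def forward_mask(tokens, t, rng):
    """Forward (noising) process: mask each token independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def iterative_unmask(length, predict, steps):
    """Reverse process: start fully masked, reveal the most confident tokens each step.

    `predict(seq, i)` must return a (token, confidence) pair for masked position i.
    """
    seq = [MASK] * length
    for step in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked:
            break
        preds = {i: predict(seq, i) for i in masked}
        # Spread the remaining masked positions evenly over the remaining steps
        # (ceiling division), so the final step reveals everything that is left.
        k = -(-len(masked) // (steps - step))
        most_confident = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in most_confident:
            seq[i] = preds[i][0]
    return seq


# Toy stand-in for a trained denoiser: it "knows" a fixed target sequence and is
# more confident about earlier positions. A real model would condition on `seq`.
TARGET = [5, 3, 8, 1, 9, 2, 7, 4]


def toy_predict(seq, i):
    return TARGET[i], 1.0 / (1 + i)


rng = random.Random(0)
noisy = forward_mask(TARGET, 0.5, rng)          # partially masked training input
generated = iterative_unmask(len(TARGET), toy_predict, steps=4)
```

Unlike autoregressive decoding, which emits one token per step from left to right, this kind of sampler can fill in several positions per step and in any order; that is the general property that makes masked diffusion a candidate backbone for any-to-any generation.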
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
