Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation

Chao Liao; Liyang Liu; Xun Wang; Zhengxiong Luo; Xinyu Zhang; Wenliang Zhao; Jie Wu; Liang Li; Zhi Tian; Weilin Huang

arXiv:2505.05472·cs.CV·May 13, 2025

TL;DR

Mogao is an omni foundation model that generates interleaved text and images within a single causal framework, pairing autoregressive text generation with diffusion-based image synthesis.

Contribution

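The paper introduces Mogao, a unified model for interleaved multi-modal generation that combines autoregressive and diffusion techniques, supported by architectural improvements (a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance) and a large-scale training dataset.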

Findings

01. Achieves state-of-the-art multi-modal understanding and generation.
02. Excels in zero-shot image editing and compositional tasks.
03. Produces high-quality, coherent interleaved outputs.

Abstract

Recent progress in unified models for image understanding and generation has been impressive, yet most approaches remain limited to single-modal generation conditioned on multiple modalities. In this paper, we present Mogao, a unified framework that advances this paradigm by enabling interleaved multi-modal generation through a causal approach. Mogao integrates a set of key technical improvements in architecture design, including a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance, which allow it to harness the strengths of both autoregressive models for text generation and diffusion models for high-quality image synthesis. These practical improvements also make Mogao particularly effective at processing arbitrary interleaved sequences of text and images. To further unlock the potential of unified models, we…
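The abstract names multi-modal classifier-free guidance but does not say how it is formulated. As background only, the sketch below shows the standard single-condition form of classifier-free guidance used in diffusion models, where the unconditional noise prediction is pushed toward the conditional one by a guidance scale. It is not Mogao's multi-modal variant; the function name, the toy values, and the plain-list representation are all illustrative assumptions.

```python
from typing import List


def classifier_free_guidance(
    uncond: List[float], cond: List[float], scale: float
) -> List[float]:
    """Standard CFG: uncond + scale * (cond - uncond), applied element-wise.

    Illustrative helper (hypothetical name), not taken from the Mogao paper.
    """
    if len(uncond) != len(cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Hypothetical toy noise predictions for a 3-element latent.
eps_uncond = [0.0, 1.0, -1.0]
eps_cond = [1.0, 1.0, 0.0]

# A scale above 1 extrapolates past the conditional prediction.
guided = classifier_free_guidance(eps_uncond, eps_cond, scale=2.0)
print(guided)
```

With a scale of 1 the result equals the conditional prediction, and with a scale of 0 it equals the unconditional one; larger scales trade sample diversity for stronger adherence to the condition.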

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques

Methods: Diffusion · Sparse Evolutionary Training