Multi-modal Generative AI: Multi-modal LLMs, Diffusions, and the Unification

Xin Wang, Yuwei Zhou, Bin Huang, Hong Chen, and Wenwu Zhu

arXiv:2409.14993 · cs.AI · November 26, 2025

TL;DR

This paper provides a comprehensive overview of multi-modal generative AI, focusing on multi-modal LLMs, diffusion models, and efforts toward unification for improved understanding and generation capabilities.

Contribution

It offers a detailed review of existing models and explores strategies for unifying multi-modal understanding and generation in AI systems.

Findings

1. Analyzes probabilistic modeling in multi-modal LLMs and diffusion models.
2. Discusses architectures such as autoregressive and diffusion-based models.
3. Summarizes datasets used for multi-modal AI pretraining.

Abstract

Multi-modal generative AI (Artificial Intelligence) has attracted increasing attention from both academia and industry. Particularly, two dominant families of techniques have emerged: i) Multi-modal large language models (LLMs) demonstrate impressive ability for multi-modal understanding; and ii) Diffusion models exhibit remarkable multi-modal powers in terms of multi-modal generation. Therefore, this paper provides a comprehensive overview of multi-modal generative AI, including multi-modal LLMs, diffusions, and the unification for understanding and generation. To lay a solid foundation for unified models, we first provide a detailed review of both multi-modal LLMs and diffusion models respectively, including their probabilistic modeling procedure, multi-modal architecture design, and advanced applications to image/video LLMs as well as text-to-image/video generation. Furthermore, we…

Taxonomy

Methods: Softmax · Attention Is All You Need · Diffusion