Multi-modal Generative AI: Multi-modal LLMs, Diffusions, and the Unification
Xin Wang, Yuwei Zhou, Bin Huang, Hong Chen, and Wenwu Zhu

TL;DR
This paper provides a comprehensive overview of multi-modal generative AI, focusing on multi-modal LLMs, diffusion models, and efforts toward unification for improved understanding and generation capabilities.
Contribution
It offers a detailed review of existing models and explores strategies for unifying multi-modal understanding and generation in AI systems.
Findings
Analyzes probabilistic modeling in multi-modal LLMs and diffusion models.
Discusses autoregressive and diffusion-based model architectures.
Summarizes datasets used for multi-modal AI pretraining.
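As a rough orientation for the "probabilistic modeling" item above, the two model families are commonly written as follows. This is a sketch in standard textbook notation (autoregressive factorization and the DDPM-style diffusion objective), not necessarily the notation or exact formulation used in the paper. The symbols c (conditioning input such as an image or text prompt), N (sequence length), \beta_t (noise schedule), and \epsilon_\theta (noise-prediction network) are assumptions introduced here for illustration.

```latex
% Autoregressive (LLM-style) modeling: the sequence likelihood factorizes
% token by token, each token conditioned on its prefix and on the input c.
p_\theta(x_{1:N} \mid c) = \prod_{i=1}^{N} p_\theta\left(x_i \mid x_{<i},\, c\right)

% Diffusion modeling: a fixed forward process adds Gaussian noise step by step.
q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)

% Closed form for sampling x_t directly from x_0,
% with \bar{\alpha}_t = \prod_{s=1}^{t} (1-\beta_s).
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

% Simplified training objective: predict the added noise, given condition c.
L_{\mathrm{simple}} = \mathbb{E}_{x_0,\, \epsilon,\, t}
  \left[ \left\lVert \epsilon - \epsilon_\theta(x_t, t, c) \right\rVert^2 \right]
```

The first line is trained by maximizing next-token likelihood, while the last line is a denoising regression loss. That difference in training objective is one reason unifying understanding and generation in a single model is non-trivial.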
Abstract
Multi-modal generative AI (Artificial Intelligence) has attracted increasing attention from both academia and industry. In particular, two dominant families of techniques have emerged: i) multi-modal large language models (LLMs), which demonstrate impressive ability for multi-modal understanding; and ii) diffusion models, which exhibit remarkable capabilities for multi-modal generation. This paper therefore provides a comprehensive overview of multi-modal generative AI, including multi-modal LLMs, diffusion models, and their unification for understanding and generation. To lay a solid foundation for unified models, we first provide a detailed review of multi-modal LLMs and diffusion models respectively, including their probabilistic modeling procedures, multi-modal architecture designs, and advanced applications to image/video LLMs as well as text-to-image/video generation. Furthermore, we…
Taxonomy
Methods: Softmax · Attention Is All You Need · Diffusion
