A Survey of Generative Categories and Techniques in Multimodal Generative Models
Longzhen Han, Awes Mubarak, Almas Baimagambetov, Nikolaos Polatidis, Thar Baker

TL;DR
This survey comprehensively reviews the evolution, techniques, models, and challenges of multimodal generative models across various output modalities, emphasizing foundational methods, evaluation, and ethical considerations.
Contribution
It introduces a unified taxonomy, evaluation framework, and analysis of cross-modal techniques, addressing current gaps in capability, safety, and governance of multimodal generative models.
Findings
Identifies six primary generative modalities and key techniques enabling cross-modal capabilities.
Proposes a unified evaluation framework focusing on faithfulness, compositionality, and robustness.
Highlights ethical risks like bias, privacy issues, and misuse, with mitigation strategies.
Abstract
Multimodal Generative Models (MGMs) have rapidly evolved beyond text generation, now spanning diverse output modalities including images, music, video, human motion, and 3D objects, by integrating language with other sensory modalities under unified architectures. This survey categorises six primary generative modalities and examines how foundational techniques, namely Self-Supervised Learning (SSL), Mixture of Experts (MoE), Reinforcement Learning from Human Feedback (RLHF), and Chain-of-Thought (CoT) prompting, enable cross-modal capabilities. We analyse key models, architectural trends, and emergent cross-modal synergies, while highlighting transferable techniques and unresolved challenges. Building on a common taxonomy of models and training recipes, we propose a unified evaluation framework centred on faithfulness, compositionality, and robustness, and synthesise evidence from…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Diffusion
