From Efficient Multimodal Models to World Models: A Survey
Xinji Mai, Zeng Tao, Junxiong Lin, Haoran Wang, Yang Chang, Yanlan Kang, Yan Wang, Wenqiang Zhang

TL;DR
This survey reviews recent progress in multimodal large models, their techniques, applications, and challenges, emphasizing their potential to develop world models and achieve artificial general intelligence.
Contribution
It provides a comprehensive overview of key multimodal techniques, discusses integration challenges, and proposes future research directions for advancing multimodal models towards world models.
Findings
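Multimodal models are advancing through techniques such as Multimodal Chain of Thought (M-COT), Multimodal Instruction Tuning (M-IT), and Multimodal In-Context Learning (M-ICL).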
These developments highlight the potential of multimodal large models as a pathway toward artificial general intelligence.
Challenges include unifying multimodal architectures and integrating external reasoning systems.
Abstract
Multimodal Large Models (MLMs) are becoming a significant research focus, combining powerful large language models with multimodal learning to perform complex tasks across different data modalities. This review explores the latest developments and challenges in MLMs, emphasizing their potential in achieving artificial general intelligence and as a pathway to world models. We provide an overview of key techniques such as Multimodal Chain of Thought (M-COT), Multimodal Instruction Tuning (M-IT), and Multimodal In-Context Learning (M-ICL). Additionally, we discuss both the fundamental and specific technologies of multimodal models, highlighting their applications, input/output modalities, and design characteristics. Despite significant advancements, the development of a unified multimodal model remains elusive. We discuss the integration of 3D generation and embodied intelligence to…
Taxonomy
Topics: Speech and dialogue systems · Semantic Web and Ontologies · Multi-Agent Systems and Negotiation
