From Efficient Multimodal Models to World Models: A Survey

Xinji Mai; Zeng Tao; Junxiong Lin; Haoran Wang; Yang Chang; Yanlan Kang; Yan Wang; Wenqiang Zhang

arXiv:2407.00118·cs.LG·July 2, 2024·2 cites

Open Access

TL;DR

This survey reviews recent progress in multimodal large models, their techniques, applications, and challenges, emphasizing their potential to develop world models and achieve artificial general intelligence.

Contribution

It provides a comprehensive overview of key multimodal techniques, discusses integration challenges, and proposes future research directions for advancing multimodal models towards world models.

Findings

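01. Multimodal models are advancing through techniques such as Multimodal Chain of Thought (M-COT), Multimodal Instruction Tuning (M-IT), and Multimodal In-Context Learning (M-ICL).

02. These developments highlight the potential of multimodal large models as a route to artificial general intelligence.

03. Open challenges include unifying multimodal architectures and integrating external reasoning systems.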

Abstract

Multimodal Large Models (MLMs) are becoming a significant research focus, combining powerful large language models with multimodal learning to perform complex tasks across different data modalities. This review explores the latest developments and challenges in MLMs, emphasizing their potential in achieving artificial general intelligence and as a pathway to world models. We provide an overview of key techniques such as Multimodal Chain of Thought (M-COT), Multimodal Instruction Tuning (M-IT), and Multimodal In-Context Learning (M-ICL). Additionally, we discuss both the fundamental and specific technologies of multimodal models, highlighting their applications, input/output modalities, and design characteristics. Despite significant advancements, the development of a unified multimodal model remains elusive. We discuss the integration of 3D generation and embodied intelligence to…

Taxonomy

Topics: Speech and dialogue systems · Semantic Web and Ontologies · Multi-Agent Systems and Negotiation