A Survey on Multimodal Large Language Models
Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, Enhong Chen

TL;DR
This survey reviews recent advances in Multimodal Large Language Models (MLLMs), covering their architectures, capabilities, extensions, and challenges, and notes their potential as a path toward artificial general intelligence.
Contribution
It provides a comprehensive overview of MLLM research, including concepts, techniques, and future directions, serving as a valuable resource for ongoing and future studies.
Findings
MLLMs exhibit emergent capabilities such as image-based storytelling and OCR-free math reasoning.
Research is rapidly advancing to extend MLLMs to more modalities and scenarios.
Challenges include multimodal hallucination and scalability issues.
Abstract
Recently, Multimodal Large Language Models (MLLMs), represented by GPT-4V, have become a rising research hotspot; they use powerful Large Language Models (LLMs) as a brain to perform multimodal tasks. The surprising emergent capabilities of MLLMs, such as writing stories based on images and OCR-free math reasoning, are rare in traditional multimodal methods, suggesting a potential path to artificial general intelligence. To this end, both academia and industry have endeavored to develop MLLMs that can compete with or even outperform GPT-4V, pushing the limit of research at a surprising speed. In this paper, we aim to trace and summarize the recent progress of MLLMs. First, we present the basic formulation of the MLLM and delineate its related concepts, including architecture, training strategy and data, as well as evaluation. Then, we introduce research topics about how MLLMs can be…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
