Multimodal Foundation Models: From Specialists to General-Purpose Assistants
Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, Jianfeng Gao

TL;DR
This survey reviews the evolution of multimodal foundation models from specialized systems to versatile, general-purpose assistants, highlighting recent advances in unified models, multimodal LLMs, and tool chaining.
Contribution
It provides a comprehensive taxonomy and analysis of both established and emerging research areas in multimodal foundation models, emphasizing their transition to general-purpose AI assistants.
Findings
Survey of well-established multimodal models for visual understanding and text-to-image generation.
Analysis of recent advances in unified vision models inspired by large language models.
Discussion of end-to-end training of multimodal LLMs and chaining multimodal tools.
Abstract
This paper presents a comprehensive survey of the taxonomy and evolution of multimodal foundation models that demonstrate vision and vision-language capabilities, focusing on the transition from specialist models to general-purpose assistants. The research landscape encompasses five core topics, categorized into two classes. (i) We start with a survey of well-established research areas: multimodal foundation models pre-trained for specific purposes, including two topics -- methods of learning vision backbones for visual understanding and text-to-image generation. (ii) Then, we present recent advances in exploratory, open research areas: multimodal foundation models that aim to play the role of general-purpose assistants, including three topics -- unified vision models inspired by large language models (LLMs), end-to-end training of multimodal LLMs, and chaining multimodal tools with…
Code & Models
- 🤗 saurabh-straive/llava_100k_finetuned (model)
- 🤗 Straive/llava-1.5-13b-lora-100k-8-mar (model)
- 🤗 saurabh-straive/llava-1-5 (model)
- 🤗 GDinesh/llava-1-5 (model)
- 🤗 starriver030515/LLaVA (model)
- 🤗 mylesgoose/Llama-3.1-Minitron-4B-Llava-Nvidia-siglip-ov (model)
- 🤗 gradguy/model1 (model)
- 🤗 chouss/llava-spat (model)
- 🤗 zooblastlbz/id-align (model)
- 🤗 YuqianFu/LLaVA (model)
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
