Multimodal Foundation Models: From Specialists to General-Purpose Assistants

Chunyuan Li; Zhe Gan; Zhengyuan Yang; Jianwei Yang; Linjie Li; Lijuan Wang; Jianfeng Gao

arXiv:2309.10020·cs.CV·September 20, 2023·24 cites

Open Access · 1 Repo · 10 Models

TL;DR

This survey reviews the evolution of multimodal foundation models from specialized systems to versatile, general-purpose assistants, highlighting recent advances in unified models, multimodal LLMs, and tool chaining.

Contribution

It provides a comprehensive taxonomy and analysis of both established and emerging research areas in multimodal foundation models, emphasizing their transition to general-purpose AI assistants.

Findings

01

Survey of well-established multimodal models for visual understanding and text-to-image generation.

02

Analysis of recent advances in unified vision models inspired by large language models.

03

Discussion of end-to-end training of multimodal LLMs and chaining multimodal tools.

Abstract

This paper presents a comprehensive survey of the taxonomy and evolution of multimodal foundation models that demonstrate vision and vision-language capabilities, focusing on the transition from specialist models to general-purpose assistants. The research landscape encompasses five core topics, categorized into two classes. (i) We start with a survey of well-established research areas: multimodal foundation models pre-trained for specific purposes, including two topics -- methods of learning vision backbones for visual understanding and text-to-image generation. (ii) Then, we present recent advances in exploratory, open research areas: multimodal foundation models that aim to play the role of general-purpose assistants, including three topics -- unified vision models inspired by large language models (LLMs), end-to-end training of multimodal LLMs, and chaining multimodal tools with…

Code & Models

Repositories

computer-vision-in-the-wild/cvinw_readings
none · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques