Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey

Jiaxing Huang; Jingyi Zhang; Kai Jiang; Han Qiu; Shijian Lu

arXiv:2312.16602 · cs.CV · December 29, 2023 · 1 citation

TL;DR

This survey reviews the development, methodologies, datasets, and challenges of visual instruction tuning, a technique that enables large vision models to follow arbitrary language instructions for diverse tasks.

Contribution

It provides a comprehensive, systematic review of visual instruction tuning, categorizes existing methods, and discusses future research directions.

Findings

01. Visual instruction tuning (VIT) enables models to follow arbitrary instructions across tasks.

02. Existing VIT methods vary in architecture and training objectives.

03. Open challenges include dataset diversity and model generalization.

Abstract

Traditional computer vision generally solves each task independently with a dedicated model in which the task instruction is implicitly built into the model architecture. This gives rise to two limitations: (1) it leads to task-specific models, which require multiple models for different tasks and restrict the potential synergies among diverse tasks; (2) it leads to a pre-defined and fixed model interface that has limited interactivity and adaptability in following users' task instructions. To address them, Visual Instruction Tuning (VIT), which fine-tunes a large vision model with language as task instructions, has been intensively studied recently. It aims to learn, from a wide range of vision tasks described by language instructions, a general-purpose multimodal model that can follow arbitrary instructions and thus solve arbitrary tasks specified by the user. This work aims to provide a systematic…

Taxonomy

Topics: Multimodal Machine Learning Applications