Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey
Jiaxing Huang, Jingyi Zhang, Kai Jiang, Han Qiu, Shijian Lu

TL;DR
This survey reviews the development, methodologies, datasets, and challenges of visual instruction tuning, a technique that enables large vision models to follow arbitrary language instructions for diverse tasks.
Contribution
It provides a comprehensive, systematic review of visual instruction tuning, categorizes existing methods, and discusses future research directions.
Findings
Visual instruction tuning (VIT) enables models to follow arbitrary instructions across tasks.
Existing VIT methods vary in architecture and training objectives.
Challenges include dataset diversity and model generalization.
Abstract
Traditional computer vision generally solves each task independently with a dedicated model, with the task instruction implicitly designed into the model architecture. This gives rise to two limitations: (1) it leads to task-specific models, which require multiple models for different tasks and restrict the potential synergies among diverse tasks; (2) it leads to a pre-defined and fixed model interface that has limited interactivity and adaptability in following users' task instructions. To address them, Visual Instruction Tuning (VIT) has been intensively studied recently. It finetunes a large vision model with language as task instructions, aiming to learn, from a wide range of vision tasks described by language instructions, a general-purpose multimodal model that can follow arbitrary instructions and thus solve arbitrary tasks specified by the user. This work aims to provide a systematic…
Taxonomy
Topics: Multimodal Machine Learning Applications
