Visual Instruction Tuning

Haotian Liu; Chunyuan Li; Qingyang Wu; Yong Jae Lee

arXiv:2304.08485 · cs.CV · December 14, 2023 · 673 cites

TL;DR

This paper introduces LLaVA, a large multimodal model trained on GPT-4-generated visual instruction data. It reaches an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset and reports 92.53% accuracy on Science QA.

Contribution

The paper presents the first attempt to use language-only GPT-4 to generate multimodal instruction-following data, which is then used to train a large vision-language model, LLaVA.

Findings

01

LLaVA exhibits impressive multimodal chat abilities.

02

Achieves an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset.

03

When fine-tuned on Science QA, the combination of LLaVA and GPT-4 sets a new state-of-the-art accuracy of 92.53%.

Abstract

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA,…
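
The "relative score compared with GPT-4" mentioned in the abstract can be illustrated with a small sketch. It assumes, as a simplification not spelled out on this page, that a judge assigns each answer a numeric rating and that the relative score is the candidate's total rating divided by the reference model's total, expressed as a percentage. The function name `relative_score` and the ratings below are hypothetical and chosen only for illustration; they are not data from the paper.

```python
def relative_score(candidate_ratings, reference_ratings):
    """Return the candidate's total rating as a percentage of the reference's.

    Assumption: both lists hold judge ratings for the same questions,
    in the same order (for example on a 1-10 scale).
    """
    if len(candidate_ratings) != len(reference_ratings):
        raise ValueError("both models must be rated on the same questions")
    reference_total = sum(reference_ratings)
    if reference_total == 0:
        raise ValueError("reference ratings must not sum to zero")
    return 100.0 * sum(candidate_ratings) / reference_total


# Hypothetical ratings for four questions (illustrative only).
candidate = [8, 7, 9, 6]   # ratings given to the candidate model's answers
reference = [9, 8, 9, 8]   # ratings given to the reference model's answers

score = relative_score(candidate, reference)
print(f"relative score: {score:.1f}%")
```

Under this reading, a score of 85.1% would mean the candidate's answers collected about 85% of the rating total that the reference answers received on the same questions.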

Code & Models

Videos

Visual Instruction Tuning · slideslive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Residual Connection · Dense Connections · Dropout · Byte Pair Encoding · Softmax · Layer Normalization