Visual Instruction Tuning
Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee

TL;DR
This paper introduces LLaVA, a large multimodal model trained on visual instruction data generated by GPT-4. It shows strong multimodal chat ability and, after fine-tuning, state-of-the-art accuracy on Science QA.
Contribution
The paper presents the first attempt to use language-only GPT-4 to generate multimodal instruction-following data, and uses that data to train a large vision-language model, LLaVA.
Findings
LLaVA exhibits impressive multimodal chat abilities.
Achieves an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset.
Sets a new state-of-the-art accuracy of 92.53% on Science QA.
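For readers unfamiliar with the "relative score" figure, here is a minimal sketch of one way such a number can be computed. This page does not define the metric, so the formula is an assumption: the candidate model's summed judge ratings divided by the reference model's summed ratings. The function name `relative_score` and the ratings are hypothetical and chosen only for illustration.

```python
def relative_score(candidate_scores, reference_scores):
    """Return the candidate's total rating as a percentage of the reference's.

    Assumption: each question receives one rating per model from a judge,
    and the relative score is the ratio of the two totals.
    """
    if len(candidate_scores) != len(reference_scores):
        raise ValueError("need one score pair per question")
    total_reference = sum(reference_scores)
    if total_reference == 0:
        raise ValueError("reference scores sum to zero")
    return 100.0 * sum(candidate_scores) / total_reference


# Made-up ratings on a 1-10 scale, not results from the paper.
candidate = [8, 7, 9]
reference = [9, 9, 10]
print(round(relative_score(candidate, reference), 1))
```

A ratio of totals weights every question equally; averaging per-question ratios would be a different design and could give a different number.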
Abstract
Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA,…
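The abstract says the model "connects a vision encoder and LLM" without giving details here. Below is a toy sketch of one common way to do this: a learned linear projection maps visual features into the LLM's embedding space, and the projected visual tokens are prepended to the text token embeddings. The connector design, the dimensions, and the names `project` and `build_llm_input` are all assumptions for illustration, not the authors' implementation.

```python
import random


def project(features, weight):
    """Map each visual feature vector into the LLM embedding space.

    Computes features @ weight with plain lists; weight has shape
    (vision_dim, llm_dim).
    """
    return [
        [sum(f * w for f, w in zip(row, col)) for col in zip(*weight)]
        for row in features
    ]


def build_llm_input(image_features, text_embeddings, weight):
    """Prepend projected visual tokens to the text token embeddings."""
    return project(image_features, weight) + text_embeddings


random.seed(0)
VISION_DIM, LLM_DIM = 4, 6  # toy sizes, far smaller than real models

# Hypothetical stand-ins for encoder outputs and learned parameters.
weight = [[random.gauss(0, 0.02) for _ in range(LLM_DIM)] for _ in range(VISION_DIM)]
image_features = [[random.random() for _ in range(VISION_DIM)] for _ in range(3)]
text_embeddings = [[random.random() for _ in range(LLM_DIM)] for _ in range(5)]

sequence = build_llm_input(image_features, text_embeddings, weight)
print(len(sequence), len(sequence[0]))
```

In this sketch the LLM would then process `sequence` like any other sequence of token embeddings, which is what lets the whole pipeline be trained end to end.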
Code & Models
- 🤗 jingyaogong/minimind-3v-moe (model) · 13 dl · ♡ 1
- 🤗 mart9992/eri2 (model)
- 🤗 mart9992/nervn (model) · ♡ 5
- 🤗 mart9992/vierundvi (model)
- 🤗 saurabh-straive/llava_100k_finetuned (model)
- 🤗 Straive/llava-1.5-13b-lora-100k-8-mar (model)
- 🤗 saurabh-straive/llava-1-5 (model)
- 🤗 GDinesh/llava-1-5 (model)
- 🤗 qresearch/llama-3-vision-alpha (model) · ♡ 62
- 🤗 qresearch/llama-3-vision-alpha-hf (model) · 12 dl · ♡ 56
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Residual Connection · Dense Connections · Dropout · Byte Pair Encoding · Softmax · Layer Normalization
