Improved Baselines with Visual Instruction Tuning
Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee

TL;DR
This paper shows that simple modifications to LLaVA, namely an MLP vision-language connector on top of CLIP-ViT-L-336px and academic-task-oriented VQA data with response formatting prompts, significantly improve multimodal model performance across 11 benchmarks while using little data and training time.
Contribution
The authors introduce effective modifications to LLaVA, establishing stronger, data-efficient baselines that achieve state-of-the-art results in visual instruction tuning.
Findings
Achieves state-of-the-art results across 11 benchmarks.
The final 13B checkpoint uses only 1.2M publicly available data samples.
Full training finishes in approximately 1 day on a single 8-A100 node.
Abstract
Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and model will be publicly available.
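The two modifications named in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The following are assumptions made for illustration: a patch size of 14 for the 336px CLIP encoder, a two-layer MLP with a GELU activation as the projection, the toy dimensions, the random initialization, and the exact wording of the formatting prompt. All function and constant names are hypothetical.

```python
import math
import random


def num_patch_tokens(image_size: int = 336, patch_size: int = 14) -> int:
    """Number of patch tokens a ViT yields for a square image (patch size is an assumption)."""
    side = image_size // patch_size
    return side * side


def gelu(x: float) -> float:
    """Exact GELU, computed with the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def init_linear(rng: random.Random, in_dim: int, out_dim: int):
    """Small random weight matrix (out_dim rows of in_dim values) and a zero bias."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def linear(x, weight, bias):
    """Compute W x + b for a single feature vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def mlp_project(patch_features, layer1, layer2):
    """Project each vision feature vector to the language model's embedding width.

    Assumed structure: Linear -> GELU -> Linear, applied to every patch independently.
    """
    projected = []
    for feat in patch_features:
        hidden = [gelu(h) for h in linear(feat, *layer1)]
        projected.append(linear(hidden, *layer2))
    return projected


# Illustrative wording for a response formatting prompt (an assumption, not quoted from the text above).
FORMAT_PROMPT = "Answer the question using a single word or phrase."


def format_vqa_prompt(question: str) -> str:
    """Append the short-answer instruction to a VQA question."""
    return f"{question.strip()}\n{FORMAT_PROMPT}"


# Toy dimensions so the example runs instantly; real models use far larger widths.
VISION_DIM, LLM_DIM = 8, 16
rng = random.Random(0)
layer1 = init_linear(rng, VISION_DIM, LLM_DIM)
layer2 = init_linear(rng, LLM_DIM, LLM_DIM)

# Four fake patch features stand in for the vision encoder's output.
patches = [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)] for _ in range(4)]
visual_tokens = mlp_project(patches, layer1, layer2)

print(num_patch_tokens())                      # patch tokens for a 336px image at patch size 14
print(len(visual_tokens), len(visual_tokens[0]))
print(format_vqa_prompt("What color is the bus?"))
```

The projected vectors have the same width as the language model's token embeddings, so they can be placed in the input sequence alongside the text tokens. The formatting prompt only changes the question text, which is why it can be added to existing VQA datasets without relabeling.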
Code & Models
- 🤗 jingyaogong/minimind-3v-moe · model · 13 dl · ♡ 1
- 🤗 stabilityai/japanese-stable-vlm · model · 10 dl · ♡ 52
- 🤗 circulus/TinyHawk-v1 · model · 2 dl
- 🤗 llava-hf/llava-v1.6-mistral-7b-hf · model · 597k dl · ♡ 304
- 🤗 saurabh-straive/llava_100k_finetuned · model
- 🤗 Straive/llava-1.5-13b-lora-100k-8-mar · model
- 🤗 Intel/llava-gemma-2b · model · 6.1k dl · ♡ 48
- 🤗 llava-hf/llava-v1.6-vicuna-7b-hf · model · 23k dl · ♡ 30
- 🤗 llava-hf/llava-v1.6-34b-hf · model · 10k dl · ♡ 93
- 🤗 llava-hf/llava-v1.6-vicuna-13b-hf · model · 2.8k dl · ♡ 22
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Natural Language Processing Techniques
