Improved Baselines with Visual Instruction Tuning

Haotian Liu; Chunyuan Li; Yuheng Li; Yong Jae Lee

arXiv:2310.03744 · cs.CV · May 17, 2024 · 64 cites

TL;DR

This paper shows that simple modifications to LLaVA, namely an MLP vision-language connector on CLIP-ViT-L-336px features and academic-task-oriented VQA data with response formatting prompts, yield state-of-the-art results across 11 benchmarks using about 1.2M publicly available training samples and roughly one day of training on a single 8-A100 node.

Contribution

The authors introduce simple, effective modifications to LLaVA that establish stronger, data-efficient baselines for visual instruction tuning and reach state-of-the-art results across 11 benchmarks.

Findings

01 Achieved state-of-the-art results across 11 benchmarks.

02 The final 13B checkpoint used only about 1.2M publicly available training samples.

03 Full training of the final 13B checkpoint finished in approximately 1 day on a single 8-A100 node.

Abstract

Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and model will be publicly available.
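
To make the connector idea concrete, the following is a minimal sketch, not the authors' implementation. The abstract only specifies "an MLP projection", so the two-layer shape, the GELU activation, the toy dimensions, and every function name here are illustrative assumptions. The 14-pixel patch size is likewise an assumption about the CLIP-ViT-L-336px encoder, used only to show how many visual tokens a 336px image would produce.

```python
import math
import random


def gelu(x):
    """Exact GELU activation computed with the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def make_linear(in_dim, out_dim, rng):
    """Create a (weight, bias) pair with small uniform random weights."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def linear(x, layer):
    """Apply y = W x + b to a single feature vector."""
    weight, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def mlp_project(patch_features, layer1, layer2):
    """Map each vision patch feature into the language model's embedding space.

    Assumed structure: linear -> GELU -> linear, applied per patch.
    """
    projected = []
    for feat in patch_features:
        hidden = [gelu(h) for h in linear(feat, layer1)]
        projected.append(linear(hidden, layer2))
    return projected


def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    return (image_size // patch_size) ** 2


rng = random.Random(0)

# Toy sizes so the sketch runs quickly; real encoder and LLM widths are much larger.
VISION_DIM, LLM_DIM = 8, 16

layer1 = make_linear(VISION_DIM, LLM_DIM, rng)
layer2 = make_linear(LLM_DIM, LLM_DIM, rng)

# Assumed patch size of 14 for a 336x336 input image.
n_patches = num_patches(336, 14)

# Stand-in for vision encoder outputs: one random feature vector per patch.
patches = [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)]
           for _ in range(n_patches)]

visual_tokens = mlp_project(patches, layer1, layer2)
print(len(visual_tokens), len(visual_tokens[0]))
```

Each projected vector plays the role of one visual token that would be placed alongside the text token embeddings fed to the language model. The connector is applied independently to every patch, so it adds very few parameters relative to the encoder and the language model.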

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Natural Language Processing Techniques