InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi

TL;DR
InstructBLIP is a vision-language model built by instruction-tuning pretrained BLIP-2 models on a diverse collection of datasets. It achieves state-of-the-art zero-shot performance on held-out datasets and state-of-the-art results when fine-tuned on individual downstream tasks.
Contribution
This work systematically studies vision-language instruction tuning, introduces an instruction-aware Query Transformer, and demonstrates significant performance improvements over existing models.
Findings
State-of-the-art zero-shot performance across all 13 held-out datasets
90.7% accuracy on ScienceQA questions with image contexts when fine-tuned on the downstream task
Substantially outperforms BLIP-2 and the larger Flamingo models in zero-shot evaluation
Abstract
Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide variety of tasks and capabilities, and transform them into instruction tuning format. Additionally, we introduce an instruction-aware Query Transformer, which extracts informative features tailored to the given instruction. Trained on 13 held-in datasets, InstructBLIP…
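The abstract's step of transforming existing datasets into instruction tuning format can be illustrated with a small sketch. It assumes a VQA-style record with `image`, `question`, and `answer` fields. The templates, field names, and the `to_instruction_sample` helper are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import random

# Hypothetical instruction templates for a VQA-style task.
# These are illustrative and not copied from the paper.
VQA_TEMPLATES = [
    "Question: {question} Short answer:",
    "Given the image, answer the following question with a short phrase. {question}",
    "Use the provided image to answer the question: {question}",
]


def to_instruction_sample(sample, templates, rng):
    """Turn a raw record into an (image, instruction, target) training example.

    A template is drawn at random so that the same task is phrased in
    several ways across the training set.
    """
    template = rng.choice(templates)
    return {
        "image": sample["image"],
        "instruction": template.format(question=sample["question"]),
        "target": sample["answer"],
    }


# Assumed record layout; real datasets will differ.
raw = {"image": "example.jpg", "question": "What color is the bus?", "answer": "red"}
formatted = to_instruction_sample(raw, VQA_TEMPLATES, random.Random(0))
print(formatted["instruction"])
```

The image and target pass through unchanged; only the text prompt is rewritten, which is what lets many differently structured datasets share one training format.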
Code & Models
- 🤗 Salesforce/instructblip-vicuna-7b (model) · 12k downloads · 99 likes
- 🤗 Salesforce/instructblip-flan-t5-xl (model) · 9.6k downloads · 30 likes
- 🤗 Salesforce/instructblip-flan-t5-xxl (model) · 289 downloads · 21 likes
- 🤗 Salesforce/instructblip-vicuna-13b (model) · 173 downloads · 43 likes
- 🤗 Mediocreatmybest/instructblip-flan-t5-xl_8bit (model) · 7 downloads · 1 like
- 🤗 stabilityai/japanese-instructblip-alpha (model) · 43 downloads · 53 likes
- 🤗 Mediocreatmybest/instructblip-flan-t5-xxl_8bit_nf4 (model) · 7 downloads · 1 like
- 🤗 Mediocreatmybest/instructblip-flan-t5-xl_8bit_nf4 (model) · 8 downloads
- 🤗 benferns/instructblip-flan-t5-xl_8bit_nf4 (model) · 2 downloads
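As a rough guide to using the checkpoints listed above, the following sketch loads one with the Hugging Face `transformers` classes `InstructBlipProcessor` and `InstructBlipForConditionalGeneration`. It assumes `transformers`, `torch`, and `Pillow` are installed and that the machine has enough memory for the chosen checkpoint. The `resolve_checkpoint` and `describe_image` helpers, the short aliases, and the default generation length are hypothetical conveniences, not part of any library.

```python
# Short aliases for the Salesforce checkpoints listed above (aliases are hypothetical).
CHECKPOINTS = {
    "vicuna-7b": "Salesforce/instructblip-vicuna-7b",
    "vicuna-13b": "Salesforce/instructblip-vicuna-13b",
    "flan-t5-xl": "Salesforce/instructblip-flan-t5-xl",
    "flan-t5-xxl": "Salesforce/instructblip-flan-t5-xxl",
}


def resolve_checkpoint(alias):
    """Map a short alias to its Hugging Face Hub repository id."""
    if alias not in CHECKPOINTS:
        raise KeyError(f"unknown alias {alias!r}; choose from {sorted(CHECKPOINTS)}")
    return CHECKPOINTS[alias]


def describe_image(image_path, prompt, alias="flan-t5-xl", max_new_tokens=64):
    """Run one instruction prompt against one image and return the decoded text."""
    # Heavy dependencies are imported here so the helpers above work without them.
    import torch
    from PIL import Image
    from transformers import (
        InstructBlipForConditionalGeneration,
        InstructBlipProcessor,
    )

    checkpoint = resolve_checkpoint(alias)
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Downloads the weights on first use; this can take many gigabytes.
    processor = InstructBlipProcessor.from_pretrained(checkpoint)
    model = InstructBlipForConditionalGeneration.from_pretrained(checkpoint).to(device)
    model.eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)

    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
```

A call such as `describe_image("photo.jpg", "What is unusual about this image?")` would return the model's answer as a string, where `photo.jpg` stands in for any local image file. The community 8-bit and NF4 repositories in the list may need additional quantization libraries and are not covered by this sketch.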
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
Methods: Multi-Head Attention · Attention Is All You Need · Dense Connections · Dropout · Byte Pair Encoding · Softmax · Layer Normalization · Position-Wise Feed-Forward Layer · Linear Layer · Absolute Position Encodings
