InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Wenliang Dai; Junnan Li; Dongxu Li; Anthony Meng Huat Tiong; Junqi Zhao; Weisheng Wang; Boyang Li; Pascale Fung; Steven Hoi

arXiv:2305.06500 · cs.CV · June 16, 2023 · 403 citations

TL;DR

InstructBLIP is a vision-language model built by instruction tuning pretrained BLIP-2 models on a diverse set of datasets. It reaches state-of-the-art zero-shot performance on held-out datasets and state-of-the-art results when fine-tuned on individual downstream tasks.

Contribution

This work systematically studies vision-language instruction tuning on pretrained BLIP-2 models, introduces an instruction-aware Query Transformer that extracts visual features tailored to the given instruction, and reports improvements over BLIP-2 and Flamingo.

Findings

01 State-of-the-art zero-shot performance on all 13 held-out datasets

02 State-of-the-art accuracy when fine-tuned on downstream tasks, e.g. 90.7% on ScienceQA questions with image contexts

03 Models outperform BLIP-2 and the larger Flamingo models in zero-shot evaluation

Abstract

Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide variety of tasks and capabilities, and transform them into instruction tuning format. Additionally, we introduce an instruction-aware Query Transformer, which extracts informative features tailored to the given instruction. Trained on 13 held-in datasets, InstructBLIP…
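As a rough illustration of what "transform them into instruction tuning format" could look like, the sketch below wraps a raw visual question answering record in a natural-language instruction drawn from a small template pool. The template wording, the record fields (`image`, `question`, `answer`) and the helper name `to_instruction_sample` are assumptions made for this example; they are not taken from the paper or its released code.

```python
import random

# Hypothetical instruction templates for a VQA-style dataset. The wording is
# illustrative only and is not copied from the paper.
VQA_TEMPLATES = [
    "{question}",
    "Question: {question} Short answer:",
    "Given the image, answer the following question briefly: {question}",
]


def to_instruction_sample(record, templates, rng):
    """Turn one raw (image, question, answer) record into an instruction-tuning sample."""
    # Pick one template at random so the model sees varied phrasings of the same task.
    template = rng.choice(templates)
    return {
        "image": record["image"],
        "instruction": template.format(question=record["question"]),
        "target": record["answer"],
    }


# Seeded generator so the example is reproducible.
rng = random.Random(0)
raw = {"image": "example.jpg", "question": "What color is the bus?", "answer": "red"}
sample = to_instruction_sample(raw, VQA_TEMPLATES, rng)
print(sample["instruction"])
print(sample["target"])
```

Sampling among several templates per task is one plausible way to keep a model from overfitting to a single phrasing; the instruction string produced here is the kind of text an instruction-aware feature extractor would receive alongside the image.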

Code & Models

Videos

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling

Methods: Multi-Head Attention · Attention Is All You Need · Dense Connections · Dropout · Byte Pair Encoding · Softmax · Layer Normalization · Position-Wise Feed-Forward Layer · Linear Layer · Absolute Position Encodings