BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi

TL;DR
BLIP-2 introduces an efficient vision-language pre-training approach that leverages frozen image encoders and language models, significantly reducing training costs while achieving state-of-the-art results on multiple tasks.
Contribution
The paper presents a two-stage pre-training method that keeps the image encoder and the large language model frozen and trains only a lightweight Querying Transformer between them, enabling effective vision-language learning with far fewer trainable parameters.
Findings
- Outperforms Flamingo80B by 8.7% on zero-shot VQAv2
- Achieves state-of-the-art results on various vision-language tasks
- Demonstrates emerging zero-shot image-to-text generation capabilities
Abstract
The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable…
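The bridging idea in the abstract can be illustrated with a toy sketch: a small, fixed set of query vectors cross-attends over the outputs of a frozen image encoder, so the language model always receives the same number of vectors no matter how many image patches there are. This is a minimal sketch under loud assumptions, not the authors' implementation: the names `cross_attend` and `fake_image_features`, the dimensions, and the random values are all hypothetical, and it omits the learned projections, self-attention, feed-forward layers, and training objectives that a real Querying Transformer would have.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, features):
    """Single-head cross-attention with no learned projections (toy version).

    Each query scores every image feature, turns the scores into weights,
    and returns the weighted average of the features.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, all_weights = [], []
    for q in queries:
        scores = [scale * sum(qi * fi for qi, fi in zip(q, f)) for f in features]
        weights = softmax(scores)
        out = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


rng = random.Random(0)
DIM = 8          # hypothetical feature width
NUM_QUERIES = 4  # hypothetical number of query vectors

# Stand-in for the trainable queries (random here, learned in practice).
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]


def fake_image_features(num_patches):
    """Stand-in for frozen image-encoder outputs, one vector per patch."""
    return [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(num_patches)]


# Two "images" with different patch counts yield the same output size.
small_out, small_w = cross_attend(queries, fake_image_features(16))
large_out, large_w = cross_attend(queries, fake_image_features(64))
print(len(small_out), len(large_out))
```

The design point is that only the queries (and, in a real model, the small transformer around them) would need gradients; the image features are treated as fixed inputs.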
Code & Models
- 🤗 Salesforce/blip2-flan-t5-xxl (model) · 1.8k dl · ♡ 94
- 🤗 Salesforce/blip2-opt-2.7b (model) · 540k dl · ♡ 436
- 🤗 Salesforce/blip2-flan-t5-xl (model) · 73k dl · ♡ 91
- 🤗 Salesforce/blip2-opt-6.7b (model) · 34k dl · ♡ 80
- 🤗 Salesforce/blip2-opt-2.7b-coco (model) · 344k dl · ♡ 11
- 🤗 Salesforce/blip2-opt-6.7b-coco (model) · 625 dl · ♡ 34
- 🤗 Salesforce/blip2-flan-t5-xl-coco (model) · 542 dl · ♡ 16
- 🤗 memegpt/blip2_endpoint (model) · 6 dl · ♡ 4
- 🤗 zai-org/visualglm-6b (model) · 169 dl · ♡ 210
- 🤗 kpyu/video-blip-opt-2.7b-ego4d (model) · 691 dl · ♡ 20
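For readers who want to try one of the listed checkpoints, the following is a minimal captioning sketch, assuming the Hugging Face `transformers`, `torch`, and `Pillow` packages are installed and that the chosen checkpoint can be downloaded. The helper name `caption_image` and the `max_new_tokens` value are illustrative choices, not something this page specifies; the heavy imports sit inside the function so that defining it does not require those packages.

```python
MODEL_ID = "Salesforce/blip2-opt-2.7b"  # one of the checkpoints listed above


def caption_image(image_path, model_id=MODEL_ID, max_new_tokens=20):
    """Generate a caption for a local image file (hypothetical helper).

    Downloads the checkpoint on first use, which takes several GB of disk.
    """
    # Imported lazily: these are third-party dependencies.
    from PIL import Image
    from transformers import Blip2ForConditionalGeneration, Blip2Processor

    processor = Blip2Processor.from_pretrained(model_id)
    model = Blip2ForConditionalGeneration.from_pretrained(model_id)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    # With no text prompt, the model produces a free-form caption.
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return text.strip()
```

A call such as `caption_image("photo.jpg")` would return a short string; the file name here is a placeholder. Running on CPU in full precision is slow for models of this size, so a GPU and reduced precision are worth considering in practice.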
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques
Methods: Attention Is All You Need · Linear Layer · Softmax · Absolute Position Encodings · Byte Pair Encoding · Adam · Layer Normalization · Label Smoothing · Multi-Head Attention · Dense Connections
