BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Junnan Li; Dongxu Li; Silvio Savarese; Steven Hoi

arXiv:2301.12597 · cs.CV · June 16, 2023 · 911 citations

TL;DR

BLIP-2 introduces an efficient vision-language pre-training approach that leverages frozen image encoders and language models, significantly reducing training costs while achieving state-of-the-art results on multiple tasks.

Contribution

The paper presents a two-stage pre-training method that keeps the image encoder and the language model frozen and trains only a lightweight Querying Transformer between them, enabling effective vision-language learning with far fewer trainable parameters.

Findings

01. Outperforms Flamingo80B by 8.7% on zero-shot VQAv2
02. Achieves state-of-the-art results on various vision-language tasks
03. Demonstrates emerging zero-shot image-to-text generation capabilities

Abstract

The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable…
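The bridging idea in the abstract can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it assumes the Querying Transformer holds a small set of learnable query vectors that cross-attend to the frozen image encoder's output, and that a linear projection maps the result to the frozen language model's input width. All sizes, the single attention head, and the names `cross_attend`, `project`, and `soft_prompt` are hypothetical, chosen only for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, image_feats):
    """Single-head cross-attention: each query pools the image features."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score between this query and every image feature.
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim)
                  for f in image_feats]
        weights = softmax(scores)
        # Weighted average of the image features (values equal keys here).
        pooled = [sum(w * f[i] for w, f in zip(weights, image_feats))
                  for i in range(dim)]
        outputs.append(pooled)
    return outputs


def project(vectors, weight):
    """Linear map from the bridge width to the language model's input width."""
    out_dim = len(weight[0])
    return [[sum(v[i] * weight[i][j] for i in range(len(v)))
             for j in range(out_dim)]
            for v in vectors]


rng = random.Random(0)
NUM_PATCHES, NUM_QUERIES, DIM, LLM_DIM = 16, 4, 8, 12  # illustrative sizes only

# Stand-in for the frozen image encoder's output; it is never updated.
image_feats = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]

# The only trainable pieces in this sketch: the queries and the projection.
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]
proj_weight = [[rng.gauss(0, 0.1) for _ in range(LLM_DIM)] for _ in range(DIM)]

visual_summary = cross_attend(queries, image_feats)
soft_prompt = project(visual_summary, proj_weight)

print(len(soft_prompt), len(soft_prompt[0]))  # 4 12
```

However many image features the encoder emits, the language model in this sketch receives a fixed number of vectors (one per query), which is one plausible reading of why such a bridge can stay lightweight while both large models remain frozen.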

Code & Models

Datasets

py-img-gen/ukiyo-e-face-blip2-captions
dataset · 33 downloads

Videos

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques

Methods: Attention Is All You Need · Linear Layer · Softmax · Absolute Position Encodings · Byte Pair Encoding · Adam · Layer Normalization · Label Smoothing · Multi-Head Attention · Dense Connections