BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Junnan Li; Dongxu Li; Caiming Xiong; Steven Hoi

arXiv:2201.12086 · cs.CV · February 16, 2022 · 864 cites

TL;DR

BLIP is a vision-language pre-training framework that transfers to both understanding and generation tasks. It makes use of noisy web data by bootstrapping captions: a captioner generates synthetic captions and a filter removes the noisy ones. It achieves state-of-the-art results on image-text retrieval, image captioning, and VQA, and shows strong zero-shot transfer to video-language tasks.

Contribution

BLIP presents a bootstrapping approach for vision-language pre-training (VLP) that makes effective use of noisy web data and unifies vision-language understanding and generation tasks in a single framework.

Findings

1. State-of-the-art results on image-text retrieval, captioning, and VQA.
2. Effective use of noisy web data via caption generation and filtering.
3. Strong zero-shot transfer performance to video-language tasks.

Abstract

Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score).…
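The caption bootstrapping step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `bootstrap_captions`, `toy_captioner`, `toy_match_score`, `TOY_TAGS`, and the `threshold` value are hypothetical names and values chosen for illustration, and the toy word-overlap score stands in for a learned image-text matching model. The sketch also assumes the filter is applied to both the original web captions and the synthetic ones.

```python
from typing import Callable, Dict, List, Set, Tuple

Pair = Tuple[str, str]  # (image_id, caption)


def bootstrap_captions(
    web_pairs: List[Pair],
    captioner: Callable[[str], str],
    match_score: Callable[[str, str], float],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> List[Pair]:
    """Keep web and synthetic captions whose match score passes the filter."""
    kept: List[Pair] = []
    for image_id, web_caption in web_pairs:
        # The captioner proposes one synthetic caption per image.
        synthetic_caption = captioner(image_id)
        # The filter scores both candidates and drops the noisy ones.
        for caption in (web_caption, synthetic_caption):
            if match_score(image_id, caption) >= threshold:
                kept.append((image_id, caption))
    return kept


# Toy stand-ins: each "image" is represented by a set of content tags.
TOY_TAGS: Dict[str, Set[str]] = {
    "img1": {"dog", "park"},
    "img2": {"cake", "table"},
}


def toy_captioner(image_id: str) -> str:
    """Stand-in for a captioning model: describe the image by its tags."""
    return "a photo of " + " and ".join(sorted(TOY_TAGS[image_id]))


def toy_match_score(image_id: str, caption: str) -> float:
    """Stand-in for an image-text matching model: fraction of tags mentioned."""
    tags = TOY_TAGS[image_id]
    words = set(caption.lower().split())
    return len(tags & words) / len(tags)


web_pairs = [
    ("img1", "a dog runs in the park"),      # relevant web caption
    ("img2", "click here for best deals"),   # noisy web caption
]
cleaned = bootstrap_captions(web_pairs, toy_captioner, toy_match_score)
for image_id, caption in cleaned:
    print(image_id, "|", caption)
```

In this toy run the noisy web caption for `img2` is dropped while its synthetic caption is kept, which is the intended effect of pairing a captioner with a filter.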

Code & Models

Datasets

lianghsun/pokemon-blip-captions-en-zh_tw
dataset · 38 dl

Videos

One Model For All The Tasks - BLIP (Author Interview)· youtube

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding&Generation· youtube

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling

Methods: BLIP: Bootstrapping Language-Image Pre-training