BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi

TL;DR
BLIP introduces a flexible vision-language pre-training framework that leverages noisy web data through bootstrapping, achieving state-of-the-art results across understanding and generation tasks, and demonstrating strong zero-shot transfer to video-language tasks.
Contribution
BLIP presents a novel bootstrapping approach for VLP that effectively utilizes noisy web data and unifies vision-language understanding and generation tasks.
Findings
- State-of-the-art results on image-text retrieval, captioning, and VQA.
- Effective use of noisy web data via caption generation and filtering.
- Strong zero-shot transfer performance to video-language tasks.
Abstract
Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score).…
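The bootstrapping step described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `bootstrap_captions`, `toy_captioner`, and `toy_filter` are hypothetical names, images are stood in for by plain strings, and the filter is assumed to be applied to both the original web text and the synthetic caption.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (image, caption); images are plain strings in this toy


def bootstrap_captions(
    web_pairs: Iterable[Pair],
    captioner: Callable[[str], str],
    is_match: Callable[[str, str], bool],
) -> List[Pair]:
    """Return a cleaned dataset of image-text pairs.

    For each web pair, the captioner proposes a synthetic caption, and the
    filter keeps only the texts (web or synthetic) it judges to match the image.
    """
    cleaned: List[Pair] = []
    for image, web_text in web_pairs:
        candidates = [web_text, captioner(image)]
        for text in candidates:
            if is_match(image, text):
                cleaned.append((image, text))
    return cleaned


# Hypothetical stand-ins for the learned captioner and filter.
def toy_captioner(image: str) -> str:
    return f"a photo of a {image}"


def toy_filter(image: str, text: str) -> bool:
    # Assumption: a text "matches" if it mentions the image's content word.
    return image in text


web_pairs = [
    ("dog", "my dog at the park"),    # web text matches the image
    ("cake", "blog post from 2014"),  # noisy web text, unrelated to the image
]
cleaned = bootstrap_captions(web_pairs, toy_captioner, toy_filter)
```

In this toy run the noisy web text for the cake image is dropped, while its synthetic caption is kept. In BLIP itself, both the captioner and the filter are learned models rather than string rules.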
Code & Models
- 🤗 Salesforce/blip-image-captioning-large · model · 1.6M dl · ♡ 1461
- 🤗 Salesforce/blip-vqa-base · model · 597k dl · ♡ 189
- 🤗 AhmedSSabir/BERT-CNN-Visual-Semantic · model
- 🤗 Salesforce/blip-image-captioning-base · model · 3.0M dl · ♡ 846
- 🤗 Salesforce/blip-itm-base-coco · model · 139k dl · ♡ 28
- 🤗 Salesforce/blip-vqa-capfilt-large · model · 18k dl · ♡ 53
- 🤗 Salesforce/blip-itm-large-coco · model · 4.2k dl · ♡ 2
- 🤗 Salesforce/blip-itm-base-flickr · model · 289 dl · ♡ 2
- 🤗 Salesforce/blip-itm-large-flickr · model · 251 dl · ♡ 3
- 🤗 ybelkada/blip-image-captioning-base-football-finetuned · model · 38 dl · ♡ 2
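As a rough usage sketch, one of the captioning checkpoints listed above can be loaded through the Hugging Face `transformers` library. The example assumes `transformers`, `torch`, and `Pillow` are installed and that the checkpoint can be downloaded; `caption_image` is a hypothetical helper name, and `max_new_tokens=30` is an arbitrary choice.

```python
def caption_image(
    image_path: str,
    checkpoint: str = "Salesforce/blip-image-captioning-base",
) -> str:
    """Generate a caption for a local image with a BLIP captioning checkpoint."""
    # Imported lazily so that merely defining the helper does not require
    # the heavy dependencies to be installed.
    from PIL import Image
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(checkpoint)
    model = BlipForConditionalGeneration.from_pretrained(checkpoint)

    # BLIP expects RGB input; convert in case the file is grayscale or RGBA.
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    # Autoregressively decode caption tokens, then map them back to text.
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```

Calling `caption_image("photo.jpg")` (with `photo.jpg` standing in for any local image) downloads the checkpoint on first use and returns a short caption string.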
Videos
- One Model For All The Tasks - BLIP (Author Interview) · YouTube
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding&Generation · YouTube
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: BLIP: Bootstrapping Language-Image Pre-training
