VL-BEiT: Generative Vision-Language Pretraining
Hangbo Bao, Wenhui Wang, Li Dong, Furu Wei

TL;DR
VL-BEiT is a unified vision-language Transformer pretraining approach that applies masked prediction to images, texts, and image-text pairs. It achieves strong results across multiple vision-language benchmarks and learns visual features that transfer to vision-only tasks.
Contribution
The paper introduces a simple, unified generative pretraining framework for vision-language models: one masked-prediction task, one shared Transformer backbone, and one-stage training from scratch.
Findings
Strong results on vision-language benchmarks, including visual question answering (VQA), visual reasoning, and image-text retrieval.
Transferable visual features, with competitive performance on image classification and segmentation.
Effective one-stage training in which a single shared model handles multiple tasks.
Abstract
We introduce a vision-language foundation model called VL-BEiT, which is a bidirectional multimodal Transformer learned by generative pretraining. Our minimalist solution conducts masked prediction on both monomodal and multimodal data with a shared Transformer. Specifically, we perform masked vision-language modeling on image-text pairs, masked language modeling on texts, and masked image modeling on images. VL-BEiT is learned from scratch with one unified pretraining task, one shared backbone, and one-stage training. Our method is conceptually simple and empirically effective. Experimental results show that VL-BEiT obtains strong results on various vision-language benchmarks, such as visual question answering, visual reasoning, and image-text retrieval. Moreover, our method learns transferable visual features, achieving competitive performance on image classification, and semantic…
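As a rough illustration of the abstract's "one unified pretraining task" idea, the following Python sketch routes text-only, image-only, and image-text inputs through the same masking routine. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (mask_tokens, build_example), the task labels, the 15% default mask ratio, the string stand-ins for visual tokens, and the choice to mask uniformly over the concatenated image-text sequence are all hypothetical and are not taken from the paper.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder symbol for a masked position


def mask_tokens(tokens, ratio, rng):
    """Mask a random subset of positions; return (corrupted, targets)."""
    n_mask = max(1, round(len(tokens) * ratio))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    corrupted = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    # targets maps each masked position to the token the model must recover
    targets = {i: tokens[i] for i in positions}
    return corrupted, targets


def build_example(text_tokens=None, image_tokens=None, ratio=0.15, seed=0):
    """Build one masked-prediction example for any of the three data types."""
    rng = random.Random(seed)
    if image_tokens and text_tokens:
        # Image-text pair: assumed here to be one concatenated sequence.
        task = "masked_vision_language_modeling"
        sequence = list(image_tokens) + list(text_tokens)
    elif text_tokens:
        task = "masked_language_modeling"
        sequence = list(text_tokens)
    elif image_tokens:
        task = "masked_image_modeling"
        sequence = list(image_tokens)
    else:
        raise ValueError("need text tokens, image tokens, or both")
    corrupted, targets = mask_tokens(sequence, ratio, rng)
    return {"task": task, "input": corrupted, "targets": targets}


if __name__ == "__main__":
    text = ["a", "dog", "runs", "fast"]
    image = ["v1", "v2", "v3", "v4", "v5", "v6"]  # stand-ins for visual tokens
    for kwargs in (
        {"text_tokens": text},
        {"image_tokens": image},
        {"text_tokens": text, "image_tokens": image},
    ):
        example = build_example(**kwargs)
        print(example["task"], example["input"], example["targets"])
```

In this sketch all three data types share one corruption routine and one output format, which is why a single shared backbone could in principle consume them interchangeably. A real system would feed the corrupted sequence to the Transformer and compute a prediction loss only at the masked positions.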
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Softmax · Dense Connections · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Multi-Head Attention · Absolute Position Encodings · Dropout
