Unified Vision-Language Pre-Training for Image Captioning and VQA
Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, Jianfeng Gao

TL;DR
This paper introduces a unified vision-language pre-training model that uses a shared transformer for both understanding and generation tasks, achieving state-of-the-art results on image captioning and VQA benchmarks.
Contribution
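It proposes a unified model in which a single shared transformer performs both encoding and decoding for vision-language tasks, pre-trained with bidirectional and seq2seq masked prediction objectives. The authors present it as a first of its kind, to the best of their knowledge.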
Findings
Achieves state-of-the-art results on COCO, Flickr30k, and VQA 2.0 datasets.
Unifies vision-language understanding and generation in a single model.
Demonstrates effectiveness of shared transformer architecture for diverse tasks.
Abstract
This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models. The unified VLP model is pre-trained on a large amount of image-text pairs using the unsupervised learning objectives of two tasks: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction. The two tasks differ solely in what context the prediction conditions on. This is controlled by utilizing specific self-attention masks for the shared transformer network. To the best of our knowledge, VLP is the first reported model…
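The abstract says the two pre-training objectives differ only in the context each prediction may condition on, and that this is controlled by the self-attention mask. The sketch below illustrates one plausible way to build such masks. It is not the authors' code: the function name `build_attention_mask`, the layout (image regions first, then text tokens), and the omission of special tokens are assumptions made for illustration.

```python
def build_attention_mask(num_regions, num_tokens, mode):
    """Return an n x n matrix where mask[i][j] is True if position i may attend to j.

    Assumed layout (hypothetical): positions 0..num_regions-1 are image
    regions, the remaining num_tokens positions are text tokens.
    """
    if mode not in ("bidirectional", "seq2seq"):
        raise ValueError("mode must be 'bidirectional' or 'seq2seq'")

    n = num_regions + num_tokens
    # Bidirectional objective: every position sees every other position.
    mask = [[True] * n for _ in range(n)]

    if mode == "seq2seq":
        for i in range(n):
            for j in range(num_regions, n):  # j ranges over text columns
                # Image regions never look at text, and a text token
                # never looks at text tokens to its right (the future).
                if i < num_regions or j > i:
                    mask[i][j] = False
    return mask


if __name__ == "__main__":
    for mode in ("bidirectional", "seq2seq"):
        print(mode)
        for row in build_attention_mask(2, 3, mode):
            print("".join("1" if allowed else "0" for allowed in row))
```

Under these assumptions, the same transformer weights serve both objectives. Only the mask passed to self-attention changes, which is what lets a single network act as an encoder (full context) or as a decoder (image plus left-to-right text context).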
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: Linear Layer · Unified VLP · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam
