VL-BERT: Pre-training of Generic Visual-Linguistic Representations
Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, Jifeng Dai

TL;DR
VL-BERT introduces a pre-trained Transformer model that integrates visual and linguistic features for various visual-linguistic tasks, achieving state-of-the-art results on benchmarks like VCR.
Contribution
The paper presents a novel pre-training approach for a unified visual-linguistic Transformer model, extending BERT to handle both image regions and text.
Findings
VL-BERT outperforms previous models on visual question answering.
Pre-training on large-scale datasets improves downstream task performance.
Achieved first place on VCR benchmark with a single model.
Abstract
We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short). VL-BERT adopts the simple yet powerful Transformer model as the backbone, and extends it to take both visual and linguistic embedded features as input. In it, each element of the input is either of a word from the input sentence, or a region-of-interest (RoI) from the input image. It is designed to fit for most of the visual-linguistic downstream tasks. To better exploit the generic representation, we pre-train VL-BERT on the massive-scale Conceptual Captions dataset, together with text-only corpus. Extensive empirical analysis demonstrates that the pre-training procedure can better align the visual-linguistic clues and benefit the downstream tasks, such as visual commonsense reasoning, visual question answering and referring expression comprehension.…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Pre-training of BERT-based Transformer architectures explained – language and vision!· youtube
Taxonomy
TopicsMultimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
MethodsLinear Layer · Visual-Linguistic BERT · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Byte Pair Encoding · Refunds@Expedia|||How do I get a full refund from Expedia?
