VL-BERT: Pre-training of Generic Visual-Linguistic Representations

Weijie Su; Xizhou Zhu; Yue Cao; Bin Li; Lewei Lu; Furu Wei; Jifeng Dai

arXiv:1908.08530 · cs.CV · February 19, 2020 · 784 cites

TL;DR

VL-BERT introduces a pre-trained Transformer model that integrates visual and linguistic features for various visual-linguistic tasks, achieving state-of-the-art results on benchmarks like VCR.

Contribution

The paper presents a novel pre-training approach for a unified visual-linguistic Transformer model, extending BERT to handle both image regions and text.

Findings

01. VL-BERT outperforms previous models on visual question answering.

02. Pre-training on the large-scale Conceptual Captions dataset, together with a text-only corpus, improves downstream task performance.

03. VL-BERT achieved first place among single models on the VCR benchmark leaderboard.

Abstract

We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short). VL-BERT adopts the simple yet powerful Transformer model as its backbone and extends it to take both visual and linguistic embedded features as input. Each element of the input is either a word from the input sentence or a region-of-interest (RoI) from the input image. The model is designed to fit most visual-linguistic downstream tasks. To better exploit the generic representation, we pre-train VL-BERT on the massive-scale Conceptual Captions dataset, together with a text-only corpus. Extensive empirical analysis demonstrates that the pre-training procedure can better align the visual-linguistic clues and benefit downstream tasks such as visual commonsense reasoning, visual question answering, and referring expression comprehension.…
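
The input format described in the abstract, where each element is either a word or an RoI, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `InputElement` and `build_input_sequence`, the `(x1, y1, x2, y2)` box format, and the choice to place words before RoIs are all assumptions made for illustration. Special tokens and embedding layers are omitted.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical box format (x1, y1, x2, y2); an assumption for this sketch.
Box = Tuple[float, float, float, float]


@dataclass
class InputElement:
    """One element of the joint input: either a word or an image region."""
    kind: str                    # "word" or "roi"
    token: Optional[str] = None  # set only for word elements
    box: Optional[Box] = None    # set only for RoI elements


def build_input_sequence(words: List[str], rois: List[Box]) -> List[InputElement]:
    """Concatenate word elements and RoI elements into a single sequence.

    Ordering words before RoIs is an assumption of this sketch.
    """
    sequence = [InputElement(kind="word", token=w) for w in words]
    sequence += [InputElement(kind="roi", box=b) for b in rois]
    return sequence


# Toy example: three words and two image regions.
sequence = build_input_sequence(
    ["a", "dog", "runs"],
    [(0.0, 0.0, 50.0, 40.0), (10.0, 5.0, 30.0, 25.0)],
)
kinds = [element.kind for element in sequence]
```

In a full model, each element would then be mapped to an embedded feature vector before entering the Transformer; that step is left out here because the text above does not specify it.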

Code & Models

Videos

Pre-training of BERT-based Transformer architectures explained – language and vision! · YouTube

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling

Methods: Linear Layer · Visual-Linguistic BERT · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Byte Pair Encoding