LXMERT: Learning Cross-Modality Encoder Representations from Transformers

Hao Tan; Mohit Bansal

arXiv:1908.07490 · cs.CL · December 5, 2019 · 223 cites

TL;DR

LXMERT introduces a large-scale Transformer-based framework that learns cross-modal representations from image and text data, achieving state-of-the-art results in visual question answering and reasoning tasks through extensive pre-training and fine-tuning.

Contribution

The paper presents a novel multi-encoder Transformer model with diverse pre-training tasks for vision-and-language understanding, advancing cross-modal reasoning capabilities.

Findings

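01. Achieves state-of-the-art results on the VQA and GQA datasets.

02. Improves the previous best result on NLVR2 by 22% absolute (54% to 76%).

03. Shows through ablation studies that both the model components and the pre-training strategies contribute to the results.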

Abstract

Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly, the alignment and relationships between these two modalities. We thus propose the LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework to learn these vision-and-language connections. In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder. Next, to endow our model with the capability of connecting vision and language semantics, we pre-train the model with large amounts of image-and-sentence pairs, via five diverse representative pre-training tasks: masked language modeling, masked object prediction (feature regression and label classification), cross-modality matching, and image question answering. These tasks help in learning…
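
To make the role of the cross-modality encoder more concrete, here is a minimal sketch of cross-attention between the two modalities. It assumes that information is exchanged by letting one modality supply the queries while the other supplies the keys and values. Everything else is simplified away: there are no learned projections, no multiple heads, no self-attention or feed-forward sub-layers, and no residual connections. The function names and the toy feature vectors are hypothetical and are not taken from the authors' code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Each query vector attends over all context vectors.

    Simplification (assumption): keys and values are the raw context
    vectors; a real Transformer layer would apply learned projections.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score between the query and every context vector.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context
        ]
        weights = softmax(scores)
        # Weighted average of the context vectors.
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, context)) for i in range(dim)]
        )
    return outputs


# Toy features (hypothetical): 2 word vectors and 3 object vectors, dimension 2.
language_feats = [[1.0, 0.0], [0.0, 1.0]]
object_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Language attends to vision, and vision attends to language.
lang_out = cross_attention(language_feats, object_feats)
vis_out = cross_attention(object_feats, language_feats)
```

Each output keeps the length of its own sequence (two word vectors, three object vectors) but is built entirely from features of the other modality, which is the sense in which such a layer aligns the two.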

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling

Methods: Linear Layer · Learning Cross-Modality Encoder Representations from Transformers · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam