X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers

Jaemin Cho; Jiasen Lu; Dustin Schwenk; Hannaneh Hajishirzi; Aniruddha Kembhavi

arXiv:2009.11278·cs.CV·September 24, 2020·24 cites

Open Access 2 Repos

TL;DR

X-LXMERT extends the LXMERT model with training refinements to enable it to generate images from text, achieving competitive image generation performance while maintaining strong question answering and captioning abilities.

Contribution

The paper introduces X-LXMERT, an extension of LXMERT whose training refinements let a multimodal transformer generate images from text, bridging discriminative and generative multimodal tasks.

Findings

01

X-LXMERT achieves image generation performance competitive with state-of-the-art generative models.

02

It retains question answering and captioning abilities comparable to those of LXMERT.

03

The same training refinements are also effective when applied to UNITER, yielding X-UNITER.

Abstract

Mirroring the success of masked language models, vision-and-language counterparts like ViLBERT, LXMERT and UNITER have achieved state of the art performance on a variety of multimodal discriminative tasks like visual question answering and visual grounding. Recent work has also successfully adapted such models towards the generative task of image captioning. This begs the question: Can these models go the other way and generate images from pieces of text? Our analysis of a popular representative from this model family - LXMERT - finds that it is unable to generate rich and semantically meaningful imagery with its current training setup. We introduce X-LXMERT, an extension to LXMERT with training refinements including: discretizing visual representations, using uniform masking with a large range of masking ratios and aligning the right pre-training datasets to the right objectives which…
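The two refinements named in the abstract, discretizing visual representations and uniform masking over a wide range of ratios, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the nearest-neighbour codebook lookup, and the default ratio range of 0.1 to 1.0 are assumptions chosen only for illustration.

```python
import random


def sample_uniform_mask(num_positions, min_ratio=0.1, max_ratio=1.0, rng=None):
    """Draw a masking ratio uniformly from [min_ratio, max_ratio] (assumed range),
    then mask that share of positions chosen at random."""
    rng = rng or random.Random()
    ratio = rng.uniform(min_ratio, max_ratio)
    # Mask at least one position, and never more than exist.
    num_masked = min(num_positions, max(1, round(ratio * num_positions)))
    masked = set(rng.sample(range(num_positions), num_masked))
    return [i in masked for i in range(num_positions)]


def nearest_code(feature, codebook):
    """Discretize one feature vector: index of the closest codebook entry
    under squared Euclidean distance (a hypothetical codebook)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(feature, codebook[k]))


def build_masked_example(features, codebook, rng=None):
    """Return (inputs, targets). Masked positions have input None and the
    discrete code as target; visible positions keep their feature, target None."""
    codes = [nearest_code(f, codebook) for f in features]
    mask = sample_uniform_mask(len(features), rng=rng)
    inputs = [None if m else f for f, m in zip(features, mask)]
    targets = [c if m else None for c, m in zip(codes, mask)]
    return inputs, targets
```

Because the ratio is redrawn for every example, a model trained this way sees both lightly masked and almost fully masked inputs; the latter is closer to the situation at generation time, when little or no visual content is given.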

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Learning Cross-Modality Encoder Representations from Transformers · UNiversal Image-TExt Representation Learning · Vision-and-Language BERT