X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers
Jaemin Cho, Jiasen Lu, Dustin Schwenk, Hannaneh Hajishirzi, Aniruddha Kembhavi

TL;DR
X-LXMERT extends the LXMERT model with training refinements to enable it to generate images from text, achieving competitive image generation performance while maintaining strong question answering and captioning abilities.
Contribution
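The paper introduces X-LXMERT, an extension of the LXMERT multimodal transformer that can generate images from text. It does so through three training refinements: discretizing visual representations, uniform masking over a large range of masking ratios, and aligning pre-training datasets to the right objectives. The result bridges discriminative and generative multimodal tasks.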
Findings
X-LXMERT's image generation performance is competitive with state-of-the-art generative models.
It maintains comparable question answering and captioning abilities to LXMERT.
Training refinements are effective when applied to UNITER, creating X-UNITER.
Abstract
Mirroring the success of masked language models, vision-and-language counterparts like ViLBERT, LXMERT and UNITER have achieved state of the art performance on a variety of multimodal discriminative tasks like visual question answering and visual grounding. Recent work has also successfully adapted such models towards the generative task of image captioning. This begs the question: Can these models go the other way and generate images from pieces of text? Our analysis of a popular representative from this model family - LXMERT - finds that it is unable to generate rich and semantically meaningful imagery with its current training setup. We introduce X-LXMERT, an extension to LXMERT with training refinements including: discretizing visual representations, using uniform masking with a large range of masking ratios and aligning the right pre-training datasets to the right objectives which…
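One of the refinements named in the abstract, uniform masking with a large range of masking ratios, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sample_uniform_mask`, the default ratio range, and the flat list-of-positions layout are all assumptions made for illustration. The idea shown is only that the masking ratio itself is drawn from a uniform distribution, and then that many positions are masked at random.

```python
import random


def sample_uniform_mask(num_positions, min_ratio=0.0, max_ratio=1.0, rng=None):
    """Return a list of booleans, True where a position is masked.

    Hypothetical helper: the ratio bounds are illustrative defaults,
    not values taken from the paper.
    """
    if rng is None:
        rng = random.Random()
    # Step 1: draw the masking ratio uniformly from the allowed range.
    ratio = rng.uniform(min_ratio, max_ratio)
    # Step 2: convert the ratio into a count of positions to mask.
    num_masked = round(ratio * num_positions)
    # Step 3: choose that many distinct positions at random.
    masked = set(rng.sample(range(num_positions), num_masked))
    return [i in masked for i in range(num_positions)]


if __name__ == "__main__":
    # Example: an 8x8 grid of visual positions, flattened to 64 entries.
    mask = sample_uniform_mask(64, rng=random.Random(0))
    print(sum(mask), "of", len(mask), "positions masked")
```

Because the ratio varies from sample to sample, a model trained this way would see both lightly masked and almost fully masked inputs, which is the property the abstract's phrase "a large range of masking ratios" points to.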
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Learning Cross-Modality Encoder Representations from Transformers · UNiversal Image-TExt Representation Learning · Vision-and-Language BERT
