UNITER: UNiversal Image-TExt Representation Learning

Yen-Chun Chen; Linjie Li; Licheng Yu; Ahmed El Kholy; Faisal Ahmed; Zhe Gan; Yu Cheng; Jingjing Liu

arXiv:1909.11740·cs.CV·July 21, 2020·184 cites

TL;DR

UNITER introduces a large-scale pre-trained model for joint image-text understanding, achieving state-of-the-art results across multiple vision-and-language tasks by employing novel conditional masking and fine-grained alignment techniques.

Contribution

The paper presents UNITER, a universal model for image-text representation learning with innovative pre-training tasks and alignment methods, significantly improving performance on diverse V+L tasks.

Findings

01. UNITER achieves new state-of-the-art results on six V+L tasks.

02. Conditional masking and Optimal Transport (OT)-based Word-Region Alignment (WRA) improve pre-training effectiveness.

03. Extensive ablations identify optimal pre-training task combinations.

Abstract

Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodality inputs are simultaneously processed for joint visual and textual understanding. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions), which can power heterogeneous downstream V+L tasks with joint multimodal embeddings. We design four pre-training tasks: Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA). Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text). In addition to ITM for…
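The conditional masking idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `conditional_mask`, the string placeholders standing in for region features, and the 15% masking rate are all assumptions made for illustration. The point it shows is that only one modality is masked per training example, while the other stays fully observed.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder; a real model would mask token ids or zero out region features


def conditional_mask(tokens, regions, mask_text, prob=0.15, rng=None):
    """Mask exactly one modality and leave the other fully observed.

    tokens   : list of text tokens (non-empty)
    regions  : list of image-region placeholders (non-empty)
    mask_text: True to mask words (MLM-style), False to mask regions (MRM-style)
    prob     : per-position masking probability (assumed value, not from the source)
    Returns (tokens_out, regions_out, masked_positions).
    """
    rng = rng or random.Random()
    target = tokens if mask_text else regions

    # Pick positions independently with probability `prob`.
    picked = [i for i in range(len(target)) if rng.random() < prob]
    if not picked:
        # Guarantee at least one masked position so the example carries a training signal.
        picked = [rng.randrange(len(target))]

    masked = list(target)
    for i in picked:
        masked[i] = MASK

    if mask_text:
        return masked, list(regions), picked   # image side untouched
    return list(tokens), masked, picked        # text side untouched


if __name__ == "__main__":
    words = ["a", "dog", "on", "grass"]
    boxes = ["region_0", "region_1", "region_2"]
    print(conditional_mask(words, boxes, mask_text=True, rng=random.Random(0)))
    print(conditional_mask(words, boxes, mask_text=False, rng=random.Random(0)))
```

Under joint random masking, by contrast, both lists would be masked in the same example, so a masked word could coincide with a masked version of the region it describes.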

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: UNiversal Image-TExt Representation Learning