UNITER: UNiversal Image-TExt Representation Learning
Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu

TL;DR
UNITER introduces a large-scale pre-trained model for joint image-text understanding, achieving state-of-the-art results across multiple vision-and-language tasks by employing novel conditional masking and fine-grained alignment techniques.
Contribution
The paper presents UNITER, a universal model for image-text representation learning with innovative pre-training tasks and alignment methods, significantly improving performance on diverse V+L tasks.
Findings
UNITER achieves new state-of-the-art results on six V+L tasks.
Conditional masking and Optimal Transport (OT)-based Word-Region Alignment (WRA) improve pre-training effectiveness.
Extensive ablations identify optimal pre-training task combinations.
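To make the idea behind Optimal Transport (OT) based Word-Region Alignment concrete, here is a minimal sketch of computing a transport plan between word and region embeddings. It is not the authors' implementation: it assumes a cosine-distance cost, uniform weights on both sides, and swaps in plain entropic-regularized Sinkhorn iterations as a standard OT approximation, which may differ from the solver used in the paper. The function names and the `eps` and `iters` parameters are hypothetical.

```python
import math

def cosine_cost(u, v):
    """Cost of aligning two non-zero vectors: 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def sinkhorn_plan(cost, eps=0.1, iters=200):
    """Approximate OT plan with uniform marginals (Sinkhorn; an assumption)."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform mass over words
    b = [1.0 / m] * m  # uniform mass over regions
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def ot_alignment(word_vecs, region_vecs):
    """Return (approximate OT distance, transport plan) for words vs. regions."""
    cost = [[cosine_cost(w, r) for r in region_vecs] for w in word_vecs]
    plan = sinkhorn_plan(cost)
    distance = sum(
        plan[i][j] * cost[i][j]
        for i in range(len(cost))
        for j in range(len(cost[0]))
    )
    return distance, plan

# Toy usage: two words whose embeddings match two regions one-to-one.
words = [[1.0, 0.0], [0.0, 1.0]]
regions = [[1.0, 0.0], [0.0, 1.0]]
distance, plan = ot_alignment(words, regions)
```

In this toy case most of the mass in `plan` sits on the diagonal, meaning each word is aligned with its matching region. A distance of this kind could serve as an alignment loss during pre-training.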
Abstract
Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodality inputs are simultaneously processed for joint visual and textual understanding. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions), which can power heterogeneous downstream V+L tasks with joint multimodal embeddings. We design four pre-training tasks: Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA). Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text). In addition to ITM for…
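The conditional masking described in the abstract can be illustrated with a short sketch: in any one training example, only one modality is masked while the other stays fully observed. This is an illustration under stated assumptions, not the authors' code. The 50/50 choice between masking text and masking regions, the default `mask_prob`, the zero-vector replacement for masked regions, and the helper name `conditional_mask` are all assumptions made here.

```python
import random

MASK = "[MASK]"

def conditional_mask(words, regions, mask_prob=0.15, rng=None):
    """Mask words OR regions, never both (conditional masking sketch).

    words:   list of token strings
    regions: list of region feature vectors (lists of floats)
    Returns (masked_words, masked_regions, masked_side).
    """
    rng = rng or random.Random()
    words = list(words)
    regions = [list(r) for r in regions]

    if rng.random() < 0.5:  # assumption: pick the masked modality at random
        # Masked language modeling: the image is fully observed.
        idx = [i for i in range(len(words)) if rng.random() < mask_prob]
        if not idx:  # ensure at least one position is masked
            idx = [rng.randrange(len(words))]
        for i in idx:
            words[i] = MASK
        return words, regions, "text"

    # Masked region modeling: the text is fully observed.
    idx = [i for i in range(len(regions)) if rng.random() < mask_prob]
    if not idx:
        idx = [rng.randrange(len(regions))]
    for i in idx:
        regions[i] = [0.0] * len(regions[i])  # assumption: zero out features
    return words, regions, "image"

# Toy usage with three tokens and two 2-d region features.
tokens = ["a", "dog", "runs"]
feats = [[0.1, 0.3], [0.2, 0.4]]
masked_tokens, masked_feats, side = conditional_mask(
    tokens, feats, rng=random.Random(0)
)
```

By contrast, joint random masking would apply both branches to the same example, so a masked word could refer to a region that is also masked.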
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: UNiversal Image-TExt Representation Learning
