Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment
Mingyang Zhou, Licheng Yu, Amanpreet Singh, Mengjiao Wang, Zhou Yu, Ning Zhang

TL;DR
This paper introduces an unsupervised vision-and-language pre-training method that uses retrieval-based weak alignment and multi-granular alignment tasks to learn cross-modal representations without parallel data, achieving state-of-the-art results on VQA, NLVR2, Visual Entailment, and RefCOCO+.
Contribution
The paper proposes a novel unsupervised V+L pre-training curriculum using retrieval-based weak alignment and multi-granular alignment tasks, enabling effective cross-modal learning without parallel datasets.
Findings
Achieves state-of-the-art performance on VQA, NLVR2, Visual Entailment, and RefCOCO+ tasks.
Demonstrates the effectiveness of multi-granular alignment in unsupervised V+L pre-training.
Shows that joint image-and-text input and overall alignment are key factors for success.
Abstract
Vision-and-Language (V+L) pre-training models have achieved tremendous success in recent years on various multi-modal benchmarks. However, the majority of existing models require pre-training on a large set of parallel image-text data, which is costly to collect compared to image-only or text-only data. In this paper, we explore unsupervised Vision-and-Language pre-training (UVLP) to learn the cross-modal representation from non-parallel image and text datasets. We found two key factors that lead to good unsupervised V+L pre-training without parallel data: (i) joint image-and-text input and (ii) overall image-text alignment (even for non-parallel data). Accordingly, we propose a novel unsupervised V+L pre-training curriculum for non-parallel texts and images. We first construct a weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular alignment…
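The retrieval step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how images and sentences are compared, so this sketch assumes both have already been embedded into a shared vector space and pairs each image with its nearest sentences by cosine similarity. The function names, the ids, and the toy vectors are all hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_weak_pairs(
    image_vecs: Dict[str, Sequence[float]],
    sentence_vecs: Dict[str, Sequence[float]],
    top_k: int = 1,
) -> Dict[str, List[str]]:
    """For each image id, return the ids of its top_k most similar sentences."""
    pairs: Dict[str, List[str]] = {}
    for image_id, image_vec in image_vecs.items():
        # Rank every sentence id by similarity to this image, best first.
        ranked = sorted(
            sentence_vecs,
            key=lambda sid: cosine(image_vec, sentence_vecs[sid]),
            reverse=True,
        )
        pairs[image_id] = ranked[:top_k]
    return pairs


# Toy, made-up embeddings (assumption: images and sentences share one space).
image_vecs = {"img_dog": [0.9, 0.1, 0.0], "img_car": [0.0, 0.2, 0.9]}
sentence_vecs = {
    "s_dog": [1.0, 0.0, 0.1],
    "s_car": [0.1, 0.1, 1.0],
    "s_food": [0.0, 1.0, 0.0],
}
weak_pairs = retrieve_weak_pairs(image_vecs, sentence_vecs, top_k=1)
print(weak_pairs)
```

The resulting pairs are only weakly aligned: a retrieved sentence is likely to be related to the image but was never written as its caption. Exhaustive ranking is used here for clarity; a large corpus would usually call for an approximate nearest-neighbor index instead.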
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
