Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment

Mingyang Zhou; Licheng Yu; Amanpreet Singh; Mengjiao Wang; Zhou Yu; Ning Zhang

arXiv:2203.00242·cs.CV·March 2, 2022·1 cites

TL;DR

This paper introduces an unsupervised vision-and-language pre-training method that uses retrieval-based weak alignment and multi-granular alignment tasks to learn cross-modal representations without parallel image-text data. It reports state-of-the-art results on VQA, NLVR2, Visual Entailment, and RefCOCO+.

Contribution

The paper proposes a novel unsupervised V+L pre-training curriculum using retrieval-based weak alignment and multi-granular alignment tasks, enabling effective cross-modal learning without parallel datasets.

Findings

01 Achieves state-of-the-art performance on VQA, NLVR2, Visual Entailment, and RefCOCO+ tasks.

02 Demonstrates the effectiveness of multi-granular alignment in unsupervised V+L pre-training.

03 Shows that joint image-and-text input and overall alignment are key factors for success.

Abstract

Vision-and-Language (V+L) pre-training models have achieved tremendous success in recent years on various multi-modal benchmarks. However, the majority of existing models require pre-training on a large set of parallel image-text data, which is costlier to collect than image-only or text-only data. In this paper, we explore unsupervised Vision-and-Language pre-training (UVLP) to learn the cross-modal representation from non-parallel image and text datasets. We found two key factors that lead to good unsupervised V+L pre-training without parallel data: (i) joint image-and-text input and (ii) overall image-text alignment (even for non-parallel data). Accordingly, we propose a novel unsupervised V+L pre-training curriculum for non-parallel texts and images. We first construct a weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular alignment…
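
The retrieval step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how images are represented or how similarity is scored, so the sketch assumes each image is summarized by a few hypothetical object tags and that sentences are ranked by bag-of-words cosine similarity. The function names, `top_k`, and the toy data are all illustrative.

```python
import math
from collections import Counter


def bow(tokens):
    """Bag-of-words vector as a token -> count mapping."""
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_weak_pairs(image_tags, sentences, top_k=1):
    """Pair each image with its top_k most similar sentences.

    Assumption: images are described by object tags (hypothetical here),
    and similarity is plain bag-of-words cosine, not a learned embedding.
    """
    sent_vecs = [bow(s.lower().split()) for s in sentences]
    pairs = {}
    for image_id, tags in image_tags.items():
        query = bow(t.lower() for t in tags)
        ranked = sorted(
            range(len(sentences)),
            key=lambda i: cosine(query, sent_vecs[i]),
            reverse=True,
        )
        pairs[image_id] = [sentences[i] for i in ranked[:top_k]]
    return pairs


# Toy, non-parallel data: the images and sentences come from separate sources.
image_tags = {
    "img_0": ["dog", "frisbee", "grass"],
    "img_1": ["pizza", "table", "plate"],
}
sentences = [
    "a dog catches a frisbee on the grass",
    "a pizza sits on a plate on the table",
    "stock markets fell sharply today",
]
weak_pairs = retrieve_weak_pairs(image_tags, sentences)
```

The resulting pairs are only weakly aligned: a retrieved sentence shares vocabulary with the image's tags but was never written to describe that image, which is the setting the abstract calls non-parallel.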

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques