Masked Vision and Language Modeling for Multi-modal Representation Learning

Gukyeong Kwon; Zhaowei Cai; Avinash Ravichandran; Erhan Bas; Rahul Bhotika; Stefano Soatto

arXiv:2208.02131 · cs.CV · March 16, 2023 · 24 citations

TL;DR

This paper introduces a joint masked vision and language modeling approach that leverages cross-modal signals for improved multi-modal representation learning, achieving state-of-the-art results especially with limited data.

Contribution

It proposes a novel joint masked vision and language modeling method that uses cross-modal reconstruction to enhance multi-modal learning.

Findings

01. Achieves state-of-the-art performance on various V+L tasks with large-scale pre-training.

02. Outperforms competitors significantly in limited-data scenarios.

03. Implicitly learns cross-modal alignment through masked signal reconstruction.

Abstract

In this paper, we study how to use masked signal modeling in vision and language (V+L) representation learning. Instead of developing masked language modeling (MLM) and masked image modeling (MIM) independently, we propose to build joint masked vision and language modeling, where the masked signal of one modality is reconstructed with the help of the other modality. This is motivated by the nature of image-text paired data: the image and the text convey almost the same information but in different formats. The masked signal reconstruction of one modality conditioned on the other modality can also implicitly learn cross-modal alignment between language tokens and image patches. Our experiments on various V+L tasks show that the proposed method, along with common V+L alignment losses, achieves state-of-the-art performance in the regime of millions of pre-training data. Also, we…
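
The idea of reconstructing the masked signal of one modality while conditioning on the other can be illustrated with a small data-preparation sketch. This is not the authors' implementation: the function names, the placeholders `MASK_TOKEN` and `MASK_PATCH`, and the masking ratios are all hypothetical. The choice to leave the conditioning modality unmasked is likewise an assumption made for illustration. The sketch only builds the inputs and targets for the two reconstruction tasks and does not include any model.

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical placeholder for a masked word token
MASK_PATCH = None      # hypothetical placeholder for a masked image patch


def mask_sequence(items, ratio, placeholder, rng):
    """Replace a random subset of items with a placeholder.

    Returns the corrupted sequence and the sorted masked positions.
    """
    n_mask = min(len(items), max(1, round(len(items) * ratio)))
    positions = sorted(rng.sample(range(len(items)), n_mask))
    masked = set(positions)
    corrupted = [placeholder if i in masked else x for i, x in enumerate(items)]
    return corrupted, positions


def build_joint_masking_examples(patches, tokens,
                                 image_ratio=0.6, text_ratio=0.3, seed=0):
    """Build two reconstruction tasks for one image-text pair.

    The ratios are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    masked_patches, patch_pos = mask_sequence(patches, image_ratio, MASK_PATCH, rng)
    masked_tokens, token_pos = mask_sequence(tokens, text_ratio, MASK_TOKEN, rng)
    return {
        # Reconstruct masked patches; the intact caption is the condition.
        "image_task": {
            "inputs": (masked_patches, list(tokens)),
            "targets": {i: patches[i] for i in patch_pos},
        },
        # Reconstruct masked tokens; the intact image is the condition.
        "text_task": {
            "inputs": (list(patches), masked_tokens),
            "targets": {i: tokens[i] for i in token_pos},
        },
    }


patches = [f"patch_{i}" for i in range(10)]  # stand-ins for image patches
tokens = "a dog runs on the grass".split()
examples = build_joint_masking_examples(patches, tokens)
```

In a real pipeline, each task's `inputs` would be fed to a cross-modal encoder and the loss would be computed only at the positions listed in `targets`.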

Videos

Masked Vision and Language Modeling for Multi-modal Representation Learning · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques