Data Efficient Masked Language Modeling for Vision and Language

Yonatan Bitton; Gabriel Stanovsky; Michael Elhadad; Roy Schwartz

arXiv:2109.02040 · cs.CL · September 7, 2021

TL;DR

This paper proposes alternative masking strategies for vision-language masked language modeling that improve data efficiency and downstream task performance by better utilizing training data and enhancing cross-modal fusion.

Contribution

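The paper identifies two limitations of random-token MLM in vision-language pretraining: because captions are short, many sentences end up with no masked token, and most masked tokens are stop-words or punctuation, which can be predicted without the image. It introduces alternative masking strategies tailored to the cross-modal setting that address both.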

Findings

01 When pre-training LXMERT, the alternative masking strategies consistently improve over the original masking strategy on three downstream tasks.

02 The strategies make better use of the training data, with the largest gains in low-resource settings.

03 Significant outperformance on a prompt-based object recognition task.

Abstract

Masked language modeling (MLM) is one of the key sub-tasks in vision-language pretraining. In the cross-modal setting, tokens in the sentence are masked at random, and the model predicts the masked tokens given the image and the text. In this paper, we observe several key disadvantages of MLM in this setting. First, as captions tend to be short, in a third of the sentences no token is sampled. Second, the majority of masked tokens are stop-words and punctuation, leading to under-utilization of the image. We investigate a range of alternative masking strategies specific to the cross-modal setting that address these shortcomings, aiming for better fusion of text and image in the learned representation. When pre-training the LXMERT model, our alternative masking strategies consistently improve over the original masking strategy on three downstream tasks, especially in low resource…
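The first disadvantage named in the abstract can be checked with a little arithmetic. The sketch below is illustrative only and is not taken from the paper's repository: it assumes the BERT-style masking rate of 15% (the abstract does not state the rate), and the helper names `prob_no_token_masked`, `random_mask`, and `mask_one_content_word` are hypothetical. The last function is a simplified stand-in for the idea of masking content-bearing tokens, not the authors' actual strategy.

```python
import random

MASK_PROB = 0.15  # assumption: BERT-style masking rate, not stated in the abstract


def prob_no_token_masked(num_tokens, mask_prob=MASK_PROB):
    """Probability that independent per-token masking selects no token at all."""
    return (1.0 - mask_prob) ** num_tokens


def random_mask(tokens, mask_prob=MASK_PROB, rng=None):
    """Baseline: mask each token independently with probability mask_prob."""
    if rng is None:
        rng = random.Random()
    return ["[MASK]" if rng.random() < mask_prob else t for t in tokens]


def mask_one_content_word(tokens, stop_words, rng=None):
    """Hypothetical alternative: always mask exactly one non-stop-word token."""
    if rng is None:
        rng = random.Random()
    candidates = [
        i for i, t in enumerate(tokens)
        if t.isalnum() and t.lower() not in stop_words
    ]
    if not candidates:
        # Nothing content-bearing to mask; return the caption unchanged.
        return list(tokens)
    chosen = rng.choice(candidates)
    return ["[MASK]" if i == chosen else t for i, t in enumerate(tokens)]


if __name__ == "__main__":
    caption = "a dog sits on the grass .".split()
    stop_words = {"a", "on", "the"}
    print(round(prob_no_token_masked(len(caption)), 3))
    print(random_mask(caption, rng=random.Random(0)))
    print(mask_one_content_word(caption, stop_words, rng=random.Random(0)))
```

Under the assumed 15% rate, a seven-token caption has a probability of 0.85^7, roughly 0.32, of receiving no mask at all, which is in line with the "a third of the sentences" figure quoted above. Restricting the choice to content words guarantees one masked token per caption and ensures that the token is not a stop-word or punctuation mark.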

Code & Models

Repositories

yonatanbitton/data_efficient_masked_language_modeling_for_vision_and_language (Official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Learning Cross-Modality Encoder Representations from Transformers (LXMERT)