Data Efficient Masked Language Modeling for Vision and Language
Yonatan Bitton, Gabriel Stanovsky, Michael Elhadad, Roy Schwartz

TL;DR
This paper proposes alternative masking strategies for vision-language masked language modeling that improve data efficiency and downstream task performance by better utilizing training data and enhancing cross-modal fusion.
Contribution
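The paper introduces masking strategies tailored to vision-language pretraining. They address two limitations of random MLM in this setting: short captions often end up with no masked token, and most masked tokens are stop-words or punctuation, which leads to under-utilization of the image.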
Findings
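When pre-training LXMERT, the alternative masking strategies consistently improve over the original masking strategy on three downstream tasks.
The strategies make better use of the training data, with the largest gains in low-resource settings.
Models pre-trained with the alternative strategies significantly outperform the original-masking baseline on a prompt-based object recognition task.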
Abstract
Masked language modeling (MLM) is one of the key sub-tasks in vision-language pretraining. In the cross-modal setting, tokens in the sentence are masked at random, and the model predicts the masked tokens given the image and the text. In this paper, we observe several key disadvantages of MLM in this setting. First, as captions tend to be short, in a third of the sentences no token is sampled. Second, the majority of masked tokens are stop-words and punctuation, leading to under-utilization of the image. We investigate a range of alternative masking strategies specific to the cross-modal setting that address these shortcomings, aiming for better fusion of text and image in the learned representation. When pre-training the LXMERT model, our alternative masking strategies consistently improve over the original masking strategy on three downstream tasks, especially in low resource…
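The two shortcomings named in the abstract can be illustrated with a small sketch. It assumes the standard BERT-style masking rate of 15% (an assumption here; the text above does not state the rate) and a made-up 7-token caption. The function content_word_mask and the tiny stop-word list are hypothetical illustrations of the general idea, not the authors' actual strategies.

```python
import random

MASK_RATE = 0.15  # assumption: standard BERT-style masking rate

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "on", "in", "is", "and", ".", ","}


def prob_no_token_masked(num_tokens, rate=MASK_RATE):
    """Probability that independent per-token sampling masks nothing."""
    return (1.0 - rate) ** num_tokens


def random_mask(tokens, rate=MASK_RATE, rng=random):
    """Baseline: mask each token independently with probability `rate`."""
    return [i for i in range(len(tokens)) if rng.random() < rate]


def content_word_mask(tokens, rate=MASK_RATE, rng=random):
    """Illustrative alternative: sample only among non-stop-word tokens and
    fall back to one random content word when nothing was sampled."""
    candidates = [i for i, t in enumerate(tokens) if t.lower() not in STOP_WORDS]
    if not candidates:
        return []  # nothing worth masking in this caption
    chosen = [i for i in candidates if rng.random() < rate]
    return chosen or [rng.choice(candidates)]


def apply_mask(tokens, positions, mask_token="[MASK]"):
    """Replace the tokens at `positions` with the mask token."""
    masked = set(positions)
    return [mask_token if i in masked else t for i, t in enumerate(tokens)]


caption = "a dog sits on the grass .".split()  # made-up 7-token caption

# With 7 tokens and a 15% rate, about a third of captions get no mask at all.
print(round(prob_no_token_masked(len(caption)), 2))  # 0.32

rng = random.Random(0)
print(apply_mask(caption, random_mask(caption, rng=rng)))
print(apply_mask(caption, content_word_mask(caption, rng=rng)))
```

For a 7-token caption the no-mask probability is 0.85^7, roughly 0.32, which is consistent with the "a third of the sentences" observation for short captions. The content-word variant always masks at least one token and never spends a mask on a stop-word or punctuation mark, so every training example requires predicting a word that the image can help with.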
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Learning Cross-Modality Encoder Representations from Transformers
