MLIM: Vision-and-Language Model Pre-training with Masked Language and Image Modeling

Tarik Arici; Mehmet Saygin Seyfioglu; Tal Neiman; Yi Xu; Son Train; Trishul Chilimbi; Belinda Zeng; Ismail Tutar

arXiv:2109.12178 · cs.CV · September 28, 2021

TL;DR

This paper introduces MLIM, a simplified vision-and-language pre-training method using masked language and image modeling with modality-aware masking, improving downstream performance on multi-modal e-commerce data.

Contribution

MLIM combines MLM and image reconstruction losses with modality-aware masking to enhance cross-modal learning in a simplified VLP framework.
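
The combination described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function names, the `max_rate` and `recon_weight` parameters, the rule for splitting the masking budget between modalities, and the use of mean squared error for reconstruction are all assumptions made for the example.

```python
import math
import random


def modality_aware_mask(n_text, n_image, rng, max_rate=0.5):
    """Hypothetical masking policy: split one masking budget across modalities.

    Heavy text masking coincides with light image masking (and vice versa),
    so the model has to lean on the other modality to fill in the gaps.
    """
    text_share = rng.random()                # share of the budget given to text
    p_text = max_rate * text_share           # per-token masking probability
    p_image = max_rate * (1.0 - text_share)  # per-patch masking probability
    text_mask = [rng.random() < p_text for _ in range(n_text)]
    image_mask = [rng.random() < p_image for _ in range(n_image)]
    return text_mask, image_mask


def mlm_loss(token_probs, target_ids, text_mask):
    """Mean negative log-likelihood of the true token, masked positions only."""
    nll = [
        -math.log(probs[target])
        for probs, target, masked in zip(token_probs, target_ids, text_mask)
        if masked
    ]
    return sum(nll) / len(nll) if nll else 0.0


def recon_loss(reconstructed, original):
    """Mean squared error over pixels (an assumed choice of distance)."""
    squared = [(r - o) ** 2 for r, o in zip(reconstructed, original)]
    return sum(squared) / len(squared)


def combined_loss(token_probs, target_ids, text_mask,
                  reconstructed, original, recon_weight=1.0):
    """Weighted sum of the text (MLM) and image (reconstruction) terms."""
    return (mlm_loss(token_probs, target_ids, text_mask)
            + recon_weight * recon_loss(reconstructed, original))


# Toy usage: 4 text tokens and 6 image patches, masked with one shared budget.
text_mask, image_mask = modality_aware_mask(4, 6, random.Random(0))
```

In a real model the token probabilities and reconstructed pixels would come from the transformer's text and image heads; here they are plain lists so the two loss terms can be inspected in isolation.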

Findings

01. Better downstream task performance on e-commerce data

02. Effective use of MLM and image reconstruction losses

03. Simplified VLP methodology

Abstract

Vision-and-Language Pre-training (VLP) improves model performance for downstream tasks that require image and text inputs. Current VLP approaches differ on (i) model architecture (especially image embedders), (ii) loss functions, and (iii) masking policies. Image embedders are either deep models like ResNet or linear projections that directly feed image-pixels into the transformer. Typically, in addition to the Masked Language Modeling (MLM) loss, alignment-based objectives are used for cross-modality interaction, and RoI feature regression and classification tasks for Masked Image-Region Modeling (MIRM). Both alignment and MIRM objectives mostly do not have ground truth. Alignment-based objectives require pairings of image and text and heuristic objective functions. MIRM relies on object detectors. Masking policies either do not take advantage of multi-modality or are strictly coupled…

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling

Methods: Attentive Walk-Aggregating Graph Neural Network · 1x1 Convolution · Batch Normalization · Residual Connection · Average Pooling · Max Pooling · Residual Block · Bottleneck Residual Block · Global Average Pooling