MLIM: Vision-and-Language Model Pre-training with Masked Language and Image Modeling
Tarik Arici, Mehmet Saygin Seyfioglu, Tal Neiman, Yi Xu, Son Tran, Trishul Chilimbi, Belinda Zeng, and Ismail Tutar

TL;DR
This paper introduces MLIM, a simplified vision-and-language pre-training method using masked language and image modeling with modality-aware masking, improving downstream performance on multi-modal e-commerce data.
Contribution
MLIM combines MLM and image reconstruction losses with modality-aware masking to enhance cross-modal learning in a simplified VLP framework.
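As a rough illustration of how an MLM loss and an image reconstruction loss might be combined under a modality-aware masking policy, here is a minimal NumPy sketch. It is not the authors' implementation. The function names, the uniform draw for the text masking rate, the complementary image masking rate, the mean-squared-error reconstruction term, and the unweighted sum of the two losses are all assumptions made for illustration.

```python
import numpy as np


def modality_aware_masks(n_tokens, n_patches, rng):
    """Hypothetical masking policy (an assumption, not taken from the paper):
    draw a text masking rate and give the image the complementary rate, so
    that heavy masking in one modality leaves the other mostly visible."""
    p_text = rng.uniform(0.0, 1.0)
    p_image = 1.0 - p_text
    text_mask = rng.random(n_tokens) < p_text
    image_mask = rng.random(n_patches) < p_image
    return text_mask, image_mask, p_text, p_image


def mlm_loss(logits, targets, mask):
    """Cross-entropy averaged over masked token positions only."""
    if not mask.any():
        return 0.0
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = log_probs[np.arange(len(targets)), targets]
    return float(-picked[mask].mean())


def reconstruction_loss(pred_patches, true_patches, mask):
    """Mean squared error over masked image patches only (assumed loss form)."""
    if not mask.any():
        return 0.0
    return float(((pred_patches - true_patches) ** 2)[mask].mean())


def combined_loss(logits, targets, text_mask, pred_patches, true_patches, image_mask):
    """Unweighted sum of the two terms (the weighting is an assumption)."""
    return mlm_loss(logits, targets, text_mask) + reconstruction_loss(
        pred_patches, true_patches, image_mask
    )
```

In a real model, `logits` and `pred_patches` would come from a transformer that sees the unmasked tokens and patches of both modalities. Here they are plain arrays so the loss arithmetic can be checked in isolation.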
Findings
Better downstream task performance on e-commerce data
Effective use of MLM and image reconstruction losses
Simplified VLP methodology
Abstract
Vision-and-Language Pre-training (VLP) improves model performance for downstream tasks that require image and text inputs. Current VLP approaches differ on (i) model architecture (especially image embedders), (ii) loss functions, and (iii) masking policies. Image embedders are either deep models like ResNet or linear projections that directly feed image-pixels into the transformer. Typically, in addition to the Masked Language Modeling (MLM) loss, alignment-based objectives are used for cross-modality interaction, and RoI feature regression and classification tasks for Masked Image-Region Modeling (MIRM). Both alignment and MIRM objectives mostly do not have ground truth. Alignment-based objectives require pairings of image and text and heuristic objective functions. MIRM relies on object detectors. Masking policies either do not take advantage of multi-modality or are strictly coupled…
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
Methods: Attentive Walk-Aggregating Graph Neural Network · 1x1 Convolution · Batch Normalization · Residual Connection · Average Pooling · Max Pooling · Residual Block · Bottleneck Residual Block · Global Average Pooling
