MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning

Zijia Zhao; Longteng Guo; Xingjian He; Shuai Shao; Zehuan Yuan; Jing Liu

arXiv:2210.04183 · cs.CV · June 16, 2023 · 1 citation

Open Access

TL;DR

MAMO introduces a masked multimodal modeling approach that enhances fine-grained vision-language representations by jointly masking inputs and predicting both implicit and explicit targets, leading to improved performance on multiple tasks.

Contribution

The paper presents a novel joint masking strategy with dual targets to learn fine-grained multimodal interactions, bridging the semantic gap in vision-language models.

Findings

01. Achieves state-of-the-art results on image-text retrieval and VQA.

02. Effectively learns fine-grained multimodal interactions.

03. Improves zero-shot and fine-tuned task performance.

Abstract

Multimodal representation learning has shown promising improvements on various vision-language tasks. Most existing methods excel at building global-level alignment between vision and language while lacking effective fine-grained image-text interaction. In this paper, we propose a jointly masked multimodal modeling method to learn fine-grained multimodal representations. Our method performs joint masking on image-text input and integrates both implicit and explicit targets for the masked signals to recover. The implicit target provides a unified and debiased objective for vision and language, where the model predicts latent multimodal representations of the unmasked input. The explicit target further enriches the multimodal representations by recovering high-level and semantically meaningful information: momentum visual features of image patches and concepts of word tokens. Through such…
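
The joint masking and implicit target described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the helper names `joint_mask` and `masked_mse`, the mask ratios, the `[MASK]` placeholder, and the use of mean squared error as the regression loss are all hypothetical. The sketch shows only the general pattern: mask image patches and text tokens together, then compare the student's predictions at masked positions against latent representations a teacher computed from the unmasked input.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder for a masked text token


def joint_mask(patches, tokens, patch_ratio, token_ratio, rng):
    """Mask a random subset of image patches and text tokens in one step.

    The ratios are illustrative assumptions, not values from the paper.
    """
    n_patches = max(1, round(len(patches) * patch_ratio))
    n_tokens = max(1, round(len(tokens) * token_ratio))
    patch_idx = sorted(rng.sample(range(len(patches)), n_patches))
    token_idx = sorted(rng.sample(range(len(tokens)), n_tokens))
    # Masked patches become None; masked tokens become the MASK placeholder.
    masked_patches = [None if i in patch_idx else p for i, p in enumerate(patches)]
    masked_tokens = [MASK if i in token_idx else t for i, t in enumerate(tokens)]
    return masked_patches, masked_tokens, patch_idx, token_idx


def masked_mse(student, teacher, masked_idx):
    """Mean squared error over masked positions only.

    `student` holds predictions made from the masked input; `teacher` holds
    latent representations of the unmasked input (the implicit target).
    MSE is an assumed stand-in for whatever regression loss is actually used.
    """
    total, count = 0.0, 0
    for i in masked_idx:
        for s, t in zip(student[i], teacher[i]):
            total += (s - t) ** 2
            count += 1
    return total / count


rng = random.Random(0)
patches = [f"patch_{i}" for i in range(8)]
tokens = "a dog runs on grass".split()
masked_patches, masked_tokens, patch_idx, token_idx = joint_mask(
    patches, tokens, patch_ratio=0.5, token_ratio=0.25, rng=rng
)
print(masked_patches, masked_tokens)
```

Restricting the loss to masked positions reflects the abstract's framing that the masked signals are what the model must recover. The explicit targets (momentum visual features and word-token concepts) would add further loss terms that this sketch leaves out.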

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques