MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning
Zijia Zhao, Longteng Guo, Xingjian He, Shuai Shao, Zehuan Yuan, Jing, Liu

TL;DR
MAMO introduces a masked multimodal modeling approach that enhances fine-grained vision-language representations by jointly masking inputs and predicting both implicit and explicit targets, leading to improved performance on multiple tasks.
Contribution
The paper presents a novel joint masking strategy with dual targets to learn fine-grained multimodal interactions, bridging the semantic gap in vision-language models.
Findings
Achieves state-of-the-art results on image-text retrieval and VQA.
Effectively learns fine-grained multimodal interactions.
Improves zero-shot and fine-tuned task performance.
Abstract
Multimodal representation learning has shown promising improvements on various vision-language tasks. Most existing methods excel at building global-level alignment between vision and language while lacking effective fine-grained image-text interaction. In this paper, we propose a jointly masked multimodal modeling method to learn fine-grained multimodal representations. Our method performs joint masking on image-text input and integrates both implicit and explicit targets for the masked signals to recover. The implicit target provides a unified and debiased objective for vision and language, where the model predicts latent multimodal representations of the unmasked input. The explicit target further enriches the multimodal representations by recovering high-level and semantically meaningful information: momentum visual features of image patches and concepts of word tokens. Through such…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
