Multimodal Masked Autoencoders Learn Transferable Representations
Xinyang Geng, Hao Liu, Lisa Lee, Dale Schuurmans, Sergey Levine, Pieter Abbeel

TL;DR
This paper introduces M3AE, a scalable multimodal masked autoencoder that learns transferable representations from vision and language data without modality-specific encoders or contrastive learning, outperforming existing methods.
Contribution
The paper proposes M3AE, a simple unified architecture trained via masked token prediction, capable of leveraging both paired and unpaired multimodal data for transferable representations.
Findings
M3AE learns generalizable representations for downstream tasks.
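M3AE benefits from a high text mask ratio (50-90%), well above the 15% that is standard for BERT, which the authors attribute to jointly training on two modalities.
M3AE scales with larger models and can also train on unpaired data, not only paired image-text data.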
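The masking step behind these findings can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 75% default ratios, and the toy inputs are assumptions made for illustration, and only the 50-90% text mask range comes from the summary above. The sketch treats image patches and text tokens as one sequence, hides a random subset of each modality, and returns the visible tokens (what an encoder would see) and the hidden tokens (what a decoder would be trained to predict).

```python
import random


def sample_mask(num_tokens, mask_ratio, rng):
    """Pick round(num_tokens * mask_ratio) token indices to hide, uniformly at random."""
    num_masked = int(round(num_tokens * mask_ratio))
    return set(rng.sample(range(num_tokens), num_masked))


def mask_multimodal_sequence(image_patches, text_tokens,
                             image_mask_ratio=0.75, text_mask_ratio=0.75,
                             seed=0):
    """Split one combined image+text sequence into visible and hidden tokens.

    The default ratios are illustrative assumptions; the text ratio sits
    inside the 50-90% range reported in the findings.
    """
    rng = random.Random(seed)
    hidden_image = sample_mask(len(image_patches), image_mask_ratio, rng)
    hidden_text = sample_mask(len(text_tokens), text_mask_ratio, rng)

    visible, targets = [], []
    # Each entry is (modality, position, value) so both modalities can share
    # one sequence and one encoder while staying distinguishable.
    for position, patch in enumerate(image_patches):
        entry = ("image", position, patch)
        (targets if position in hidden_image else visible).append(entry)
    for position, token in enumerate(text_tokens):
        entry = ("text", position, token)
        (targets if position in hidden_text else visible).append(entry)
    return visible, targets


if __name__ == "__main__":
    patches = list(range(16))  # stand-ins for 16 image patch embeddings
    caption = ["a", "photo", "of", "a", "dog", "on", "grass", "today"]
    visible, targets = mask_multimodal_sequence(patches, caption)
    print(len(visible), "visible tokens,", len(targets), "tokens to predict")
```

Passing an empty list for `text_tokens` yields an image-only example through the same code path, which is one way a single unified sequence format can accommodate unpaired data.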
Abstract
Building scalable models to learn from diverse, multimodal data remains an open challenge. For vision-language data, the dominant approaches are based on contrastive learning objectives that train a separate encoder for each modality. While effective, contrastive learning approaches introduce sampling bias depending on the data augmentations used, which can degrade performance on downstream tasks. Moreover, these methods are limited to paired image-text data, and cannot leverage widely-available unpaired data. In this paper, we investigate whether a large multimodal model trained purely via masked token prediction, without using modality-specific encoders or contrastive learning, can learn transferable representations for downstream tasks. We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE), which learns a unified encoder for both vision and…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: Attention Is All You Need · Linear Layer · Dense Connections · Dropout · Weight Decay · WordPiece · Multi-Head Attention · Attention Dropout · Linear Warmup With Linear Decay
