CM3: A Causal Masked Multimodal Model of the Internet
Armen Aghajanyan, Bernie Huang, Candace Ross, Vladimir Karpukhin, Hu Xu, Naman Goyal, Dmytro Okhonko, Mandar Joshi, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer

TL;DR
CM3 is a causally masked multimodal model trained on web and Wikipedia documents that can generate and interpret combined text and image content in a zero-shot setting.
Contribution
The paper presents a causally masked generative model that combines left-to-right generation with bidirectional context for masked spans, supporting a range of zero-shot tasks across text and images.
Findings
Achieves state-of-the-art zero-shot summarization, entity linking, and entity disambiguation.
Can generate images conditioned on text and perform captioning in a zero-shot manner.
Maintains competitive performance in fine-tuned multimodal tasks.
Abstract
We introduce CM3, a family of causally masked generative models trained over a large corpus of structured multi-modal documents that can contain both text and image tokens. Our new causally masked approach generates tokens left to right while also masking out a small number of long token spans that are generated at the end of the string, instead of their original positions. The causal masking objective provides a type of hybrid of the more common causal and masked language models, by enabling full generative modeling while also providing bidirectional context when generating the masked spans. We train causally masked language-image models on large-scale web and Wikipedia articles, where each document contains all of the text, hypertext markup, hyperlinks, and image tokens (from a VQVAE-GAN), provided in the order they appear in the original HTML source (before masking). The resulting CM3…
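The span-moving step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `causal_mask`, the `<mask:i>` sentinel format, and the choice to pass spans in explicitly (rather than sampling them) are assumptions made for illustration only.

```python
def causal_mask(tokens, spans):
    """Move each selected span to the end of the sequence.

    tokens: list of string tokens (text or image tokens alike).
    spans:  non-overlapping (start, end) index pairs, end exclusive.
            How spans are chosen is left to the caller (assumption).
    """
    body, tail, cursor = [], [], 0
    for i, (start, end) in enumerate(sorted(spans)):
        if start < cursor or start >= end or end > len(tokens):
            raise ValueError(f"invalid or overlapping span: {(start, end)}")
        sentinel = f"<mask:{i}>"  # hypothetical sentinel naming
        # Keep the unmasked tokens, then mark where the span used to be.
        body.extend(tokens[cursor:start])
        body.append(sentinel)
        # The span itself is appended after the full document, so a
        # left-to-right model sees both sides of the gap before filling it.
        tail.append(sentinel)
        tail.extend(tokens[start:end])
        cursor = end
    body.extend(tokens[cursor:])
    return body + tail


if __name__ == "__main__":
    doc = "the quick brown fox jumps".split()
    print(" ".join(causal_mask(doc, [(1, 3)])))
    # the <mask:0> fox jumps <mask:0> quick brown
```

A standard left-to-right language modeling loss over the rearranged sequence then trains ordinary generation on the body and infilling on the tail, which is the hybrid behaviour the abstract describes.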
Videos
CM3: A Causal Masked Multimodal Model of the Internet (Paper Explained w/ Author Interview) · YouTube
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
