Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning
Lili Yu, Bowen Shi, Ramakanth Pasunuru, Benjamin Muller, Olga Golovneva, Tianlu Wang, Arun Babu, Binh Tang, Brian Karrer, Shelly Sheynin, Candace Ross, Adam Polyak, Russell Howes, Vasu Sharma, Puxin Xu, Hovhannes Tamoyan, Oron Ashual, Uriel Singer, Shang-Wen Li, Susan Zhang

TL;DR
CM3Leon is a scalable, retrieval-augmented multi-modal language model that generates high-quality text and images, is fine-tuned for diverse tasks, and needs less training compute than comparable methods.
Contribution
The paper introduces a novel training recipe for multi-modal models that combines retrieval-augmented pretraining with multi-task supervised fine-tuning and achieves state-of-the-art results.
Findings
State-of-the-art text-to-image generation with less compute
High controllability in image editing and generation tasks
Effective multi-modal training recipe
Abstract
We present CM3Leon (pronounced "Chameleon"), a retrieval-augmented, token-based, decoder-only multi-modal language model capable of generating and infilling both text and images. CM3Leon uses the CM3 multi-modal architecture but additionally shows the extreme benefits of scaling up and tuning on more diverse instruction-style data. It is the first multi-modal model trained with a recipe adapted from text-only language models, including a large-scale retrieval-augmented pre-training stage and a second multi-task supervised fine-tuning (SFT) stage. It is also a general-purpose model that can do both text-to-image and image-to-text generation, allowing us to introduce self-contained contrastive decoding methods that produce high-quality outputs. Extensive experiments demonstrate that this recipe is highly effective for multi-modal models. CM3Leon achieves state-of-the-art performance in…
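The abstract mentions contrastive decoding without spelling out the mechanics. One common way to contrast two predictions from the same model is to mix conditional and unconditional logits, in the style of classifier-free guidance. The sketch below is illustrative only: the function names, the guidance scale `alpha`, and the toy logits are assumptions made for this example, not the paper's exact method.

```python
import math

def guided_logits(cond, uncond, alpha):
    """Blend two logit vectors as uncond + alpha * (cond - uncond).

    cond   : logits given the prompt (hypothetical values here)
    uncond : logits given an empty or masked prompt
    alpha  : guidance scale; 1.0 returns cond, 0.0 returns uncond
    """
    return [u + alpha * (c - u) for c, u in zip(cond, uncond)]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy three-token vocabulary; the numbers are made up for illustration.
cond = [2.0, 1.0, 0.0]
uncond = [1.0, 1.0, 1.0]

plain = softmax(cond)
guided = softmax(guided_logits(cond, uncond, alpha=3.0))
```

With `alpha` above 1, tokens that the prompt makes more likely gain probability mass relative to plain conditional sampling, which is the intuition behind guidance-style decoding.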
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
Methods: Shrink and Fine-Tune
