MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data
William Berman, Alexander Peysakhovich

TL;DR
This paper introduces MUMU, a multimodal image generation model trained on semantically meaningful image crops. It composes inputs from different images into a single output and generalizes to tasks such as style transfer from interleaved text-image prompts.
Contribution
MUMU learns from interleaved text and image crops bootstrapped from existing text-image data, enabling coherent image composition and style transfer from multimodal prompts.
Findings
MUMU can generate images combining elements from different images.
The model generalizes to style transfer and character consistency tasks.
Training on image crops allows effective multimodal image synthesis.
Abstract
We train a model to generate images from multimodal prompts of interleaved text and images such as "a <picture of a man> man and his <picture of a dog> dog in an <picture of a cartoon> animated style." We bootstrap a multimodal dataset by extracting semantically meaningful image crops corresponding to words in the image captions of synthetically generated and publicly available text-image data. Our model, MUMU, is composed of a vision-language model encoder with a diffusion decoder and is trained on a single 8xH100 GPU node. Despite being only trained on crops from the same image, MUMU learns to compose inputs from different images into a coherent output. For example, an input of a realistic person and a cartoon will output the same person in the cartoon style, and an input of a standing subject and a scooter will output the subject riding the scooter. As a result, our model generalizes…
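The bootstrapping step described in the abstract, pairing words in a caption with crops from the same image, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `CropRef`, `build_interleaved_prompt`, and `render`, the bounding-box format, and the idea that an upstream detector has already mapped caption words to boxes. The abstract does not specify how crops are extracted, so the paper's actual pipeline may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

# Hypothetical box format: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


@dataclass(frozen=True)
class CropRef:
    """Stand-in for an image crop; a real pipeline would carry pixel data."""
    word: str
    box: Box


Segment = Union[str, CropRef]


def build_interleaved_prompt(caption: str, detections: Dict[str, Box]) -> List[Segment]:
    """Interleave caption text with crop placeholders.

    `detections` maps a lowercase caption word to the box of the matching
    region (assumed to come from some detector, not specified in the source).
    Each crop is placed immediately before the word it depicts, mirroring
    the "<picture of a man> man" pattern in the abstract's example prompt.
    """
    segments: List[Segment] = []
    text_buffer: List[str] = []
    for token in caption.split():
        key = token.strip(".,").lower()
        if key in detections:
            # Flush pending text, then emit the crop ahead of its word.
            if text_buffer:
                segments.append(" ".join(text_buffer))
                text_buffer = []
            segments.append(CropRef(word=key, box=detections[key]))
        text_buffer.append(token)
    if text_buffer:
        segments.append(" ".join(text_buffer))
    return segments


def render(segments: List[Segment]) -> str:
    """Render segments as readable text, with crops shown as tags."""
    return " ".join(
        s if isinstance(s, str) else f"<picture of {s.word}>" for s in segments
    )


prompt = build_interleaved_prompt(
    "a man and his dog",
    {"man": (10, 20, 110, 220), "dog": (120, 150, 200, 230)},
)
print(render(prompt))
```

In this sketch both crops come from the same source image, which matches the training setup the abstract describes. At inference time the same interleaved structure could hold crops taken from different images.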
Taxonomy
Topics: Image Retrieval and Classification Techniques
Methods: Diffusion
