Generating Images with Multimodal Language Models
Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov

TL;DR
This paper introduces a multimodal model that fuses frozen text-only large language models with pre-trained image encoder and decoder models, enabling image retrieval, novel image generation, and multimodal dialogue. It outperforms baseline generation models on longer, more complex language inputs.
Contribution
The paper presents a method for fusing frozen LLMs with pre-trained visual models by mapping between their embedding spaces, so the model can condition on arbitrarily interleaved image and text inputs and produce image and text outputs.
Findings
Outperforms baseline generation models on tasks with longer and more complex language
Supports image retrieval and novel image generation conditioned on arbitrarily interleaved image and text inputs
Outperforms non-LLM models in text-to-image tasks
Abstract
We propose a method to fuse frozen text-only large language models (LLMs) with pre-trained image encoder and decoder models, by mapping between their embedding spaces. Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue. Ours is the first approach capable of conditioning on arbitrarily interleaved image and text inputs to generate coherent image (and text) outputs. To achieve strong performance on image generation, we propose an efficient mapping network to ground the LLM to an off-the-shelf text-to-image generation model. This mapping network translates hidden representations of text into the embedding space of the visual models, enabling us to leverage the strong text representations of the LLM for visual outputs. Our approach outperforms baseline generation models on tasks with longer and more complex…
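The mapping idea in the abstract can be illustrated with a toy sketch. This is not the authors' network: it swaps in a single randomly initialised linear projection where the paper describes a learned mapping network, and every name here (`make_linear_map`, `project`, `retrieve`, the dimensions `D_LLM` and `D_VIS`) is hypothetical. It only shows the data flow: an LLM hidden state is projected into the visual embedding space, and the projected vector is compared against candidate image embeddings by cosine similarity.

```python
import math
import random


def make_linear_map(d_in, d_out, seed=0):
    """Return a d_out x d_in weight matrix with random entries.

    Assumption: random weights stand in for trained mapping parameters.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d_in)
    return [[rng.gauss(0.0, scale) for _ in range(d_in)] for _ in range(d_out)]


def project(weights, hidden):
    """Map one LLM hidden state into the visual embedding space (W @ h)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query, candidates):
    """Return the index of the candidate embedding most similar to the query."""
    return max(range(len(candidates)), key=lambda i: cosine(query, candidates[i]))


# Toy sizes chosen for illustration only.
D_LLM, D_VIS = 8, 4

W = make_linear_map(D_LLM, D_VIS)
rng = random.Random(1)

# Stand-in for a frozen LLM's hidden representation of some text.
hidden_state = [rng.gauss(0.0, 1.0) for _ in range(D_LLM)]
visual_query = project(W, hidden_state)

# Three random "image embeddings" plus one that is a scaled copy of the query,
# so it has cosine similarity 1 with the query and should be retrieved.
candidates = [[rng.gauss(0.0, 1.0) for _ in range(D_VIS)] for _ in range(3)]
candidates.append([2.0 * v for v in visual_query])

best = retrieve(visual_query, candidates)
print("retrieved candidate index:", best)
```

In a setup like the one the abstract describes, the same projected vector could also be passed to an image decoder as conditioning instead of being used for retrieval; that step is omitted here because it depends on the specific text-to-image model.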
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling
