Generating Images with Multimodal Language Models

Jing Yu Koh; Daniel Fried; Ruslan Salakhutdinov

arXiv:2305.17216 · cs.CL · October 16, 2023 · 39 cites

TL;DR

This paper introduces a multimodal model that combines a frozen large language model with pre-trained visual encoders and decoders. The model supports image retrieval, novel image generation, and multimodal dialogue, and it outperforms baseline generation models on longer, more complex language inputs.

Contribution

It presents a novel method to fuse frozen LLMs with visual models via embedding space mapping, allowing flexible multimodal input processing and output generation.

Findings

01 Outperforms baseline generation models on tasks with longer and more complex language

02 Capable of image retrieval and generation from interleaved image and text inputs

03 Outperforms non-LLM models in text-to-image tasks

Abstract

We propose a method to fuse frozen text-only large language models (LLMs) with pre-trained image encoder and decoder models, by mapping between their embedding spaces. Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue. Ours is the first approach capable of conditioning on arbitrarily interleaved image and text inputs to generate coherent image (and text) outputs. To achieve strong performance on image generation, we propose an efficient mapping network to ground the LLM to an off-the-shelf text-to-image generation model. This mapping network translates hidden representations of text into the embedding space of the visual models, enabling us to leverage the strong text representations of the LLM for visual outputs. Our approach outperforms baseline generation models on tasks with longer and more complex…
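
The mapping idea in the abstract can be illustrated with a small sketch. This is not the paper's mapping network, whose architecture the abstract does not specify. It is a plain linear map trained with a mean squared error objective, which is an assumption made here for illustration. The dimensions, the synthetic data, and every name (`train_linear_map`, `LLM_DIM`, `VIS_DIM`, `W_true`) are hypothetical. The random vectors stand in for frozen LLM hidden states, and the targets stand in for the embeddings an off-the-shelf text-to-image model would expect.

```python
import random

def matvec(W, x):
    """Apply the linear map W (one row per output dimension) to vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)

def mean_loss(W, pairs):
    """Average MSE of the mapped hidden states against their targets."""
    return sum(mse(matvec(W, h), t) for h, t in pairs) / len(pairs)

def train_linear_map(pairs, in_dim, out_dim, lr=0.05, epochs=200, seed=0):
    """Fit W by per-sample gradient descent on the MSE loss.

    Only W is updated; the models that produced the hidden states and the
    target embeddings are treated as frozen (assumption for this sketch).
    """
    rng = random.Random(seed)
    W = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    for _ in range(epochs):
        for h, target in pairs:
            pred = matvec(W, h)
            for i in range(out_dim):
                # d(MSE)/d(pred[i]) = 2 * (pred[i] - target[i]) / out_dim
                grad_i = 2.0 * (pred[i] - target[i]) / out_dim
                for j in range(in_dim):
                    W[i][j] -= lr * grad_i * h[j]
    return W

# Synthetic stand-ins: hypothetical sizes, far smaller than real models.
LLM_DIM, VIS_DIM = 4, 3
rng = random.Random(1)
W_true = [[rng.uniform(-1, 1) for _ in range(LLM_DIM)] for _ in range(VIS_DIM)]

pairs = []
for _ in range(32):
    h = [rng.uniform(-1, 1) for _ in range(LLM_DIM)]  # fake LLM hidden state
    pairs.append((h, matvec(W_true, h)))              # fake visual-space target

W_init = train_linear_map(pairs, LLM_DIM, VIS_DIM, epochs=0)  # untrained map
W_fit = train_linear_map(pairs, LLM_DIM, VIS_DIM)

initial_loss = mean_loss(W_init, pairs)
final_loss = mean_loss(W_fit, pairs)
print(f"loss before training: {initial_loss:.6f}")
print(f"loss after training:  {final_loss:.6f}")
```

The design point the sketch shares with the abstract is that only the small map between the two embedding spaces is trained, while both endpoints stay fixed. A real system would replace the synthetic pairs with hidden states from the LLM and embeddings from the image generation model's text encoder.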

Code & Models

Repositories

kohjingyu/gill · PyTorch · Official

Videos

Generating Images with Multimodal Language Models · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling