Grounding Language Models to Images for Multimodal Inputs and Outputs

Jing Yu Koh; Ruslan Salakhutdinov; Daniel Fried

arXiv:2301.13823 · cs.CL · June 16, 2023 · 25 cites

TL;DR

This paper introduces a method to adapt pretrained text-only language models for multimodal tasks by grounding them in the visual domain, enabling interleaved image-text processing and generation without retraining the entire model.

Contribution

The authors propose a simple finetuning approach that keeps the language model frozen and trains only the input and output linear layers that map between the text and visual modalities, allowing flexible handling of multimodal inputs and outputs.
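To make the input side of this idea concrete, here is a minimal sketch in plain Python of how a single linear layer could place image features in the same space as a frozen language model's token embeddings, so that images and text can be interleaved in one sequence. It is an illustration under stated assumptions, not the authors' implementation: the names `linear` and `build_interleaved_inputs`, the toy dimensions, the hand-picked weights, and the choice of one embedding per image are all hypothetical.

```python
# Toy sizes, chosen only for illustration (assumptions, not from the paper).
LM_DIM, VIS_DIM = 4, 3


def linear(x, weight, bias):
    """Apply y = W x + b, with `weight` stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def build_interleaved_inputs(segments, token_embeddings, input_weight, input_bias):
    """Convert mixed text/image segments into one sequence of LM-sized vectors.

    `segments` is a list of ("text", token_ids) or ("image", feature_vector).
    Text goes through the frozen embedding table; image features go through
    the trainable input linear layer.
    """
    sequence = []
    for kind, payload in segments:
        if kind == "text":
            sequence.extend(token_embeddings[token_id] for token_id in payload)
        elif kind == "image":
            sequence.append(linear(payload, input_weight, input_bias))
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    return sequence


# Frozen (never updated) token embedding table of the hypothetical LM.
token_embeddings = {0: [0.1] * LM_DIM, 1: [0.2] * LM_DIM, 2: [0.3] * LM_DIM}

# The only trainable parameters on the input side: a VIS_DIM -> LM_DIM map.
input_weight = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]
input_bias = [0.0] * LM_DIM

# Text, then an image, then more text, in arbitrary order.
segments = [("text", [0, 1]), ("image", [0.5, -1.0, 2.0]), ("text", [2])]
sequence = build_interleaved_inputs(segments, token_embeddings, input_weight, input_bias)
```

Because every element of `sequence` has the language model's embedding size, the frozen model can consume it like ordinary text; in a training setup only `input_weight` and `input_bias` would receive gradient updates.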

Findings

01 Achieves strong zero-shot performance on grounded tasks such as contextual image retrieval and multimodal dialogue

02 Enables processing of arbitrarily interleaved image and text inputs

03 Supports generation of text interleaved with retrieved images

Abstract

We propose an efficient method to ground pretrained text-only language models to the visual domain, enabling them to process arbitrarily interleaved image-and-text data, and generate text interleaved with retrieved images. Our method leverages the abilities of language models learnt from large scale text-only pretraining, such as in-context learning and free-form text generation. We keep the language model frozen, and finetune input and output linear layers to enable cross-modality interactions. This allows our model to process arbitrarily interleaved image-and-text inputs, and generate free-form text interleaved with retrieved images. We achieve strong zero-shot performance on grounded tasks such as contextual image retrieval and multimodal dialogue, and showcase compelling interactive abilities. Our approach works with any off-the-shelf language model and paves the way towards an…
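The output side described in the abstract, where text is generated alongside retrieved images, can be sketched in a similar way. The example below assumes that a hidden state from the frozen language model is mapped by a trainable output linear layer into a shared retrieval space and compared with precomputed image embeddings by cosine similarity. The function names, the toy vectors, and the use of cosine similarity are assumptions made for illustration, not details confirmed by this page.

```python
import math


def linear(x, weight, bias):
    """Apply y = W x + b, with `weight` stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(hidden_state, output_weight, output_bias, image_embeddings):
    """Project an LM hidden state into retrieval space and rank candidates."""
    query = linear(hidden_state, output_weight, output_bias)
    scores = [cosine(query, emb) for emb in image_embeddings]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


# Hypothetical 4-dimensional hidden state from the frozen language model.
hidden_state = [1.0, 0.0, 2.0, 0.0]

# Trainable output layer: maps 4-d LM space to a 2-d retrieval space.
output_weight = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
output_bias = [0.0, 0.0]

# Precomputed embeddings of three candidate images (toy values).
image_embeddings = [[1.0, 0.0], [1.0, 2.1], [-1.0, -2.0]]

best_index, scores = retrieve(hidden_state, output_weight, output_bias, image_embeddings)
```

Here the second candidate points in almost the same direction as the projected query, so it is ranked first. A real system would learn `output_weight` and `output_bias` from paired image-text data while leaving the language model untouched.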

Code & Models

Videos

Grounding Language Models to Images for Multimodal Inputs and Outputs · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling