Language Models Can See: Plugging Visual Controls in Text Generation

Yixuan Su; Tian Lan; Yahui Liu; Fangyu Liu; Dani Yogatama; Yan Wang; Lingpeng Kong; Nigel Collier

arXiv:2205.02655 · cs.CV · June 1, 2022 · 38 cites

TL;DR

This paper introduces MAGIC, a training-free, plug-and-play framework that lets an off-the-shelf language model (GPT-2) use visual information from images during text generation. On zero-shot image captioning, it outperforms prior state-of-the-art methods while decoding nearly 27 times faster.

Contribution

MAGIC is presented as a zero-shot, training-free method that directly combines an off-the-shelf GPT-2 with CLIP for image-grounded text generation.

Findings

01. Outperforms previous state-of-the-art zero-shot image captioning methods.

02. Decodes nearly 27 times faster than the previous state-of-the-art zero-shot method.

03. Capable of visually grounded story generation.

Abstract

Generative language models (LMs) such as GPT-2/3 can be prompted to generate text with remarkable quality. While they are designed for text-prompted generation, it remains an open question how the generation process could be guided by modalities beyond text such as images. In this work, we propose a training-free framework, called MAGIC (iMAge-Guided text generatIon with CLIP), for plugging in visual controls in the generation process and enabling LMs to perform multimodal tasks (e.g., image captioning) in a zero-shot manner. MAGIC is a simple yet efficient plug-and-play framework, which directly combines an off-the-shelf LM (i.e., GPT-2) and an image-text matching model (i.e., CLIP) for image-grounded text generation. During decoding, MAGIC influences the generation of the LM by introducing a CLIP-induced score, called magic score, which regularizes the generated result to be…
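The decoding idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes, for illustration only, that each decoding step scores a handful of candidate tokens by adding the LM probability to a weighted, softmax-normalized image-text similarity. The function name `pick_next_token`, the weight `beta`, and the toy similarity values are all hypothetical; in a real system the probabilities would come from GPT-2 and the similarities from CLIP.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def pick_next_token(candidates, lm_probs, image_text_sims, beta=2.0):
    """Choose one token among the LM's candidate tokens.

    candidates:      candidate token strings (e.g. the LM's top-k)
    lm_probs:        LM probability of each candidate given the prefix
    image_text_sims: stand-in for a CLIP-style similarity between the
                     image and (prefix + candidate); hypothetical values
    beta:            assumed weight of the visual term (0 disables it)
    """
    if not (len(candidates) == len(lm_probs) == len(image_text_sims)):
        raise ValueError("all inputs must have the same length")
    # Normalize similarities over the candidate set so they are
    # comparable in scale to the LM probabilities.
    visual = softmax(image_text_sims)
    scores = [p + beta * v for p, v in zip(lm_probs, visual)]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores


# Toy step: the LM alone prefers "car", but the image matches "dog".
tokens = ["dog", "car", "tree"]
lm_probs = [0.2, 0.5, 0.3]
sims = [5.0, 1.0, 1.0]  # made-up similarity scores
guided, _ = pick_next_token(tokens, lm_probs, sims, beta=2.0)
unguided, _ = pick_next_token(tokens, lm_probs, sims, beta=0.0)
print(guided, unguided)
```

Because the visual term only re-ranks candidates at each step, a scheme like this needs forward passes only, which is consistent with the page's description of MAGIC as training-free.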

Code & Models

Repositories

yxuansu/magic (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Artificial Intelligence in Games