Language Models Can See: Plugging Visual Controls in Text Generation
Yixuan Su, Tian Lan, Yahui Liu, Fangyu Liu, Dani Yogatama, Yan Wang, Lingpeng Kong, and Nigel Collier

TL;DR
This paper introduces MAGIC, a training-free, plug-and-play framework that lets an off-the-shelf language model use visual information from an image during text generation. On tasks such as image captioning, it outperforms prior zero-shot methods while decoding nearly 27 times faster.
Contribution
MAGIC is the first zero-shot, training-free method combining GPT-2 and CLIP for multimodal text generation with image grounding.
Findings
MAGIC outperforms state-of-the-art zero-shot image captioning methods.
It decodes nearly 27 times faster than the previous state-of-the-art zero-shot method.
It can also generate visually grounded stories.
Abstract
Generative language models (LMs) such as GPT-2/3 can be prompted to generate text with remarkable quality. While they are designed for text-prompted generation, it remains an open question how the generation process could be guided by modalities beyond text such as images. In this work, we propose a training-free framework, called MAGIC (iMAge-Guided text generatIon with CLIP), for plugging in visual controls in the generation process and enabling LMs to perform multimodal tasks (e.g., image captioning) in a zero-shot manner. MAGIC is a simple yet efficient plug-and-play framework, which directly combines an off-the-shelf LM (i.e., GPT-2) and an image-text matching model (i.e., CLIP) for image-grounded text generation. During decoding, MAGIC influences the generation of the LM by introducing a CLIP-induced score, called magic score, which regularizes the generated result to be…
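The decoding idea described in the abstract can be illustrated with a small sketch. This is a simplified illustration and not the authors' implementation: it assumes the LM proposes a handful of candidate tokens, that an image-text matching model has already produced a similarity value for each candidate, and that the magic score is a softmax over those similarities added to the LM probability with a weight `beta`. The function names, the weight, and the toy numbers are all hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def pick_next_token(candidates, lm_probs, image_text_sims, beta=2.0):
    """Pick the candidate with the highest combined score.

    candidates:       tokens proposed by the LM (e.g. its top-k)
    lm_probs:         LM probability of each candidate
    image_text_sims:  assumed similarity between the image and the text
                      continued with each candidate (stand-in for CLIP output)
    beta:             assumed weight of the image-based term
    """
    if not (len(candidates) == len(lm_probs) == len(image_text_sims)):
        raise ValueError("all inputs must have the same length")
    # Assumed form of the magic score: softmax over the candidates' similarities.
    magic = softmax(image_text_sims)
    scores = [p + beta * m for p, m in zip(lm_probs, magic)]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores


# Toy numbers: the LM alone prefers "car", the image favours "dog".
tokens = ["dog", "car", "tree"]
probs = [0.2, 0.5, 0.3]
sims = [3.0, 0.5, 1.0]

print(pick_next_token(tokens, probs, sims, beta=2.0)[0])
print(pick_next_token(tokens, probs, sims, beta=0.0)[0])
```

With `beta=0.0` the choice falls back to the LM's own preference. A positive `beta` lets the image-based term override it, which is the sense in which the score steers generation toward the image without any gradient updates or training.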
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Artificial Intelligence in Games
