Instruct-Imagen: Image Generation with Multi-modal Instruction

Hexiang Hu; Kelvin C.K. Chan; Yu-Chuan Su; Wenhu Chen; Yandong Li; Kihyuk Sohn; Yang Zhao; Xue Ben; Boqing Gong; William Cohen; Ming-Wei Chang; Xuhui Jia

arXiv:2401.01952·cs.CV·January 5, 2024·1 cites

TL;DR

Instruct-Imagen introduces a multi-modal instruction framework for versatile image generation, enabling the model to understand and generate images across diverse and unseen tasks by leveraging natural language and external multimodal context.

Contribution

The paper presents a novel multi-modal instruction approach and a two-stage fine-tuning process that enhances a text-to-image diffusion model's ability to generalize across various image generation tasks.

Findings

01 Matches or surpasses prior models on in-domain tasks

02 Demonstrates strong generalization to unseen tasks

03 Grounds generation effectively on external multimodal context

Abstract

This paper presents instruct-imagen, a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks. We introduce *multi-modal instruction* for image generation, a task representation articulating a range of generation intents with precision. It uses natural language to amalgamate disparate modalities (e.g., text, edge, style, subject), such that abundant generation intents can be standardized in a uniform format. We then build instruct-imagen by fine-tuning a pre-trained text-to-image diffusion model with a two-stage framework. First, we adapt the model using retrieval-augmented training, to enhance the model's capability to ground its generation on external multimodal context. Subsequently, we fine-tune the adapted model on diverse image generation tasks that require vision-language understanding (e.g., subject-driven generation), each…
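To make the idea of a uniform instruction format concrete, the following is a minimal Python sketch of how a text instruction could point at accompanying multimodal context. It is an illustration only, not the paper's implementation: the `[ref#N]` placeholder syntax, the class names `ContextItem` and `MultiModalInstruction`, and the `image_path` field are all assumptions introduced here.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed placeholder syntax: "[ref#N]" in the text refers to the N-th
# context item (1-indexed). This convention is hypothetical.
REF_PATTERN = re.compile(r"\[ref#(\d+)\]")


@dataclass
class ContextItem:
    """One piece of multimodal context (hypothetical structure)."""
    modality: str    # e.g. "edge", "style", "subject"
    image_path: str  # location of the conditioning image


@dataclass
class MultiModalInstruction:
    """A natural-language instruction plus the context it refers to."""
    text: str
    context: List[ContextItem] = field(default_factory=list)

    def referenced_indices(self) -> List[int]:
        # Collect every [ref#N] index in order of appearance.
        return [int(n) for n in REF_PATTERN.findall(self.text)]

    def validate(self) -> None:
        # Every reference must point at an existing context item.
        for idx in self.referenced_indices():
            if not 1 <= idx <= len(self.context):
                raise ValueError(f"[ref#{idx}] has no matching context item")

    def resolve(self) -> List[Tuple[int, ContextItem]]:
        # Pair each reference with the context item it names.
        self.validate()
        return [(idx, self.context[idx - 1]) for idx in self.referenced_indices()]


instruction = MultiModalInstruction(
    text="Generate an image of [ref#1] in the style of [ref#2].",
    context=[
        ContextItem(modality="subject", image_path="dog.png"),
        ContextItem(modality="style", image_path="sketch.png"),
    ],
)
resolved = instruction.resolve()
```

Under these assumptions, a subject-driven task and a style-transfer task differ only in the instruction text and the attached context items, which is the sense in which heterogeneous generation intents share one format.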

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling

Methods: Diffusion