Instruct-Imagen: Image Generation with Multi-modal Instruction
Hexiang Hu, Kelvin C.K. Chan, Yu-Chuan Su, Wenhu Chen, Yandong Li, Kihyuk Sohn, Yang Zhao, Xue Ben, Boqing Gong, William Cohen, Ming-Wei Chang, Xuhui Jia

TL;DR
Instruct-Imagen introduces a multi-modal instruction framework for versatile image generation. By combining natural language with external multimodal context, it lets a single model follow generation instructions across diverse tasks, including tasks it was not trained on.
Contribution
The paper presents a novel multi-modal instruction approach and a two-stage fine-tuning process that enhances a text-to-image diffusion model's ability to generalize across various image generation tasks.
Findings
Matches or surpasses prior models in in-domain tasks
Demonstrates strong generalization to unseen tasks
Effective grounding on external multimodal context
Abstract
This paper presents instruct-imagen, a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks. We introduce *multi-modal instruction* for image generation, a task representation articulating a range of generation intents with precision. It uses natural language to amalgamate disparate modalities (e.g., text, edge, style, subject, etc.), such that abundant generation intents can be standardized in a uniform format. We then build instruct-imagen by fine-tuning a pre-trained text-to-image diffusion model with a two-stage framework. First, we adapt the model using retrieval-augmented training to enhance the model's capabilities to ground its generation on external multimodal context. Subsequently, we fine-tune the adapted model on diverse image generation tasks that require vision-language understanding (e.g., subject-driven generation, etc.), each…
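The idea of a uniform multi-modal instruction format can be illustrated with a small data structure. The sketch below is not the authors' implementation: the class names, the `[ref#N]` marker syntax, and the use of file paths as image payloads are all assumptions made for illustration. It only shows how a text instruction might refer to a list of multimodal context items, and how one could check that every reference resolves.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Hypothetical marker syntax for this sketch, e.g. "[ref#1]".
REF_PATTERN = re.compile(r"\[ref#(\d+)\]")


@dataclass
class ContextItem:
    """One piece of multimodal context (names are illustrative)."""
    modality: str  # e.g. "edge", "style", "subject"
    payload: str   # stand-in for image data; a file path in this sketch


@dataclass
class MultiModalInstruction:
    """A text instruction plus the context items it refers to."""
    text: str
    context: List[ContextItem] = field(default_factory=list)

    def referenced_indices(self) -> List[int]:
        # 1-based indices of the markers, in order of appearance in the text.
        return [int(m) for m in REF_PATTERN.findall(self.text)]

    def is_well_formed(self) -> bool:
        # Every marker must point at an existing context item,
        # and every context item must be mentioned at least once.
        refs = set(self.referenced_indices())
        return refs == set(range(1, len(self.context) + 1))


# Example: a subject image and a style image combined in one instruction.
instr = MultiModalInstruction(
    text="Generate an image of [ref#1] in the style of [ref#2].",
    context=[
        ContextItem(modality="subject", payload="dog.png"),
        ContextItem(modality="style", payload="watercolor.png"),
    ],
)
```

Under these assumptions, a single task format covers subject-driven, style-driven, or edge-conditioned generation simply by changing the instruction text and the attached context items.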
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: Diffusion
