UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion
Wei Li, Xue Xu, Jiachen Liu, Xinyan Xiao

TL;DR
UNIMO-G introduces a unified multimodal diffusion framework that effectively generates high-fidelity images from complex prompts combining text and visual inputs, advancing both text-driven and subject-driven image synthesis.
Contribution
The paper presents a multimodal conditional diffusion model, trained in two stages, that generates images from prompts mixing text and visual inputs.
Findings
UNIMO-G performs strongly on both text-to-image generation and zero-shot subject-driven synthesis.
It generates high-fidelity images from complex multimodal prompts.
It handles prompts that contain multiple image entities.
Abstract
Existing text-to-image diffusion models primarily generate images from text prompts. However, the inherent conciseness of textual descriptions poses challenges in faithfully synthesizing images with intricate details, such as specific entities or scenes. This paper presents UNIMO-G, a simple multimodal conditional diffusion framework that operates on multimodal prompts with interleaved textual and visual inputs, which demonstrates a unified ability for both text-driven and subject-driven image generation. UNIMO-G comprises two core components: a Multimodal Large Language Model (MLLM) for encoding multimodal prompts, and a conditional denoising diffusion network for generating images based on the encoded multimodal input. We leverage a two-stage training strategy to effectively train the framework: firstly pre-training on large-scale text-image pairs to develop conditional image…
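The two-component data flow described in the abstract (encode an interleaved prompt, then denoise conditioned on that encoding) can be illustrated with a toy sketch. Everything named here is hypothetical and not taken from the paper: `Text`, `Image`, `encode_prompt`, `predict_noise`, `generate`, and `EMBED_DIM` are illustrative stand-ins. The encoder is a deterministic placeholder rather than an MLLM, and the update rule is a toy rather than a real diffusion sampler.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Union

EMBED_DIM = 4  # hypothetical embedding width, chosen only for illustration


@dataclass
class Text:
    content: str


@dataclass
class Image:
    pixels: List[float]  # toy flattened image


# An interleaved multimodal prompt: text and image segments in order.
Prompt = List[Union[Text, Image]]


def encode_prompt(prompt: Prompt) -> List[List[float]]:
    """Placeholder for the MLLM encoder: one embedding per segment."""
    embeddings = []
    for segment in prompt:
        if isinstance(segment, Text):
            values = [float(ord(ch)) for ch in segment.content]
        else:
            values = list(segment.pixels)
        mean = sum(values) / len(values) if values else 0.0
        embeddings.append([math.sin(mean * (k + 1)) for k in range(EMBED_DIM)])
    return embeddings


def predict_noise(latent: List[float], condition: List[List[float]], t: int) -> List[float]:
    """Placeholder for the conditional denoising network (not a trained model).

    The timestep t is accepted for interface shape but unused by this toy rule.
    """
    if condition:
        pooled = sum(sum(vec) for vec in condition) / (len(condition) * EMBED_DIM)
    else:
        pooled = 0.0
    # Toy rule: treat the offset from a condition-dependent target as "noise".
    return [x - pooled for x in latent]


def generate(prompt: Prompt, steps: int = 10, size: int = 8, seed: int = 0) -> List[float]:
    """Encode the prompt once, then iteratively denoise a random latent."""
    rng = random.Random(seed)
    condition = encode_prompt(prompt)
    latent = [rng.gauss(0.0, 1.0) for _ in range(size)]
    for t in range(steps, 0, -1):
        noise = predict_noise(latent, condition, t)
        # Toy update: remove a 1/t fraction of the predicted noise each step.
        latent = [x - n / t for x, n in zip(latent, noise)]
    return latent


prompt = [Text("a photo of"), Image([0.1, 0.5, 0.9]), Text("on a beach")]
sample = generate(prompt, steps=5, size=6, seed=1)
```

The point of the sketch is the structure: the prompt is encoded once, and the same condition is passed to the denoiser at every step. In the actual framework, both placeholders would be learned networks.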
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
Methods: Diffusion
