UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion

Wei Li; Xue Xu; Jiachen Liu; Xinyan Xiao

arXiv:2401.13388 · cs.CV · June 7, 2024 · 1 citation

Open Access

TL;DR

UNIMO-G introduces a unified multimodal diffusion framework that effectively generates high-fidelity images from complex prompts combining text and visual inputs, advancing both text-driven and subject-driven image synthesis.

Contribution

It presents a novel multimodal conditional diffusion model with a two-stage training process for unified image generation from multimodal prompts.

Findings

01. Excels in text-to-image and zero-shot subject-driven synthesis.

02. Generates high-fidelity images from complex multimodal prompts.

03. Handles multiple image entities in prompts effectively.

Abstract

Existing text-to-image diffusion models primarily generate images from text prompts. However, the inherent conciseness of textual descriptions poses challenges in faithfully synthesizing images with intricate details, such as specific entities or scenes. This paper presents UNIMO-G, a simple multimodal conditional diffusion framework that operates on multimodal prompts with interleaved textual and visual inputs, which demonstrates a unified ability for both text-driven and subject-driven image generation. UNIMO-G comprises two core components: a Multimodal Large Language Model (MLLM) for encoding multimodal prompts, and a conditional denoising diffusion network for generating images based on the encoded multimodal input. We leverage a two-stage training strategy to effectively train the framework: firstly pre-training on large-scale text-image pairs to develop conditional image…
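
The two components named in the abstract can be pictured with a small toy sketch: an encoder turns an interleaved text and image prompt into a sequence of vectors, and a denoising loop conditions on that sequence. Everything below is an illustrative assumption rather than the authors' implementation: the names `Text`, `Image`, `encode_prompt`, `cross_attend`, and `generate` are hypothetical, the encoder is a hand-written stand-in for the MLLM, the noise predictor is a closed-form stand-in for a trained network, and the deterministic DDIM-style update is a swapped-in sampler that the abstract does not specify.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Union

DIM = 4  # toy embedding width (assumption, not from the paper)

@dataclass
class Text:
    content: str

@dataclass
class Image:
    pixels: List[float]  # flattened toy image

Segment = Union[Text, Image]

def encode_prompt(prompt: List[Segment]) -> List[List[float]]:
    """Stand-in for the MLLM encoder: one vector per segment, order preserved."""
    vectors = []
    for seg in prompt:
        if isinstance(seg, Text):
            values = [ord(ch) / 128.0 for ch in seg.content]
        else:
            values = list(seg.pixels)
        n = max(len(values), 1)
        vectors.append([sum(values[i::DIM]) / n for i in range(DIM)])
    return vectors

def cross_attend(query: List[float], context: List[List[float]]) -> List[float]:
    """Softmax-weighted average of context vectors, scored against the query."""
    scores = [sum(q * c for q, c in zip(query, vec)) / math.sqrt(DIM) for vec in context]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w / total * vec[j] for w, vec in zip(weights, context)) for j in range(DIM)]

def generate(prompt: List[Segment], steps: int = 10, seed: int = 0) -> List[float]:
    """Deterministic DDIM-style sampling with a hand-written noise predictor."""
    rng = random.Random(seed)
    context = encode_prompt(prompt)
    # Toy cumulative noise schedule; alpha_bar[0] == 1 means "no noise left".
    alpha_bar = [1.0 - t / (steps + 1) for t in range(steps + 1)]
    x = [rng.gauss(0.0, 1.0) for _ in range(DIM)]  # start from pure noise
    for t in range(steps, 0, -1):
        a_t, a_prev = alpha_bar[t], alpha_bar[t - 1]
        target = cross_attend(x, context)  # the multimodal conditioning enters here
        # Toy predictor: the noise that would explain x if the clean sample were `target`.
        eps = [(xi - math.sqrt(a_t) * ti) / math.sqrt(1.0 - a_t) for xi, ti in zip(x, target)]
        # Estimate the clean sample, then step to the previous noise level.
        x0 = [(xi - math.sqrt(1.0 - a_t) * ei) / math.sqrt(a_t) for xi, ei in zip(x, eps)]
        x = [math.sqrt(a_prev) * x0i + math.sqrt(1.0 - a_prev) * ei for x0i, ei in zip(x0, eps)]
    return x

# Interleaved prompt: text, an image entity, then more text.
prompt = [Text("a photo of"), Image([0.2, 0.4, 0.6, 0.8]), Text("on a beach")]
latent = generate(prompt)
```

With this closed-form predictor the loop ends at an attention-weighted blend of the segment embeddings, which makes the conditioning path easy to inspect. In a real system a trained MLLM and a trained denoising network would replace both stand-ins.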

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning

Methods: Diffusion