OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal Instruction

Leheng Li; Weichao Qiu; Xu Yan; Jing He; Kaiqiang Zhou; Yingjie Cai; Qing Lian; Bingbing Liu; Ying-Cong Chen

arXiv:2410.04932·cs.CV·October 8, 2024

TL;DR

OmniBooth introduces a versatile image synthesis framework that allows precise spatial control and multi-modal customization using text and image guidance, significantly enhancing controllability and fidelity in image generation.

Contribution

The paper proposes latent control signals that unify spatial, textual, and image conditions, extending ControlNet for instance-level open-vocabulary generation and personalized identity control.

Findings

01. Improved image synthesis fidelity and alignment across tasks.

02. Enhanced controllability with multi-modal conditions.

03. Effective integration of spatial, textual, and image guidance.

Abstract

We present OmniBooth, an image generation framework that enables spatial control with instance-level multi-modal customization. For all instances, the multimodal instruction can be described through text prompts or image references. Given a set of user-defined masks and associated text or image guidance, our objective is to generate an image, where multiple objects are positioned at specified coordinates and their attributes are precisely aligned with the corresponding guidance. This approach significantly expands the scope of text-to-image generation, and elevates it to a more versatile and practical dimension in controllability. In this paper, our core contribution lies in the proposed latent control signals, a high-dimensional spatial feature that provides a unified representation to integrate the spatial, textual, and image conditions seamlessly. The text condition extends…
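The idea of a spatial feature that carries per-instance guidance can be illustrated with a toy sketch. This is not the authors' implementation: the helper name `paint_latent_control`, the nested-list layout, and the hand-written 3-dimensional embeddings are all assumptions made for illustration, and the excerpt above does not specify how text or image embeddings are actually extracted or injected. The sketch only shows one plausible reading of "masks plus associated guidance": each instance's embedding is written into the pixels covered by its mask, giving a single feature map that holds both text-derived and image-derived conditions.

```python
from typing import List, Sequence


def paint_latent_control(
    masks: Sequence[Sequence[Sequence[bool]]],
    embeddings: Sequence[Sequence[float]],
    height: int,
    width: int,
) -> List[List[List[float]]]:
    """Build an H x W x C feature map by writing each instance's embedding
    into the pixels covered by its mask (hypothetical helper, not from the paper)."""
    if not embeddings or len(masks) != len(embeddings):
        raise ValueError("need one embedding per mask, and at least one instance")
    channels = len(embeddings[0])
    # Background pixels keep an all-zero feature vector.
    latent = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    for mask, emb in zip(masks, embeddings):
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    # Later instances overwrite earlier ones where masks overlap.
                    latent[y][x] = [float(v) for v in emb]
    return latent


# Toy example: two instances on a 4 x 4 canvas with 3-dimensional embeddings.
H, W = 4, 4
mask_a = [[y < 2 and x < 2 for x in range(W)] for y in range(H)]    # top-left block
mask_b = [[y >= 2 and x >= 2 for x in range(W)] for y in range(H)]  # bottom-right block
emb_a = [1.0, 0.0, 0.0]  # stand-in for a text-prompt embedding (assumed values)
emb_b = [0.0, 1.0, 0.0]  # stand-in for an image-reference embedding (assumed values)

latent = paint_latent_control([mask_a, mask_b], [emb_a, emb_b], H, W)
print(latent[0][0], latent[3][3], latent[0][3])
```

In a real system the embeddings would come from learned encoders and the resulting map would condition a diffusion model (the page notes the method extends ControlNet); the overwrite-on-overlap rule here is simply the easiest choice for a sketch.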

Code & Models

Models

🤗 lilelife/OmniBooth (model · ♡ 2)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Handwritten Text Recognition Techniques

Methods: Sparse Evolutionary Training