OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal Instruction
Leheng Li, Weichao Qiu, Xu Yan, Jing He, Kaiqiang Zhou, Yingjie Cai, Qing Lian, Bingbing Liu, Ying-Cong Chen

TL;DR
OmniBooth introduces a versatile image synthesis framework that allows precise spatial control and multi-modal customization using text and image guidance, significantly enhancing controllability and fidelity in image generation.
Contribution
The paper proposes latent control signals that unify spatial, textual, and image conditions, extending ControlNet for instance-level open-vocabulary generation and personalized identity control.
Findings
Improved image synthesis fidelity and alignment across tasks.
Enhanced controllability with multi-modal conditions.
Effective integration of spatial, textual, and image guidance.
Abstract
We present OmniBooth, an image generation framework that enables spatial control with instance-level multi-modal customization. For all instances, the multimodal instruction can be described through text prompts or image references. Given a set of user-defined masks and associated text or image guidance, our objective is to generate an image, where multiple objects are positioned at specified coordinates and their attributes are precisely aligned with the corresponding guidance. This approach significantly expands the scope of text-to-image generation, and elevates it to a more versatile and practical dimension in controllability. In this paper, our core contribution lies in the proposed latent control signals, a high-dimensional spatial feature that provides a unified representation to integrate the spatial, textual, and image conditions seamlessly. The text condition extends…
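To make the idea of a latent control signal concrete, here is a minimal sketch of one way to build a "high-dimensional spatial feature" from user-defined masks and per-instance guidance. It is an illustration under stated assumptions, not the paper's implementation: the function name `paint_latent_control`, the array shapes, and the rule that later instances overwrite earlier ones where masks overlap are all hypothetical. The per-instance embeddings are assumed to come from some text or image encoder that is not shown here.

```python
import numpy as np


def paint_latent_control(masks, embeddings):
    """Write each instance's embedding into the pixels covered by its mask.

    masks:      (N, H, W) boolean array, one mask per instance (assumed layout).
    embeddings: (N, C) float array, one guidance vector per instance, e.g. from
                a text or image encoder (assumed, not specified by the abstract).
    Returns a (C, H, W) float32 map; pixels outside every mask stay zero.
    """
    masks = np.asarray(masks, dtype=bool)
    embeddings = np.asarray(embeddings, dtype=np.float32)
    if masks.ndim != 3 or embeddings.ndim != 2:
        raise ValueError("expected masks (N, H, W) and embeddings (N, C)")
    if masks.shape[0] != embeddings.shape[0]:
        raise ValueError("need exactly one embedding per mask")

    _, height, width = masks.shape
    channels = embeddings.shape[1]
    control = np.zeros((channels, height, width), dtype=np.float32)

    # Hypothetical overlap rule: instances later in the list win.
    for mask, emb in zip(masks, embeddings):
        control[:, mask] = emb[:, None]  # broadcast the vector over masked pixels
    return control


# Two instances on a 4x4 canvas with 2-channel embeddings.
masks = np.zeros((2, 4, 4), dtype=bool)
masks[0, :2, :2] = True   # first instance: top-left block
masks[1, 2:, 2:] = True   # second instance: bottom-right block
embeddings = np.array([[1.0, 2.0], [3.0, 4.0]])
control = paint_latent_control(masks, embeddings)
```

The resulting map has the same spatial layout as the masks, so text-derived and image-derived vectors can share one representation that a ControlNet-style branch could consume. How the paper actually injects text and image features into this signal is not recoverable from the truncated abstract.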
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Handwritten Text Recognition Techniques
Methods: Sparse Evolutionary Training
