MultiBooth: Towards Generating All Your Concepts in an Image from Text
Chenyang Zhu, Kai Li, Yue Ma, Chunming He, Xiu Li

TL;DR
MultiBooth is a new method for multi-concept image generation from text that improves fidelity and efficiency by dividing the process into learning individual concepts and integrating them with bounding boxes.
Contribution
The paper introduces a two-phase approach for diffusion models: single-concept learning with a multi-modal image encoder, followed by multi-concept integration guided by bounding boxes.
Findings
Outperforms baselines in qualitative evaluations
Achieves higher concept fidelity
Reduces inference cost
Abstract
This paper introduces MultiBooth, a novel and efficient technique for multi-concept customization in image generation from text. Despite the significant advancements in customized generation methods, particularly with the success of diffusion models, existing methods often struggle with multi-concept scenarios due to low concept fidelity and high inference cost. MultiBooth addresses these issues by dividing the multi-concept generation process into two phases: a single-concept learning phase and a multi-concept integration phase. During the single-concept learning phase, we employ a multi-modal image encoder and an efficient concept encoding technique to learn a concise and discriminative representation for each concept. In the multi-concept integration phase, we use bounding boxes to define the generation area for each concept within the cross-attention map. This method enables the…
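The integration phase described above confines each concept to a bounding box within the cross-attention map. The sketch below illustrates that general idea only and is not the authors' implementation: it assumes each prompt token has one H x W attention map, that boxes are given as normalized (x0, y0, x1, y1) corners, and that attention outside a concept's box is simply zeroed. The names `box_mask` and `restrict_to_boxes`, and the token label `<dog>`, are hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized to [0, 1]
Map = List[List[float]]                  # one H x W attention map


def box_mask(height: int, width: int, box: Box) -> Map:
    """Return an H x W mask that is 1.0 inside the box and 0.0 outside."""
    x0, y0, x1, y1 = box
    # Convert normalized corners to half-open row/column index ranges.
    c0, c1 = int(round(x0 * width)), int(round(x1 * width))
    r0, r1 = int(round(y0 * height)), int(round(y1 * height))
    return [
        [1.0 if (r0 <= r < r1 and c0 <= c < c1) else 0.0 for c in range(width)]
        for r in range(height)
    ]


def restrict_to_boxes(attn: Dict[str, Map], boxes: Dict[str, Box]) -> Dict[str, Map]:
    """Zero each concept token's attention outside its assigned box.

    Assumption: tokens without a box (ordinary prompt words) are copied unchanged.
    """
    out: Dict[str, Map] = {}
    for token, amap in attn.items():
        if token not in boxes:
            out[token] = [row[:] for row in amap]
            continue
        height, width = len(amap), len(amap[0])
        mask = box_mask(height, width, boxes[token])
        out[token] = [
            [a * m for a, m in zip(arow, mrow)]
            for arow, mrow in zip(amap, mask)
        ]
    return out
```

For example, calling `restrict_to_boxes` with a uniform 4 x 4 map for `<dog>` and the box `(0.0, 0.0, 0.5, 1.0)` keeps the left two columns of that map and zeroes the right two, while maps for tokens without a box pass through untouched.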
Taxonomy
Topics: Image Retrieval and Classification Techniques
Methods: Diffusion
