Learning to Customize Text-to-Image Diffusion In Diverse Context
Taewook Kim, Wei Chen, Qiang Qiu

TL;DR
This paper introduces a simple, cost-effective method to improve text-to-image model customization by diversifying textual prompts, enhancing semantic alignment and prompt fidelity without architectural changes.
Contribution
The authors propose a novel approach that diversifies personal concept context solely in textual prompts, significantly boosting customization quality across multiple baseline methods.
Findings
Improved CLIP scores across different customization methods.
Enhanced semantic alignment in textual and image spaces.
Method is compatible with existing customization techniques.
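CLIP-based text-image alignment scores are generally computed as the cosine similarity between an image embedding and a text embedding. The sketch below shows only that arithmetic, with toy vectors standing in for real CLIP embeddings. The function names and the averaging over pairs are assumptions for illustration, since this summary does not state the paper's exact evaluation protocol.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def mean_alignment(image_embs, text_embs):
    """Average similarity over paired (image, prompt) embeddings.

    Hypothetical helper: real embeddings would come from a CLIP encoder.
    """
    pairs = list(zip(image_embs, text_embs))
    return sum(cosine_similarity(i, t) for i, t in pairs) / len(pairs)

# Toy 2-D embeddings: one matching pair and one orthogonal pair.
images = [[1.0, 0.0], [1.0, 0.0]]
prompts = [[1.0, 0.0], [0.0, 1.0]]
print(mean_alignment(images, prompts))
```

A higher average means the generated images sit closer to their prompts in the embedding space, which is the sense in which the findings report improved scores.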
Abstract
Most text-to-image customization techniques fine-tune models on a small set of personal concept images captured in minimal contexts. This often results in the model becoming overfitted to these training images and unable to generalize to new contexts in future text prompts. Existing customization methods are built on the success of effectively representing personal concepts as textual embeddings. Thus, in this work, we resort to diversifying the context of these personal concepts solely within the textual space by simply creating a contextually rich set of text prompts, together with a widely used self-supervised learning objective. Surprisingly, this straightforward and cost-effective method significantly improves semantic alignment in the textual space, and this effect further extends to the image space, resulting in higher prompt fidelity for generated images.…
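The abstract names two ingredients: a contextually rich set of text prompts built around the personal concept, and a self-supervised objective, which the reviews identify as masked language modeling (MLM). A minimal sketch of how such training inputs might be prepared is given here in plain Python. The placeholder token `<new1>`, the templates, the masking rate, and the choice never to mask the concept token are all illustrative assumptions, not details taken from the paper.

```python
import random

PLACEHOLDER = "<new1>"  # hypothetical concept token, not from the paper
MASK = "[MASK]"         # BERT-style mask symbol, assumed here

# Hypothetical context templates; a real prompt set would be far larger.
TEMPLATES = [
    "a photo of {c} on a beach",
    "{c} sitting next to a red bicycle",
    "a watercolor painting of {c} in a forest",
    "{c} wearing a hat at a birthday party",
]

def build_prompts(concept=PLACEHOLDER, templates=TEMPLATES):
    """Place the concept token into each context template."""
    return [t.format(c=concept) for t in templates]

def mask_prompt(prompt, rng, mask_prob=0.15, keep=PLACEHOLDER):
    """Mask context words at random; return (masked_tokens, labels).

    labels[i] holds the original word where a mask was applied, else None.
    The concept token is never masked (an assumption of this sketch).
    """
    masked, labels = [], []
    for tok in prompt.split():
        if tok != keep and rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

# Demo: print each prompt with roughly 30% of its context words masked.
rng = random.Random(0)
for p in build_prompts():
    tokens, labels = mask_prompt(p, rng, mask_prob=0.3)
    print(" ".join(tokens))
```

In an MLM setup, a text encoder would be trained to recover the words stored in `labels` from the masked sequence, so that the concept token is learned alongside many different contexts. No images are needed for this step, which matches the abstract's claim that the diversification happens solely in the textual space.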
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
- The paper is generally well-organized and provides clear explanations of the method, including detailed descriptions of the MLM application.
- The proposed approach requires no architectural modifications and avoids generating new image pairs, reducing computational costs.
The main weakness is that the experimental evaluation is not convincing.
- Missing evaluation of the quality of the generated personal concept: image similarity between the generated personal concept and the reference personal concept has to be evaluated.
- The selected personal concepts are fairly weak in the sense that most concepts do not have “unique” personal characteristics. Instead, most of them have been learned during SD model pretraining already. More unique personal concepts should be e
- In terms of originality, although masked language modeling (MLM) is nothing new, it is somewhat novel to use it in the customization of T2I generation.
- The writing of this paper is generally clear. The method is described clearly and is easy to understand, although a few passages are unclear (see weaknesses).
- The proposed method enhances the alignment between the generated images and the text prompts. From the qualitative results, I can see that the method results in better alignment. The quant
- The subject fidelity metrics (DINO and DINO-FG) of the proposed method are worse than the baselines, including DreamBooth (DB) and CustomDiffusion (CD). This suggests that the proposed method may increase text-image alignment (CLIP-T) at the cost of reduced subject fidelity.
- I found the theories unconvincing. The authors want to use Proposition 1 to prove that the distance between the context tokens and concept tokens is bounded by a small value $\delta_V$. Firstly, I do not understand why the value ma
The authors have developed a framework for generating personalized images that diversifies the context of personal concepts using masked language modeling in order to address concept overfitting. The paper provides experimental results, including both quantitative and qualitative assessments, showcasing the superior performance of the framework. The results clearly highlight the effectiveness of the proposed method in facilitating personalized image generation.
In Figure 3, the method utilizes a Masked Language Modeling module to enhance the identity details. However, it is unclear how it would perform with fine-grained subjects (two dogs or two cats of different breeds). Also, I wonder how this module would work as the number of concepts increases. Clarification is needed on whether the module can effectively manage such fine distinctions and multiple diverse subjects. Recent methodologies [1, 2] have demonstrated the capability to learn multi-concept per
1. The proposed method was shown to be robust, and the performance is competitive, showing the effectiveness of the designs.
2. The writing of this paper is easy to follow.
1. The proposed method does not outperform DB and CD in terms of DINO and DINO-FG metrics, which weakens the evidence for its effectiveness.
2. The proposed method may introduce additional visual content that the prompt did not ask for. Examples are (Row 4, CD vs. Ours) and (Row 5, DB vs. Ours), where new visual elements (e.g., leaves or a tree) appear in the images generated by the proposed method. This diminishes the effectiveness of the proposed method.
Taxonomy
Topics: Mathematics, Computing, and Information Processing · Natural Language Processing Techniques
Methods: Contrastive Language-Image Pre-training · Sparse Evolutionary Training
