P+: Extended Textual Conditioning in Text-to-Image Generation
Andrey Voynov, Qinghao Chu, Daniel Cohen-Or, Kfir Aberman

TL;DR
This paper introduces P+, an extended textual conditioning space for text-to-image models that enables greater control and personalization, along with a new inversion method, Extended Textual Inversion (XTI), that is more expressive and converges faster than the original Textual Inversion (TI).
Contribution
The paper proposes P+ for enhanced control in text-to-image generation and introduces XTI, an inversion technique that is more expressive and converges faster than the original Textual Inversion (TI).
Findings
P+ provides better disentangling and control over image synthesis.
XTI achieves more precise and faster inversion than traditional Textual Inversion.
The method enables novel object-style mixing in generated images.
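The object-style mixing finding can be illustrated with a minimal sketch of per-layer prompt routing. A reviewer comment on this page summarizes the underlying observation as inner layers contributing to shape and outer layers to appearance. Everything named here is a hypothetical illustration, not the authors' code: the layer count of 16, the function names, and the choice of which indices count as "inner" are all assumptions.

```python
from typing import Dict, Iterable

# Assumption: the denoising U-Net exposes 16 text-conditioned layers.
NUM_LAYERS = 16


def uniform_conditioning(prompt: str, num_layers: int = NUM_LAYERS) -> Dict[int, str]:
    """Standard conditioning: every layer receives the same prompt."""
    return {layer: prompt for layer in range(num_layers)}


def mixed_conditioning(
    shape_prompt: str,
    appearance_prompt: str,
    inner_layers: Iterable[int],
    num_layers: int = NUM_LAYERS,
) -> Dict[int, str]:
    """Extended conditioning: inner layers get the shape prompt,
    all remaining (outer) layers get the appearance prompt."""
    inner = set(inner_layers)
    out_of_range = sorted(i for i in inner if not 0 <= i < num_layers)
    if out_of_range:
        raise ValueError(f"layer indices out of range: {out_of_range}")
    return {
        layer: shape_prompt if layer in inner else appearance_prompt
        for layer in range(num_layers)
    }


# Hypothetical split: treat layers 5..10 as the inner (coarse) layers.
conditions = mixed_conditioning(
    shape_prompt="a photo of a cat",
    appearance_prompt="a marble statue",
    inner_layers=range(5, 11),
)
```

In a real pipeline, each prompt would be encoded by the text encoder and the resulting embedding passed to the matching layer; the dictionary above only captures the routing decision.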
Abstract
We introduce an Extended Textual Conditioning space in text-to-image models, referred to as P+. This space consists of multiple textual conditions, derived from per-layer prompts, each corresponding to a layer of the denoising U-net of the diffusion model. We show that the extended space provides greater disentangling and control over image synthesis. We further introduce Extended Textual Inversion (XTI), where the images are inverted into P+, and represented by per-layer tokens. We show that XTI is more expressive and precise, and converges faster than the original Textual Inversion (TI) space. The extended inversion method does not involve any noticeable trade-off between reconstruction and editability and induces more regular inversions. We conduct a series of extensive experiments to analyze and understand the properties of the new space, and to showcase the effectiveness…
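The inversion into per-layer tokens can be written, by analogy with the standard Textual Inversion denoising objective, roughly as follows. This is a sketch under assumed notation rather than the paper's exact equation: $n$ is the number of conditioned layers, $p_i$ the token embedding fed to layer $i$, $\varepsilon_\theta$ the frozen denoiser, and $x_t$ the noised input at timestep $t$.

```latex
% Assumed notation; the abstract above does not state the objective explicitly.
\[
  \{p_1^{*},\dots,p_n^{*}\}
  = \arg\min_{p_1,\dots,p_n}\;
    \mathbb{E}_{x,\;\varepsilon\sim\mathcal{N}(0,I),\;t}
    \left[\,\bigl\|\varepsilon - \varepsilon_\theta(x_t,\,t,\,p_1,\dots,p_n)\bigr\|_2^2\,\right]
\]
```

Under this form, constraining $p_1 = \dots = p_n$ reduces the problem to optimizing a single shared token, which is the Textual Inversion setting.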
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
- The proposed method is simple and effective.
- XTI seems effective from the experimental results under different settings. It also enables the application of style mixing.
- The writing is good and clear.
- The novelty can be limited. I don't see the novelty of the P+ definition or why spending paragraphs describing the P+ space is important. While the observations are interesting, I don't see novel insights introduced with P+. In fact, I think a similar observation that outer layers influence high-frequency appearance and inner layers contribute to low-frequency shape was introduced in some previous studies like Prompt-to-Prompt.
- While the authors have argued about the unfair comparison betwe
- The observation that different cross-attention layers capture distinct attributes of an image is interesting and can be useful for understanding the decision making of text-to-image diffusion models.
- The paper is extremely well written, so good job to the authors!
- The qualitative and quantitative improvements over Textual Inversion are significant (Fig. 6 and Table 1). The authors also provide a trick to reduce the optimization steps from Textual Inversion to improve the efficiency of the fine-tun
- My primary concern with the paper is that it does not compare thoroughly against other baselines. Although other methods fine-tune some part of the diffusion model (and are expensive), the authors should present all the results and the corresponding running times to provide the complete picture. Methods the authors should compare against in full include: (i) Custom Diffusion (https://arxiv.org/abs/2212.04488); (ii) ELITE (https://arxiv.org/pdf/2302.13848.pdf);
- How can this method be us
The introduced Extended Textual Conditioning space allows for more nuanced and controllable text-to-image synthesis. The Extended Textual Inversion (XTI) method, which improves upon the existing Textual Inversion (TI) technique, is novel and provides faster convergence and better quality. The paper demonstrates groundbreaking results in object-appearance mixing through the use of the newly introduced P+ space.
The computational cost of XTI needs to be compared with other embedding inversion techniques. How does the inference cost compare with standard textual conditioning? I assume it is the same. More analysis and visualization can be done for different cross-attention layers. For example, what will happen if we provide shape textual embeddings to a layer with a spatial resolution of 32? I am curious about the sensitivity and effects of different layers.
- The paper is well written and easy to follow, with sufficient literature references and a reasonable and intuitive design of the extended controlling space.
- The findings illustrated in Fig. 3 on per-layer prompting are interesting.
- The XTI section is a bit confusing. The proposed XTI reconstruction loss (this equation also has no number, which makes it difficult to refer to) seems to be a stepwise loss. Does this mean the extended space is constructed/optimized at every diffusion step of the T2I model?
- Following the first point, while this operation is intuitive, as many existing editing methods follow the step-wise iterative paradigm, it is worth doing some ablations/analytical experiments on this particular operation
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Computer Graphics and Visualization Techniques · Advanced Vision and Imaging
Methods: Concatenated Skip Connection · Convolution · Max Pooling · U-Net · Diffusion
