Improving Long-Text Alignment for Text-to-Image Diffusion Models
Luping Liu, Chao Du, Tianyu Pang, Zehan Wang, Chongxuan Li, Dong Xu

TL;DR
This paper introduces LongAlign, a method for improving long-text alignment in text-to-image diffusion models. It segments long texts for encoding and decomposes preference scores for training, which improves alignment and reduces overfitting.
Contribution
It proposes a novel segment-level encoding and a decomposed preference optimization approach to enhance long-text alignment in text-to-image diffusion models.
Findings
Outperforms existing models like PixArt-α and Kandinsky v2.2 in T2I alignment.
Addresses overfitting by reweighting preference score components.
Achieves superior results after 20 hours of fine-tuning.
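The reweighting idea in the findings above can be illustrated with a minimal sketch. It assumes a CLIP-style score computed as a dot product between a text embedding and an image embedding, and it assumes the text-irrelevant component lies along a direction shared by all text embeddings (approximated here by their normalized mean). The function names, the choice of common direction, and the `irrelevant_weight` parameter are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(a: Sequence[float]) -> List[float]:
    norm = math.sqrt(dot(a, a))
    return [x / norm for x in a]


def common_direction(text_embs: Sequence[Sequence[float]]) -> List[float]:
    """Assumption: the shared direction is the normalized mean text embedding."""
    dim = len(text_embs[0])
    mean = [sum(e[i] for e in text_embs) / len(text_embs) for i in range(dim)]
    return normalize(mean)


def decomposed_score(
    text_emb: Sequence[float],
    image_emb: Sequence[float],
    common: Sequence[float],
    irrelevant_weight: float = 1.0,
) -> Tuple[float, float, float]:
    """Split <text, image> into a text-relevant and a text-irrelevant term.

    The text embedding is projected onto `common`; the projection gives the
    text-irrelevant term and the orthogonal remainder gives the text-relevant
    term. Scaling the irrelevant term by `irrelevant_weight` also scales its
    gradient with respect to the image embedding.
    """
    coeff = dot(text_emb, common)
    relevant_part = [t - coeff * c for t, c in zip(text_emb, common)]
    relevant = dot(relevant_part, image_emb)
    irrelevant = coeff * dot(common, image_emb)
    return relevant + irrelevant_weight * irrelevant, relevant, irrelevant


# Toy 3-d embeddings, made up for illustration only.
texts = [[1.0, 0.2, 0.0], [0.9, -0.1, 0.3], [1.1, 0.0, -0.2]]
image = [0.5, 0.4, 0.1]
common = common_direction(texts)
total, relevant, irrelevant = decomposed_score(texts[0], image, common)
```

With `irrelevant_weight=1.0` the two terms sum back to the plain dot-product score; a weight below 1 reduces the influence of the shared component during fine-tuning.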
Abstract
The rapid advancement of text-to-image (T2I) diffusion models has enabled them to generate unprecedented results from given texts. However, as text inputs become longer, existing encoding methods like CLIP face limitations, and aligning the generated images with long texts becomes challenging. To tackle these issues, we propose LongAlign, which includes a segment-level encoding method for processing long texts and a decomposed preference optimization method for effective alignment training. For segment-level encoding, long texts are divided into multiple segments and processed separately. This method overcomes the maximum input length limits of pretrained encoding models. For preference optimization, we provide decomposed CLIP-based preference models to fine-tune diffusion models. Specifically, to utilize CLIP-based preference models for T2I alignment, we delve into their scoring…
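The segment-level encoding described in the abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the sentence-based splitting rule, the `max_words` budget, and the `toy_encoder` stand-in for a pretrained text encoder are all hypothetical.

```python
import re
from typing import Callable, List

Embedding = List[float]


def split_into_segments(text: str, max_words: int = 50) -> List[str]:
    """Pack whole sentences into segments of at most `max_words` words.

    Assumed rule for illustration; a single sentence longer than the
    budget is kept as its own (over-long) segment.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    segments: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            segments.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        segments.append(" ".join(current))
    return segments


def encode_long_text(
    text: str,
    encode_segment: Callable[[str], List[Embedding]],
    max_words: int = 50,
) -> List[Embedding]:
    """Encode each segment separately, then concatenate the token embeddings."""
    merged: List[Embedding] = []
    for segment in split_into_segments(text, max_words):
        merged.extend(encode_segment(segment))
    return merged


def toy_encoder(segment: str) -> List[Embedding]:
    """Stand-in for a length-limited pretrained encoder: one 2-d vector per word."""
    return [[float(len(word)), float(i)] for i, word in enumerate(segment.split())]


prompt = "A red fox sits. A blue bird flies. The sky is clear."
segments = split_into_segments(prompt, max_words=8)
embeddings = encode_long_text(prompt, toy_encoder, max_words=8)
```

Because each segment is encoded on its own, no single call exceeds the encoder's input limit, while the concatenated sequence still covers the full prompt.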
Peer Reviews
Decision · ICLR 2025 Poster
1. The motivation for using preference models is well-founded, and the paper is well-written. 2. It is interesting to identify two distinct focuses within preference models, and the analysis provided is both reasonable and thorough.
Weaknesses: 1. I am unsure why multiple <sot> tokens are retained; the choice of which tokens to retain or remove needs a more detailed explanation or analysis, as it is currently confusing. 2. It remains unclear whether reweighting leads to a noticeable difference in the aesthetic quality of the generated results (due to the text-irrelevant components). For Appendix B.1, it would be beneficial to provide some visualizations of the outcomes from the two loss functions. 3. Segmenting to
[1] It introduces a segment-level encoding strategy that effectively handles long text inputs by dividing and separately encoding segments, overcoming traditional model input limitations and enhancing text-to-image alignment. [2] The preference model is innovatively decomposed into text-relevant and text-irrelevant components, with a reweighting strategy to reduce overfitting and improve alignment precision. [3] The paper conducts extensive experiments, demonstrating significant improvements in
[1] The paper proposes a segment-level encoding strategy to handle long texts but does not thoroughly validate the performance of this strategy under different text length conditions. For very short or very long texts, can the segment-level encoding still maintain the same alignment effectiveness? The lack of fine-grained comparative experiments makes it difficult to adequately demonstrate the applicability of segment-level encoding across a wide range of text lengths. [2] The paper proposes a r
The paper tackles the crucial challenge of long prompt following in a very effective manner. Using a text encoder that can take the entire long prompt is a sound idea, and the Denscore preference model looks like a useful contribution in general. Apart from this, the reward fine-tuning with the orthogonal decomposition and the gradient reweighting looks like a good idea to deal with the "reward-hacking" problem. Finally, the results also appear quite strong from the evaluations presented in the
An important paper that is missing here is ELLA [Hu et al. 2024], for a couple of reasons. The first is that they propose replacing the CLIP encoder of SD1.5 with a T5-XL model and get significantly improved results (far superior numbers to those reported by LaVi-Bridge, whose MLP adapter is used here). Therefore, this model might be a valid comparison (although the training cost of ELLA is a bit higher: 7 days with 8 A100s for SD1.5). Alternatively, the adapter provided by ELLA would have probably
Taxonomy
Topics: Image Retrieval and Classification Techniques
Methods: Diffusion · Contrastive Language-Image Pre-training
