Improving Long-Text Alignment for Text-to-Image Diffusion Models

Luping Liu; Chao Du; Tianyu Pang; Zehan Wang; Chongxuan Li; Dong Xu

arXiv:2410.11817 · cs.CV · March 4, 2025

TL;DR

This paper introduces LongAlign, a method for aligning text-to-image diffusion models with long text inputs. It encodes long texts segment by segment and decomposes preference scores for alignment training, which improves alignment and reduces overfitting.

Contribution

It proposes a novel segment-level encoding and a decomposed preference optimization approach to enhance long-text alignment in text-to-image diffusion models.

Findings

01

Outperforms existing models like PixArt-α and Kandinsky v2.2 in T2I alignment.

02

Addresses overfitting by reweighting preference score components.

03

Achieves these results after about 20 hours of fine-tuning Stable Diffusion v1.5.
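
The reweighting in finding 02 can be illustrated with a minimal sketch. It assumes a CLIP-style score that is the dot product of a text embedding and an image embedding, and it assumes the text-irrelevant component is the projection of the text embedding onto a single shared direction (`common_dir`, a hypothetical input). The function names, the toy vectors, and the `weight` parameter are illustrative and are not taken from the paper's code.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Vector) -> Vector:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def decompose(text_emb: Vector, common_dir: Vector) -> Tuple[Vector, Vector]:
    """Split text_emb into two orthogonal parts.

    Assumption: the text-irrelevant part is the projection onto common_dir,
    and the text-relevant part is the remainder.
    """
    unit = normalize(common_dir)
    coeff = dot(text_emb, unit)
    irrelevant = [coeff * x for x in unit]
    relevant = [t - i for t, i in zip(text_emb, irrelevant)]
    return relevant, irrelevant


def reweighted_score(text_emb: Vector, image_emb: Vector,
                     common_dir: Vector, weight: float) -> float:
    """Score with the text-irrelevant term scaled by `weight`.

    weight = 1.0 recovers the plain dot-product score.
    """
    relevant, irrelevant = decompose(text_emb, common_dir)
    return dot(relevant, image_emb) + weight * dot(irrelevant, image_emb)
```

Because the score is linear in the image embedding, scaling the text-irrelevant term by `weight` also scales its gradient by the same factor, which is one simple way to limit how much that term drives fine-tuning.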

Abstract

The rapid advancement of text-to-image (T2I) diffusion models has enabled them to generate unprecedented results from given texts. However, as text inputs become longer, existing encoding methods like CLIP face limitations, and aligning the generated images with long texts becomes challenging. To tackle these issues, we propose LongAlign, which includes a segment-level encoding method for processing long texts and a decomposed preference optimization method for effective alignment training. For segment-level encoding, long texts are divided into multiple segments and processed separately. This method overcomes the maximum input length limits of pretrained encoding models. For preference optimization, we provide decomposed CLIP-based preference models to fine-tune diffusion models. Specifically, to utilize CLIP-based preference models for T2I alignment, we delve into their scoring…
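
The segment-level encoding described in the abstract can be sketched as follows. This is a toy illustration under stated assumptions: the sentence-splitting rule, the `toy_encoder` stand-in (which truncates to `max_tokens` whitespace tokens, mimicking a length-limited encoder), and all function names are hypothetical. A real pipeline would call a pretrained text encoder in place of `toy_encoder`.

```python
import re
from typing import Callable, List

Vector = List[float]


def split_into_segments(text: str) -> List[str]:
    """Split a long prompt into sentence-like segments (assumed rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def toy_encoder(segment: str, max_tokens: int = 8, dim: int = 4) -> List[Vector]:
    """Hypothetical stand-in for a length-limited pretrained encoder.

    Keeps at most max_tokens whitespace tokens and returns one
    placeholder vector per kept token.
    """
    tokens = segment.split()[:max_tokens]
    return [[float((len(tok) + i) % 7) for i in range(dim)] for tok in tokens]


def encode_long_text(
    text: str,
    encoder: Callable[[str], List[Vector]] = toy_encoder,
) -> List[Vector]:
    """Encode each segment separately, then concatenate along the sequence axis."""
    embeddings: List[Vector] = []
    for segment in split_into_segments(text):
        embeddings.extend(encoder(segment))
    return embeddings


prompt = ("A red fox sits on a mossy log. "
          "Behind it, a waterfall drops into a clear pool. "
          "Mist rises.")
whole = toy_encoder(prompt)          # truncated at the encoder's limit
segmented = encode_long_text(prompt)  # every segment contributes tokens
```

In this toy setup, encoding the whole prompt at once keeps only the first 8 tokens, while the segmented version keeps tokens from all three sentences, which is the limitation the segment-level method is meant to address.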

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

1. The motivation for using preference models is well-founded, and the paper is well-written. 2. It is interesting to identify two distinct focuses within preference models, and the analysis provided is both reasonable and thorough.

Weaknesses

1. I am unsure why multiple <sot> tokens are retained; regarding the retention or removal of tokens, a more detailed explanation or analysis is needed, as it currently leaves me confused. 2. After reweighting, whether there will be a noticeable difference in the aesthetic quality of the generated results (due to text-irrelevant components) remains unclear. For Appendix B.1, it would be beneficial to provide some visualizations of the outcomes from the two loss functions. 3. Segmenting to

Reviewer 02 · Rating 3 · Confidence 4

Strengths

[1] It introduces a segment-level encoding strategy that effectively handles long text inputs by dividing and separately encoding segments, overcoming traditional model input limitations and enhancing text-to-image alignment. [2] The preference model is innovatively decomposed into text-relevant and text-irrelevant components, with a reweighting strategy to reduce overfitting and improve alignment precision. [3] The paper conducts extensive experiments, demonstrating significant improvements in

Weaknesses

[1] The paper proposes a segment-level encoding strategy to handle long texts but does not thoroughly validate the performance of this strategy under different text length conditions. For very short or very long texts, can the segment-level encoding still maintain the same alignment effectiveness? The lack of fine-grained comparative experiments makes it difficult to adequately demonstrate the applicability of segment-level encoding across a wide range of text lengths. [2] The paper proposes a r

Reviewer 03 · Rating 6 · Confidence 4

Strengths

The paper tackles the crucial challenge of long prompt following in a very effective manner. Using a text encoder that can take the entire long prompt is a sound idea, and the Denscore preference model looks like a useful contribution in general. Apart from this, the reward fine-tuning with the orthogonal decomposition and the gradient reweighting looks like a good idea to deal with the "reward-hacking" problem. Finally, the results also appear quite strong from the evaluations presented in the

Weaknesses

An important paper that is missed here is ELLA [Hu et al. 2024], for a couple of reasons. The first is that they propose replacing the CLIP encoder of SD1.5 with a T5-XL model and get significantly improved results (far superior numbers to those reported by LaVi-Bridge, whose MLP adapter is used here). Therefore, this model might be a valid comparison (although the training cost of ELLA is a bit higher: 7 days with 8 A100s for SD1.5). Alternatively, the adapter provided by ELLA would have probably

Code & Models

Repositories

luping-liu/longalign
PyTorch · Official

Models

🤗 iSolver-AI/FEnet
model · 54 dl

Taxonomy

Topics: Image Retrieval and Classification Techniques

Methods: Diffusion · Contrastive Language-Image Pre-training