Rethinking Global Text Conditioning in Diffusion Transformers

Nikita Starodubcev; Daniil Pakhomov; Zongze Wu; Ilya Drobyshevskiy; Yuchen Liu; Zhonghao Wang; Yuqian Zhou; Zhe Lin; Dmitry Baranchuk

arXiv:2602.09268·cs.CV·February 11, 2026

Open Access · 3 Reviews

TL;DR

This paper investigates the necessity of modulation-based text conditioning in diffusion transformers, finding that attention alone suffices for prompt propagation but pooled embeddings can enhance controllability and performance in various tasks.

Contribution

The study demonstrates that conventional pooled embeddings add little to performance but can be used as guidance for controllability, offering a simple, training-free improvement applicable across models.

Findings

01

Attention alone effectively propagates prompt information.

02

Pooled embeddings serve as guidance for controllable shifts.

03

Significant performance gains in diverse diffusion tasks.

Abstract

Diffusion transformers typically incorporate textual information via attention layers and a modulation mechanism using a pooled text embedding. Nevertheless, recent approaches discard modulation-based text conditioning and rely exclusively on attention. In this paper, we address whether modulation-based text conditioning is necessary and whether it can provide any performance advantage. Our analysis shows that, in its conventional usage, the pooled embedding contributes little to overall performance, suggesting that attention alone is generally sufficient for faithfully propagating prompt information. However, we reveal that the pooled embedding can provide significant gains when used from a different perspective: serving as guidance and enabling controllable shifts toward more desirable properties. This approach is training-free, simple to implement, incurs negligible runtime overhead,…
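
To make the idea of using the pooled embedding as guidance more concrete, here is a minimal sketch in plain Python. It is not the paper's formulation: the update rule (shifting the pooled embedding along the difference between a "positive" and a "negative" pooled embedding, scaled by `w`) and the per-layer selection are assumptions made for illustration, chosen because the reviews below mention a guidance scale w and layer indices as the method's hyperparameters. The names `guided_pooled_embedding`, `per_layer_embeddings`, `positive`, `negative`, and `guided_layers` are hypothetical.

```python
from typing import List, Sequence, Set

Vector = List[float]


def guided_pooled_embedding(
    pooled: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    w: float,
) -> Vector:
    """Shift `pooled` along the (positive - negative) direction by scale `w`.

    Assumed rule for illustration only: pooled + w * (positive - negative).
    """
    if not (len(pooled) == len(positive) == len(negative)):
        raise ValueError("all embeddings must have the same dimension")
    return [p + w * (a - b) for p, a, b in zip(pooled, positive, negative)]


def per_layer_embeddings(
    pooled: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    w: float,
    num_layers: int,
    guided_layers: Set[int],
) -> List[Vector]:
    """Return one pooled embedding per layer.

    Layers listed in `guided_layers` receive the shifted embedding;
    all other layers receive an unmodified copy of `pooled`.
    """
    guided = guided_pooled_embedding(pooled, positive, negative, w)
    return [
        list(guided) if i in guided_layers else list(pooled)
        for i in range(num_layers)
    ]
```

In a real diffusion transformer each per-layer vector would feed that block's modulation (scale and shift) path, and the embeddings would come from a text encoder; both are outside the scope of this sketch. Because only a small vector is modified before the forward pass, a scheme of this shape needs no retraining, which is consistent with the training-free, low-overhead claim in the abstract.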

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The paper provides a clear and convincing empirical investigation into why global text conditioning appears ineffective in current models, filling an important gap in understanding. 2. Modulation guidance is training-free, easy to implement, computationally lightweight, and broadly applicable across architectures and tasks.

Weaknesses

1. The core idea of using pooled embeddings for guidance resembles prior work on semantic directions in GANs (e.g., StyleGAN) and recent methods like TokenVerse or Concept Sliders, though the application to modulation space in diffusion transformers is new. 2. The method introduces new hyperparameters (guidance scale w, layer indices), requiring tuning for different tasks—though ablations help, this adds complexity compared to plug-and-play baselines.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The revisiting and discovery that global text conditioning can be leveraged as a powerful control signal—rather than being merely a passive input—is novel. The proposed dynamic modulation guidance demonstrates a clear ability to address classic and stubborn challenges in T2I generation, such as hand synthesis and object counting, which is a significant finding. 2. The paper is impressive in its extensive experimental scope, demonstrating effectiveness across a diverse set of tasks—including

Weaknesses

I have the following two major questions: 1. I noticed that different hyperparameters are used for different tasks and generation types/styles in Tab. 5. Could the authors provide more detailed guidance on the process of selecting the appropriate strategy and its associated hyperparameters for a **new, unseen task**? Is this process largely heuristic, requiring manual search for each new situation, or are there general principles or a methodology that can be derived from the observations in Figu

Reviewer 03 · Rating 8 · Confidence 3

Strengths

The paper is clearly written and well-structured. A primary strength lies in its comprehensive experimental validation. The experiments are thorough and are conducted on 4 T2I models that are trained with CLIP modulation, and even include an additional model that did not use CLIP, which the authors trained to incorporate CLIP modulation. Furthermore, they include text-to-video models, thereby broadening the applicability of their findings.

Weaknesses

The primary weakness is the limited novelty of the method. This method (with the exception of choosing the dynamic modulation strategies) was already presented in [1] as a naive approach (Equation 2). If the authors disagree, I would be happy to discuss and understand the novelty better. A second, smaller weakness concerns the justification for the proposed dynamic modulation strategies. These strategies are heuristically derived from observed attention patterns within the model. This reliance

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Digital Humanities and Scholarship · Image Enhancement Techniques