Rethinking Global Text Conditioning in Diffusion Transformers
Nikita Starodubcev, Daniil Pakhomov, Zongze Wu, Ilya Drobyshevskiy, Yuchen Liu, Zhonghao Wang, Yuqian Zhou, Zhe Lin, Dmitry Baranchuk

TL;DR
This paper investigates the necessity of modulation-based text conditioning in diffusion transformers, finding that attention alone suffices for prompt propagation but pooled embeddings can enhance controllability and performance in various tasks.
Contribution
The study demonstrates that conventional pooled embeddings add little to performance but can be used as guidance for controllability, offering a simple, training-free improvement applicable across models.
Findings
Attention alone effectively propagates prompt information.
Pooled embeddings serve as guidance for controllable shifts.
Used as guidance, pooled embeddings yield significant performance gains across diverse diffusion tasks.
Abstract
Diffusion transformers typically incorporate textual information via attention layers and a modulation mechanism using a pooled text embedding. Nevertheless, recent approaches discard modulation-based text conditioning and rely exclusively on attention. In this paper, we address whether modulation-based text conditioning is necessary and whether it can provide any performance advantage. Our analysis shows that, in its conventional usage, the pooled embedding contributes little to overall performance, suggesting that attention alone is generally sufficient for faithfully propagating prompt information. However, we reveal that the pooled embedding can provide significant gains when used from a different perspective: serving as guidance and enabling controllable shifts toward more desirable properties. This approach is training-free, simple to implement, incurs negligible runtime overhead,…
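The idea of using the pooled embedding "as guidance" can be illustrated with a minimal sketch. This summary does not state the exact formula, so the following assumes the familiar guidance form: the pooled embedding is shifted along the direction between a "positive" and a "negative" pooled embedding, scaled by a guidance scale w, and the shift is applied only at selected layers. The function names, the argument names, and the update rule itself are hypothetical illustrations, not the authors' implementation.

```python
from typing import List, Sequence, Set


def guided_pooled_embedding(
    base: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    w: float,
) -> List[float]:
    """Shift `base` along (positive - negative), scaled by w.

    Assumed update rule (not taken from the paper): base + w * (pos - neg).
    With w = 0 the base embedding is returned unchanged.
    """
    if not (len(base) == len(positive) == len(negative)):
        raise ValueError("all pooled embeddings must have the same length")
    return [b + w * (p - n) for b, p, n in zip(base, positive, negative)]


def pooled_for_layer(
    layer_idx: int,
    guided_layers: Set[int],
    base: Sequence[float],
    guided: Sequence[float],
) -> List[float]:
    """Pick the pooled embedding fed to the modulation of one layer.

    Only layers listed in `guided_layers` (a hypothetical hyperparameter)
    receive the shifted embedding; all others keep the base embedding.
    """
    return list(guided) if layer_idx in guided_layers else list(base)


# Toy 2-dimensional embeddings, chosen only for illustration.
base = [1.0, 2.0]
guided = guided_pooled_embedding(base, [0.5, 0.0], [0.0, 1.0], w=2.0)
```

Because the shift only changes the vector passed to the modulation layers, a scheme of this kind requires no retraining, which is consistent with the training-free, low-overhead claim in the abstract.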
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper provides a clear and convincing empirical investigation into why global text conditioning appears ineffective in current models, filling an important gap in understanding.
2. Modulation guidance is training-free, easy to implement, computationally lightweight, and broadly applicable across architectures and tasks.
1. The core idea of using pooled embeddings for guidance resembles prior work on semantic directions in GANs (e.g., StyleGAN) and recent methods like TokenVerse or Concept Sliders, though the application to modulation space in diffusion transformers is new.
2. The method introduces new hyperparameters (guidance scale w, layer indices), requiring tuning for different tasks. Although ablations help, this adds complexity compared to plug-and-play baselines.
1. The revisiting and discovery that global text conditioning can be leveraged as a powerful control signal, rather than being merely a passive input, is novel. The proposed dynamic modulation guidance demonstrates a clear ability to address classic and stubborn challenges in T2I generation, such as hand synthesis and object counting, which is a significant finding.
2. The paper is impressive in its extensive experimental scope, demonstrating effectiveness across a diverse set of tasks, including
I have the following two major questions: 1. I noticed that different hyperparameters are used for different tasks and generation types/styles in Tab. 5. Could the authors provide more detailed guidance on the process of selecting the appropriate strategy and its associated hyperparameters for a **new, unseen task**? Is this process largely heuristic, requiring manual search for each new situation, or are there general principles or a methodology that can be derived from the observations in Figu
The paper is clearly written and well-structured. A primary strength lies in its comprehensive experimental validation. The experiments are thorough and are conducted on 4 T2I models that are trained with CLIP modulation; the authors even include an additional model that did not use CLIP, training it to incorporate CLIP modulation. Furthermore, they include text-to-video models, thereby broadening the applicability of their findings.
The primary weakness is the limited novelty of the method. This method (with the exception of choosing the dynamic modulation strategies) was already presented in [1] as a naive approach (Equation 2). If the authors disagree, I would be happy to discuss and understand the novelty better. A second, smaller weakness concerns the justification for the proposed dynamic modulation strategies. These strategies are heuristically derived from observed attention patterns within the model. This reliance
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Digital Humanities and Scholarship · Image Enhancement Techniques
