A Hidden Semantic Bottleneck in Conditional Embeddings of Diffusion Transformers
Trung X. Pham, Kang Zhang, Ji Woo Hong, Chang D. Yoo

TL;DR
This paper uncovers a semantic bottleneck in diffusion transformer embeddings, showing that most semantic information is concentrated in few dimensions and that pruning low-magnitude dimensions does not harm, and can even improve, generation quality.
Contribution
It provides the first systematic analysis of conditional embeddings in diffusion transformers, revealing redundancy and a semantic bottleneck that suggest more efficient conditioning methods.
Findings
Class-conditioned embeddings show extreme angular similarity (exceeding 99% on ImageNet-1K).
Semantic information is concentrated in a small subset of dimensions.
Pruning low-magnitude dimensions maintains or improves generation quality.
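The first and third findings can be illustrated with a minimal NumPy sketch. It is not the authors' code: the function names, the synthetic embedding table, and the per-vector "keep the largest-magnitude entries" rule are all assumptions made for illustration, and the paper's exact pruning criterion may differ.

```python
import numpy as np


def mean_pairwise_cosine(emb):
    """Mean cosine similarity over all distinct pairs of rows in emb."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T
    n = emb.shape[0]
    # Exclude the diagonal (each row compared with itself).
    return float(sim[~np.eye(n, dtype=bool)].mean())


def prune_low_magnitude(vec, keep_fraction):
    """Zero every entry of vec except the largest-magnitude keep_fraction."""
    k = max(1, int(round(keep_fraction * vec.size)))
    keep_idx = np.argsort(np.abs(vec))[-k:]
    pruned = np.zeros_like(vec)
    pruned[keep_idx] = vec[keep_idx]
    return pruned


# Synthetic stand-in for a class-embedding table (assumption, not real data):
# two large shared "head" dimensions plus small class-specific noise.
rng = np.random.default_rng(0)
num_classes, dim = 10, 12
shared = np.zeros(dim)
shared[:2] = 50.0
emb = shared + rng.normal(scale=0.1, size=(num_classes, dim))

similarity = mean_pairwise_cosine(emb)

# Keep one third of the dimensions, i.e. remove two thirds.
pruned = prune_low_magnitude(emb[0], keep_fraction=1 / 3)
```

On this toy table the shared head dimensions dominate every vector, so the pairwise cosine similarity is close to 1 and pruning leaves the head dimensions untouched. Whether generation quality survives pruning in a real model is an empirical question that this sketch does not test.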
Abstract
Diffusion Transformers have achieved state-of-the-art performance in class-conditional and multimodal generation, yet the structure of their learned conditional embeddings remains poorly understood. In this work, we present the first systematic study of these embeddings and uncover a notable redundancy: class-conditioned embeddings exhibit extreme angular similarity, exceeding 99% on ImageNet-1K, while continuous-condition tasks such as pose-guided image generation and video-to-audio generation reach over 99.9%. We further find that semantic information is concentrated in a small subset of dimensions, with head dimensions carrying the dominant signal and tail dimensions contributing minimally. By pruning low-magnitude dimensions (removing up to two-thirds of the embedding space), we show that generation quality and fidelity remain largely unaffected, and in some cases improve. These…
Peer Reviews
Decision·ICLR 2026 Poster
Besides identifying the similarity and sparsity of conditional embeddings in DiT models, this paper also turns the observation into action by pruning the conditional embedding, and it offers multiple hypotheses for further exploration.
By discovering this phenomenon, the authors hope that “future architectures could benefit from compressed or hybrid conditioning strategies that maintain semantic fidelity while reducing computational overhead.” However, the conditional embedding carries not only the condition information but also the timestep information, which is important for image generation diffusion models. It’s not enough to only evaluate the semantic fidelity of the conditional embedding. Meanwhile, the paper does not p
1. The paper evaluates a broad set of transformer-based diffusion models, including DiT, MDT, SiT, REPA, LightningDiT, etc., on diverse benchmarks, and it shows evidence of near-uniform cosine similarity and embedding sparsity. The breadth of experiments convincingly substantiates the paper’s core claims. 2. The figures and results tables in the paper vividly illustrate the redundancy and sparsity of class embeddings, especially the high cosine similarity and the dominance of a few embed
1. The theoretical analysis, while mathematically sound in defining PR and sparsity, stops short of providing a fundamental theoretical explanation of why such extreme redundancy and cosine similarity emerge in the conditional embeddings of transformer-based diffusion models. The paper’s stated hypotheses remain largely empirical and conceptual, lacking formal proofs or broader generalization to other conditioning modalities or even to transformer architectures outside the diffusion context. For
1. It is the first work to systematically investigate the internal structure of conditional embeddings in diffusion Transformers, uncovering two core emergent properties—extreme angular similarity (exceeding 99% on ImageNet-1K and 99.9% on continuous-condition tasks) and high-dimensional sparsity (only 1–2% of dimensions carrying substantial semantic information). This fills a critical gap in existing literature, which has primarily focused on architectural advances rather than the intrinsic cha
1. Although the authors reveal the redundancy problem in the condition embeddings of diffusion models through empirical analysis, they do not provide any further method to improve the efficiency or performance of existing models based on this observation. 2. More conditions should be considered. For instance, user-generated descriptions or prompts are more general than class-conditional prompts. I suggest the authors investigate more types of conditions. 3. Maybe the time embedding and prompt embedding sh
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Face recognition and analysis
