DDAE++: Enhancing Diffusion Models Towards Unified Generative and Discriminative Learning

Weilai Xiang; Hongyu Yang; Di Huang; Yunhong Wang

arXiv:2505.10999·cs.CV·December 23, 2025

Open Access · 3 Reviews

TL;DR

DDAE++ introduces a lightweight self-conditioning mechanism that enhances diffusion models by better utilizing high-level semantics, leading to improved generative and discriminative capabilities with minimal added complexity.

Contribution

The paper proposes a novel self-conditioning technique that reshapes semantic hierarchies in diffusion models, enabling unified generative and discriminative learning without external guidance.

Findings

01

Improved representation quality in diffusion models across tasks.

02

Enhanced linear probing performance surpassing self-supervised models.

03

Maintained or improved image generation quality.

Abstract

While diffusion models excel at image synthesis, useful representations have been shown to emerge from generative pre-training, suggesting a path towards unified generative and discriminative learning. However, suboptimal semantic flow within current architectures can hinder this potential: features encoding the richest high-level semantics are underutilized and diluted when propagating through decoding layers, impeding the formation of an explicit semantic bottleneck layer. To address this, we introduce self-conditioning, a lightweight mechanism that reshapes the model's layer-wise semantic hierarchy without external guidance. By aggregating and rerouting intermediate features to guide subsequent decoding layers, our method concentrates more high-level semantics, concurrently strengthening global generative guidance and forming more discriminative representations. This simple approach…
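The mechanism described in the abstract (aggregate intermediate features, then reroute them to guide later decoding layers) can be illustrated with a toy sketch. This is not the authors' implementation. Global average pooling, the two-layer MLP, the additive injection into a timestep embedding, and every name below (`global_avg_pool`, `mlp`, `self_condition`, `init_params`) are assumptions made for illustration only.

```python
import random


def global_avg_pool(feature_map):
    """Average a (channels x positions) feature map over its positions."""
    return [sum(channel) / len(channel) for channel in feature_map]


def linear(vec, weight, bias):
    """Dense layer; `weight` holds one row of input weights per output unit."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def mlp(vec, params):
    """Two-layer MLP with a ReLU in between (an assumed aggregator head)."""
    hidden = [max(0.0, h) for h in linear(vec, params["w1"], params["b1"])]
    return linear(hidden, params["w2"], params["b2"])


def self_condition(time_emb, encoder_features, params):
    """Pool intermediate features, project them, and add the result to the
    conditioning vector that later decoder layers would consume.
    No labels or external encoders are involved."""
    semantic = mlp(global_avg_pool(encoder_features), params)
    return [t + s for t, s in zip(time_emb, semantic)]


def init_params(in_dim, hidden_dim, out_dim, seed=0):
    """Small random weights for the hypothetical MLP."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {"w1": mat(hidden_dim, in_dim), "b1": [0.0] * hidden_dim,
            "w2": mat(out_dim, hidden_dim), "b2": [0.0] * out_dim}


# Toy usage: 3 feature channels over 2 spatial positions, 2-dim conditioning.
features = [[1.0, 3.0], [0.0, 2.0], [4.0, 4.0]]
conditioning = self_condition([0.5, -0.5], features, init_params(3, 4, 2))
```

In a real network the returned vector would be passed to each subsequent decoder block in place of the plain timestep embedding; where exactly the features are tapped and injected is a design choice this sketch does not settle.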

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 5

Strengths

- The work identifies and addresses a genuine architectural weakness in standard diffusion models: the absence of a discriminative bottleneck due to distributed semantic flow. The "self-conditioning" mechanism is simple and effective, is clearly illustrated in Figure 1 and Figure 2, and does not require external supervision.
- The empirical evaluation is extensive and well-controlled, covering a variety of diffusion model backbones (UNet-based, UViT, DiT) and datasets (CIFAR-10/100, Tiny-ImageN

Weaknesses

- Although the experiments are broad, there is a bias towards popular image datasets, especially at lower resolutions. In Section E.1 (Limitations and Future Research Directions), the authors admit that their results stop at ImageNet 256x256 and DiT-base scale due to compute constraints, and do not demonstrate scalability for larger models/datasets relevant to modern AI.
- The ablation studies in Table 5 focus on demonstrating parameter sensitivity and certain hyperparameter effects (e.g., MLP

Reviewer 02 · Rating 6 · Confidence 5

Strengths

The paper presents a clear and intuitive idea, supported by solid experimental analysis. In particular, the layer-wise feature analysis via linear probing in Figure 5 provides valuable insights into why the proposed DDAE method is effective; this diagnostic perspective is worth learning from. Overall, both Table 3 and Figure 3 convincingly demonstrate that DDAE is an effective and practical approach.

Weaknesses

The claim in Figure 6 that "Self-conditioning facilitates the optimization and narrows the loss gap between un- and class-conditioning" does not hold. From the curves shown, we can only observe that both the conditional and unconditional loss curves converge faster, but there is no evidence indicating that the gap between them actually decreases.
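The distinction this reviewer draws, between both curves converging faster and the gap between them shrinking, can be checked numerically. The sketch below is only an illustration: the loss values are invented placeholders, not numbers from the paper, and `loss_gap` and `gap_narrows` are hypothetical helper names.

```python
def loss_gap(uncond_losses, cond_losses):
    """Per-step gap between unconditional and class-conditional loss."""
    return [u - c for u, c in zip(uncond_losses, cond_losses)]


def gap_narrows(baseline_gap, method_gap, tail=3):
    """Compare the mean gap over the last `tail` logged steps.
    Returns (narrowed?, baseline mean gap, method mean gap)."""
    base_tail = baseline_gap[-tail:]
    method_tail = method_gap[-tail:]
    base = sum(base_tail) / len(base_tail)
    ours = sum(method_tail) / len(method_tail)
    return ours < base, base, ours


# Placeholder curves (NOT from the paper): the "method" curves are lower
# everywhere, i.e. both converge faster, yet the gap is unchanged.
baseline = loss_gap([1.0, 0.75, 0.5], [0.75, 0.5, 0.25])
method = loss_gap([0.75, 0.5, 0.25], [0.5, 0.25, 0.0])
narrowed, base_mean, method_mean = gap_narrows(baseline, method)
```

With these placeholder values `narrowed` is `False` even though every loss is lower, which is the situation the reviewer argues cannot be ruled out from the plotted curves alone.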

Reviewer 03 · Rating 2 · Confidence 3

Strengths

* Analyzing and improving the representation of generative models is an important topic.
* The method is applicable to various network backbones.
* The ablation study part is performed in detail.

Weaknesses

**Experimental Evaluation**: The FID reported in this paper is worse than expected. For example, the official EDM [1] achieves an FID of 1.97 on unconditional CIFAR-10, which is much better than the 2.23 reported for the baseline in Table 2. While I understand that re-training the model from scratch could be computationally intensive, the mismatched generative performance makes it hard to judge the efficacy of the proposed method on a well-trained diffusion model. I strongly suggest considerin

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications

Methods: Diffusion