Diffusion As Self-Distillation: End-to-End Latent Diffusion In One Model
Xiyuan Wang, Muhan Zhang

TL;DR
This paper introduces Diffusion as Self-Distillation (DSD), a novel framework that unifies the encoder, decoder, and diffusion network into a single end-to-end trainable model, overcoming stability issues and achieving high-quality image generation.
Contribution
The paper proposes DSD, a training framework that stabilizes end-to-end latent diffusion training by drawing an analogy with self-distillation, so that the encoder, decoder, and diffusion network can be trained as a single unified model.
Findings
Enabled stable end-to-end training of a single network that performs encoding, decoding, and diffusion.
Achieved state-of-the-art conditional image generation on ImageNet 256x256.
Reduced model complexity with fewer parameters and training epochs.
Abstract
Standard Latent Diffusion Models rely on a complex, three-part architecture consisting of a separate encoder, decoder, and diffusion network, which are trained in multiple stages. This modular design is computationally inefficient, leads to suboptimal performance, and prevents the unification of diffusion with the single-network architectures common in vision foundation models. Our goal is to unify these three components into a single, end-to-end trainable network. We first demonstrate that a naive joint training approach fails catastrophically due to "latent collapse", where the diffusion training objective interferes with the network's ability to learn a good latent representation. We identify the root causes of this instability by drawing a novel analogy between diffusion and self-distillation-based unsupervised learning methods. Based on this insight, we propose Diffusion as…
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis · Face recognition and analysis
