Transformers Learn the Optimal DDPM Denoiser for Multi-Token GMMs
Hongkang Li, Hancheng Min, Rene Vidal

TL;DR
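This paper presents what its authors describe as the first convergence analysis for training transformer-based diffusion models. It studies the population DDPM objective on data drawn from a multi-token Gaussian mixture, showing how the trained transformer approximates the optimal denoiser and under what conditions training converges.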
Contribution
It offers a theoretical account of why transformers work well as diffusion denoisers in the multi-token Gaussian mixture setting, including the conditions for convergence and the role self-attention plays in denoising.
Findings
Transformer models can converge to the Bayes-optimal denoising risk.
Self-attention modules implement a mean denoising mechanism.
Numerical experiments validate the theoretical analysis.
Abstract
Transformer-based diffusion models have demonstrated remarkable performance at generating high-quality samples. However, our theoretical understanding of the reasons for this success remains limited. For instance, existing models are typically trained by minimizing a denoising objective, which is equivalent to fitting the score function of the training data. However, we do not know why transformer-based models can match the score function for denoising, or why gradient-based methods converge to the optimal denoising model despite the non-convex loss landscape. To the best of our knowledge, this paper provides the first convergence analysis for training transformer-based diffusion models. More specifically, we consider the population Denoising Diffusion Probabilistic Model (DDPM) objective for denoising data that follow a multi-token Gaussian mixture distribution. We theoretically…
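As a rough illustration of what an "optimal denoiser" means for Gaussian mixture data, the sketch below computes the posterior mean E[x_0 | x_t] in closed form. It assumes a single-token mixture with a shared isotropic covariance sigma2 * I and the standard DDPM forward noising x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps. The single-token simplification, the shared covariance, and every name here (gmm_posterior_mean, means, weights, sigma2, alpha_bar) are assumptions made for illustration; the paper's multi-token model and its transformer architecture are not reproduced.

```python
import numpy as np

def gmm_posterior_mean(x_t, means, weights, sigma2, alpha_bar):
    """Posterior mean E[x_0 | x_t] for x_0 ~ sum_k weights[k] * N(means[k], sigma2 * I)
    observed through x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps.

    Illustrative assumption: one token, shared isotropic covariance.
    """
    s = np.sqrt(alpha_bar)
    # Marginal variance of x_t within each component.
    v = alpha_bar * sigma2 + (1.0 - alpha_bar)
    # Residual of x_t against each scaled component mean, shape (K, d).
    diffs = x_t[None, :] - s * means
    # Posterior component probabilities via a numerically stable softmax.
    logits = np.log(weights) - 0.5 * np.sum(diffs ** 2, axis=1) / v
    logits -= logits.max()
    post = np.exp(logits)
    post /= post.sum()
    # Per-component conditional mean of x_0 given x_t (Gaussian conditioning).
    comp_means = means + (s * sigma2 / v) * diffs
    # Mix the per-component estimates by their posterior probabilities.
    return post @ comp_means

# Example: two symmetric components; a noisy point midway between them
# is denoised to the midpoint.
means = np.array([[2.0, 0.0], [-2.0, 0.0]])
weights = np.array([0.5, 0.5])
x0_hat = gmm_posterior_mean(np.zeros(2), means, weights, sigma2=0.1, alpha_bar=0.5)
```

Under squared-error loss the posterior mean is the minimizer of the denoising risk, which is why it serves as the reference point for a "Bayes-optimal" denoiser in this simplified setting.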
