Lipschitz Singularities in Diffusion Models
Zhantao Yang, Ruili Feng, Han Zhang, Yujun Shen, Kai Zhu, Lianghua Huang, Yifei Zhang, Yu Liu, Deli Zhao, Jingren Zhou, Fan Cheng

TL;DR
This paper investigates the Lipschitz properties of diffusion models, revealing singularities that affect stability, and proposes a new method, E-TSDM, to mitigate these issues, leading to improved performance and reduced FID scores.
Contribution
The paper provides theoretical and empirical analysis of Lipschitz singularities in diffusion models and introduces E-TSDM, a novel approach to alleviate these singularities near zero timesteps.
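A minimal numerical sketch of where such a singularity can come from, under stated assumptions. If the network target is written as a product of the noise scale $\sigma_t$ and the score (as in the review below), the product rule puts a factor $d\sigma_t/dt$ into its time derivative. The continuous-time linear beta schedule and the constants `BETA_MIN` and `BETA_MAX` are illustrative assumptions, not values taken from this summary.

```python
import math

# Hypothetical continuous-time linear beta schedule on t in [0, 1].
# These constants are assumptions for illustration only.
BETA_MIN, BETA_MAX = 0.1, 20.0


def beta(t: float) -> float:
    """Noise rate beta(t), linear in t."""
    return BETA_MIN + (BETA_MAX - BETA_MIN) * t


def integrated_beta(t: float) -> float:
    """Integral of beta(s) ds from 0 to t."""
    return BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t * t


def sigma(t: float) -> float:
    """Noise std sigma_t = sqrt(1 - alpha_bar_t), alpha_bar_t = exp(-integral)."""
    # expm1 keeps precision when the integral is tiny (t close to 0).
    return math.sqrt(-math.expm1(-integrated_beta(t)))


def dsigma_dt(t: float) -> float:
    """Exact derivative: from sigma^2 = 1 - alpha_bar, 2*sigma*sigma' = beta*alpha_bar."""
    alpha_bar = math.exp(-integrated_beta(t))
    return beta(t) * alpha_bar / (2.0 * sigma(t))


if __name__ == "__main__":
    for t in (1e-1, 1e-2, 1e-4, 1e-6, 1e-8):
        print(f"t={t:g}  sigma={sigma(t):.6f}  dsigma/dt={dsigma_dt(t):.2f}")
```

Under this schedule $\sigma_t$ behaves like $\sqrt{\beta_{\min} t}$ near zero, so its slope grows like $1/\sqrt{t}$ and is unbounded as $t \to 0$, even though $\sigma_t$ itself goes to zero.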
Findings
Lipschitz singularities are present near zero timesteps in diffusion models.
E-TSDM significantly reduces Lipschitz singularities and improves model performance.
Achieved over 33% reduction in FID scores for acceleration methods.
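This summary does not describe how E-TSDM works. The sketch below assumes, from the published description of the method as a timestep-sharing scheme, that timesteps in an early interval are grouped into equal sub-intervals and that every timestep in a sub-interval is fed to the network as the same condition. The function name `shared_timestep` and the values `t_tilde=100` and `n=5` are hypothetical choices for illustration.

```python
def shared_timestep(t: int, t_tilde: int = 100, n: int = 5) -> int:
    """Map a timestep to the condition passed to the network.

    Assumed scheme (illustrative, not confirmed by the summary):
    timesteps in [0, t_tilde) are split into n equal sub-intervals, and each
    timestep is replaced by the left endpoint of its sub-interval.
    Timesteps >= t_tilde are passed through unchanged.
    """
    if t < 0:
        raise ValueError("timestep must be non-negative")
    if t_tilde % n != 0:
        raise ValueError("t_tilde must be divisible by n")
    if t >= t_tilde:
        return t
    width = t_tilde // n
    return (t // width) * width


if __name__ == "__main__":
    for t in (0, 7, 19, 20, 99, 100, 500):
        print(t, "->", shared_timestep(t))
```

Within each sub-interval the condition is constant, so the network output cannot vary with $t$ there. The trade-off in this sketch is that the network can no longer distinguish timesteps inside the same sub-interval.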
Abstract
Diffusion models, which employ stochastic differential equations to sample images through integrals, have emerged as a dominant class of generative models. However, the rationality of the diffusion process itself receives limited attention, leaving the question of whether the problem is well-posed and well-conditioned. In this paper, we explore a perplexing tendency of diffusion models: they often display the infinite Lipschitz property of the network with respect to time variable near the zero point. We provide theoretical proofs to illustrate the presence of infinite Lipschitz constants and empirical results to confirm it. The Lipschitz singularities pose a threat to the stability and accuracy during both the training and inference processes of diffusion models. Therefore, the mitigation of Lipschitz singularities holds great potential for enhancing the performance of diffusion…
Peer Reviews
Decision·ICLR 2024 oral
This paper highlights a unique and previously unexplored challenge with DDPM: the instability encountered when learning $\epsilon_{\theta} = \sigma_{t} \cdot \nabla \log q_{t}(x)$ during the time steps where $\sigma_{t}$ is minimal. One might naturally question why DDPM doesn't directly learn $\nabla \log q_{t}(x)$. I conjecture that the optimization process for learning $\nabla \log q_{t}(x)$, which involves solving $E\|\nabla \log q_{t}(x) - \frac{1}{\sigma_{t}} \|^2$, becomes problematic with
- One minor suggestion is to avoid saying that $t$ is small (rather, it is $\sigma_{t}$ that is small), since $t$ in fact takes the values $0, 1, 2, 3, \dots, 100$.
- The authors may add more discussion of the alternative approaches (see Questions below).
- It may be worth showing that directly learning $\nabla \log q_{t}(x)$ with least squares is prohibitive.
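The reviewer's conjecture about least-squares score learning can be illustrated with a short derivation. It assumes the usual forward process $x_t = \alpha_t x_0 + \sigma_t \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ in $d$ dimensions; the symbols $\alpha_t$ and $d$ are assumptions introduced here, not notation from the review.

```latex
\nabla_{x_t} \log q(x_t \mid x_0)
  = -\frac{x_t - \alpha_t x_0}{\sigma_t^{2}}
  = -\frac{\epsilon}{\sigma_t},
\qquad
\mathbb{E}\left\| \frac{\epsilon}{\sigma_t} \right\|^{2} = \frac{d}{\sigma_t^{2}}.
```

Under these assumptions, the regression target for the score has a second moment that grows like $1/\sigma_t^{2}$ as $\sigma_t \to 0$, whereas the $\epsilon$ target has second moment $d$ at every step. With this sign convention, $\epsilon = -\sigma_t \nabla_{x_t} \log q(x_t \mid x_0)$, which matches the relation quoted in the review up to sign.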
This is an excellent paper, and the presentation is very well carried out. The authors point out a very interesting theoretical property that could explain some practical instabilities encountered in DDPM samples. They then present a practical solution to the problem. The authors' contribution is excellent for the community, as reducing the instabilities in the generative process, such as diffusion models, has important practical consequences.
As it stands, this paper is impeccable in terms of presentation and contribution, both theoretically and practically. The only drawback is that no open-source code is available to experiment with the approach.
1) The proposed method is simple to implement.
2) The method clearly demonstrates significant empirical benefit.
3) The authors discuss alternative proposals and show that these are less effective.
1) The only weakness I would like to highlight is the discussion of the alternative methods presented.
- I believe one of the methods from the appendix is not mentioned in the main text, namely the Remp method (D.3.3).
- It would be nice to see an expanded discussion of these, with a small experiment to show the quantitative difference between the proposed method and these other methods. I appreciate the space limitation, but I think this is a really interesting point.
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neuroimaging Techniques and Applications · Stochastic Gradient Optimization Techniques
Methods: Diffusion
