Convergence Dynamics of Over-Parameterized Score Matching for a Single Gaussian
Yiran Zhang, Weihang Xu, Mo Zhou, Maryam Fazel, Simon Shaolei Du

TL;DR
This paper analyzes the convergence behavior of over-parameterized score matching models trained on a single Gaussian, providing theoretical guarantees under various initialization and noise conditions.
Contribution
It offers the first global convergence guarantees for over-parameterized score matching models learning a Gaussian, including regimes with different noise levels and initializations.
Findings
Global convergence when noise is large
Convergence with small initial parameters
Only one parameter converges with high probability from random initialization
Abstract
Score matching has become a central training objective in modern generative modeling, particularly in diffusion models, where it is used to learn high-dimensional data distributions through the estimation of score functions. Despite its empirical success, the theoretical understanding of the optimization behavior of score matching, particularly in over-parameterized regimes, remains limited. In this work, we study gradient descent for training over-parameterized models to learn a single Gaussian distribution. Specifically, we use a student model with learnable parameters and train it on data generated from a single ground-truth Gaussian using the population score matching objective. We analyze the optimization dynamics under multiple regimes. When the noise scale is sufficiently large, we prove a global convergence result for gradient descent. In the low-noise regime, we identify…
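A minimal sketch of the kind of setup the abstract describes may help make it concrete. Several details here are assumptions, not taken from the text above: the student is taken to be an equal-weight mixture of n unit-variance Gaussians with learnable means `mu`, the noise scale t shrinks all means by exp(-t), the population objective is approximated with a fixed Monte Carlo sample, and gradients are computed by finite differences. All function names are hypothetical.

```python
import numpy as np


def student_score(x, mu, t):
    """Score of an equal-weight mixture of N(exp(-t) * mu_i, I) at points x (assumed student)."""
    centers = np.exp(-t) * mu                      # (n, d) noised component means
    diff = centers[None, :, :] - x[:, None, :]     # (m, n, d) center minus sample
    logits = -0.5 * np.sum(diff ** 2, axis=-1)     # (m, n) log-densities up to a constant
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilise the softmax
    w = np.exp(logits)
    w = w / w.sum(axis=1, keepdims=True)           # posterior component weights
    return np.sum(w[:, :, None] * diff, axis=1)    # (m, d)


def score_matching_loss(mu, x, mu_star, t):
    """Monte Carlo estimate of E||s_student(x) - s_true(x)||^2 over the sample x."""
    target = np.exp(-t) * mu_star[None, :] - x     # score of N(exp(-t) * mu_star, I)
    err = student_score(x, mu, t) - target
    return float(np.mean(np.sum(err ** 2, axis=-1)))


def finite_diff_grad(f, mu, eps=1e-5):
    """Central-difference gradient of f with respect to every entry of mu."""
    grad = np.zeros_like(mu)
    for idx in np.ndindex(mu.shape):
        step = np.zeros_like(mu)
        step[idx] = eps
        grad[idx] = (f(mu + step) - f(mu - step)) / (2 * eps)
    return grad


def run_gd(t=1.0, n=3, d=2, m=2000, steps=50, lr=0.2, init_scale=0.1, seed=0):
    """Plain gradient descent on the sampled loss; returns final means and loss history."""
    rng = np.random.default_rng(seed)
    mu_star = np.ones(d)                                     # assumed ground-truth mean
    x = np.exp(-t) * mu_star + rng.standard_normal((m, d))   # samples from the noised Gaussian
    mu = init_scale * rng.standard_normal((n, d))            # small random initialization
    loss = lambda p: score_matching_loss(p, x, mu_star, t)
    losses = [loss(mu)]
    for _ in range(steps):
        mu = mu - lr * finite_diff_grad(loss, mu)
        losses.append(loss(mu))
    return mu, losses


if __name__ == "__main__":
    mu, losses = run_gd()
    print("initial loss:", losses[0], "final loss:", losses[-1])
```

Varying `t` and `init_scale` in `run_gd` gives an informal way to probe the large-noise and small-initialization regimes mentioned above. The sketch is illustrative only and makes no claim to reproduce the paper's results.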
Peer Reviews
Decision: ICLR 2026 Poster
The paper makes a significant contribution towards understanding the optimization dynamics of score matching, moving beyond the typical student-teacher setting. The paper identifies a nice set-up which captures some key elements of over-parameterized deep learning while still being mathematically tractable. The results shed light on the qualitative effects of over-parameterization for score matching, in particular the sensitivity of initialization in the low-noise regime. Additionally, it is a
The only weakness I can think of is that the paper analyzes the gradient dynamics of the score matching loss at a fixed time, whereas in practice, one usually trains a single, time-averaged score matching loss. I don't see this as a major flaw of the paper, but I am curious whether the results could be extended to this setting.
This paper builds on and simplifies the model introduced in Buchanan et al., “On the Edge of Memorization in Diffusion Models”. Its main contribution lies in the analysis of the training dynamics of such models, which provides a complementary perspective to Buchanan et al.’s work. The paper also offers interesting insights into the sensitivity of these dynamics: it demonstrates that even small changes in initialization can shift the system from convergence to divergence, revealing a subtle trans
1. I am not completely convinced that this over-parameterized setting, even if it can highlight interesting phenomena, is the right one to consider.
2. The authors claim that it is important in practice to include score matching in the large-noise regime. However, the value of the loss function is scaled by $\exp(-t)$, which should be negligible.
3. It may be better to merge the two separate regimes (i.e. $t$ large and $t$ small), and discuss the effect of initialization, which in my opinion g
- The paper provides a detailed convergence analysis of gradient descent in the setting where the true data distribution is Gaussian. It exhibits settings in which convergence is guaranteed with a quantified rate (large t) and settings in which convergence depends on the initialization of the parameters (small t). The authors also highlight a regime in which the loss may converge even though the parameters do not.
- Even if the chosen true data distrib
- Of course, restricting to the Gaussian case allows explicit computations, which are useful for understanding the behavior of the optimization process in depth. Still, most theoretical papers have obtained guarantees in terms of KL or Wasserstein distance for strongly log-concave distributions (even if this assumption has been relaxed recently). Extending to this setting or discussing the difficulty to extend the results, even if strong log concavity is still a strong assumption, would g
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Markov Chains and Monte Carlo Methods · Generative Adversarial Networks and Image Synthesis
