Closing the ODE–SDE gap in score-based diffusion models through the Fokker–Planck equation

Teo Deveney; Jan Stanczuk; Lisa Kreusser; Chris Budd; Carola-Bibiane Schönlieb

PMC · DOI:10.1098/rsta.2024.0503·June 5, 2025

Open Access

TL;DR

This paper explains why ODE-based samplers in diffusion models perform worse than SDE-based ones and proposes a method to improve them using the Fokker–Planck equation.

Contribution

The paper introduces a theoretical framework linking ODE and SDE dynamics via the Fokker–Planck equation and proposes a regularization method to reduce their performance gap.

Findings

01

The difference between ODE and SDE samplers is linked to the Fokker–Planck residual.

02

Adding the Fokker–Planck residual as a regularization term improves ODE sampler performance.

03

Improving ODE samplers can sometimes degrade SDE sample quality.

Abstract

Score-based diffusion models have emerged as one of the most promising frameworks for deep generative modelling, due to both their mathematical foundations and their state-of-the-art performance in many tasks. Empirically, it has been reported that samplers based on ordinary differential equations (ODEs) are inferior to those based on stochastic differential equations (SDEs). In this article, we systematically analyse the difference between the ODE and SDE dynamics of score-based diffusion models and show how this relates to an associated Fokker–Planck equation. We rigorously describe the full range of dynamics and approximations arising when training score-based diffusion models and derive a theoretical upper bound on the Wasserstein 2-distance between the ODE- and SDE-induced distributions in terms of a Fokker–Planck residual. We also show numerically that conventional score-based…

Figures (2)

Left to right our three examples are a Gaussian mixture, a concentric circles distribution and a checkerboard distribution.

Distributions of $p_\theta^{\mathrm{SDE}}(\cdot,0)$ and $p_\theta^{\mathrm{ODE}}(\cdot,0)$ for weighting parameters $w_R$ taking values in $(0, 0.1, 1, 10)$. The rows indicate which weighting parameter was used, while the columns indicate whether the displayed distribution is of $p_\theta^{\mathrm{SDE}}(\cdot,0)$ or of $p_\theta^{\mathrm{ODE}}(\cdot,0)$ in the corresponding experiment. Samples displayed from the ODE and SDE samplers were obtained using the same score model.

Funding (2)

—Engineering and Physical Sciences Research Council, http://dx.doi.org/10.13039/501100000266
—Wellcome Trust, http://dx.doi.org/10.13039/100010269

Keywords

score-based diffusion; Fokker–Planck; generative modelling; Wasserstein distance

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Model Reduction and Neural Networks · Advanced Neuroimaging Techniques and Applications

Full text

Introduction

Generative modelling, the task of approximating the distribution underlying some given dataset, is useful in a range of scientific and non-scientific applications. The current state of the art is set by diffusion models [1,2], which obtain this approximation by perturbing data with white noise and learning to iteratively denoise the perturbed data. Around the same time, score-based models [3,4], which approximate the distribution through the gradient of its log-density (the score function), showed impressive results when combined with Langevin-based sampling. Today these frameworks have been unified into a single score-based diffusion approach [5], where stochastic differential equations (SDEs) or ordinary differential equations (ODEs) driven by score models are applied to denoise perturbed data. This approach has received a lot of attention from both theoretical and applied communities due to its strong mathematical foundation and state-of-the-art performance [5,6]. In practice, differences between SDE- and ODE-based sampling distributions are observed even for a common score model, which motivates this work.

The SDE formulation arises from the conversion of data into noise through a diffusion process. As score-based diffusion is used for data generation, the time reversal of this process, i.e. the conversion from noise to data, is crucial and has a closed-form expression that depends on the (unknown) time-dependent score function of this diffusion. Hence, data generation is achieved by approximating these reverse dynamics through a neural approximation of the score function. The ODE framework, which originates from a diffusion-free reparameterization of the Fokker–Planck equation, provides a deterministic sampling method with significant theoretical and computational benefits, such as tractable likelihood computation and access to more efficient ODE integrators for faster sample generation. However, the existing literature indicates that ODE-based samplers are inferior to SDE-based samplers in diffusion models. This is evidenced empirically in [5], which reports higher (i.e. worse) Fréchet inception distances (an image quality metric for which lower is better) for ODE-based samplers in their experiments. Theoretical analysis corroborates this observation: tighter upper bounds have been derived for SDE-based sampling than for ODE-based sampling [7–11]. These discrepancies in performance raise questions about the validity of the likelihood computations attained in practice and prompt us to investigate the reasons for the discrepancy between SDE- and ODE-induced distributions.

In this article, we investigate the ODE–SDE gap in score-based diffusion models theoretically by analysing their connection to the mechanism underpinning their relationship—the Fokker–Planck equation. Our aim is to expose the theoretical insight that the ODE–SDE gap is related to how well the score model approximates the solution to a Fokker–Planck equation. As such, our work makes use of tools from the analysis of partial differential equations (PDEs), combined with methods for ODEs and SDEs. In addition, we provide numerical experiments using the Fokker–Planck equation to construct a regularizer. For toy examples in two dimensions, we attain explicit visual comparisons of the distributions generated by ODE and SDE samples and measure the relevant Wasserstein distances. We emphasize that the objective of our numerics is to accurately support our theory in an interpretable way. Since our approach to the numerics is more costly than a traditional diffusion model to train (though the cost of sampling from the trained model is unchanged), it is not proposed to be a scalable approach to training high-dimensional models, and we refer to [12] for more scalable approximate approaches in this direction.

(a) Related work

The deterministic ODE dynamics for score-based diffusion models were introduced in [5]. To compare the ODE and SDE distributions, the authors in [5] show that under a perfect score approximation, the SDE and ODE distributions coincide and derive a method for computing the likelihoods based on the ODE formulation. In the same work, it is empirically reported that the ODE sampler exhibits inferior performance. This empirical finding highlights the necessity for a more rigorous theoretical investigation into this phenomenon. In [13], the authors bound the Kullback–Leibler divergence between the SDE-induced model distribution and the target true distribution in terms of the score-matching objective minimized during training. However, they also point out that the same bound does not hold for the ODE-induced distribution.

This issue is further explored in [14], where the authors introduce a new equality that can be used for bounds of the Kullback–Leibler divergence between the ODE-induced distribution and the data-generating distribution. Their findings reveal that the conventional score matching objective, typically employed in score-based diffusion models, fails to adequately control the error in the ODE distribution, and an alternative objective is proposed.

Since then, both the ODE and SDE formulation have been analysed. For SDE-based sampling, theoretical convergence guarantees including polynomial-in-time convergence under an $[eqn]$ score approximation have been proven [7–11]. For ODE-sampling, fast convergence (polynomial-in-time) has only been shown with the presence of Langevin-based correction steps [15]. Unfortunately, this approach results in a stochastic sampler that sacrifices deterministic mappings and is unsuitable for likelihood computations. The analysis of the fully deterministic system includes [16], which establishes an upper bound based on the number of steps of the discretized ODE. In addition, [17] provides error bounds for the flow matching method introduced in [18], a generalization of diffusion-based ODE methods. Of these works, the bound in [17] looks most similar to ours, though none are directly applicable to our setting, since they are all comparisons to the ground truth given in terms of an $[eqn]$ score approximation, whereas we investigate the ODE–SDE gap in terms of a Fokker–Planck residual.

More closely related to our work, the authors in [12] relate the Kullback–Leibler divergence between ODE-induced and data-generating distributions to the error in the Fokker–Planck equation associated with the diffusion process. They demonstrate that the residual of the Fokker–Planck equation can bound the ODE sample error up to some non-zero threshold. For full convergence, their analysis additionally requires minimization of the score-matching objective.

Our work shares some themes with [12], since we also consider the Fokker–Planck equation underlying the diffusion dynamics. However, our focus differs: we specifically study the ODE–SDE gap. Moreover, our analysis has been conducted independently using a different theoretical toolbox, and this reveals different insights, as highlighted in §1b.

(b) Contributions

In this work, we provide a concise and rigorous exposition of the full range of densities and their approximations that arise in the score-based diffusion framework. This includes densities of the true and approximate dynamics—both deterministic and stochastic, as well as forward and backwards in time—with their density evolution equations and the neural approximations producing them. Given these dynamics, the contributions of this work are the following:

—We derive an upper bound between the densities induced by the approximate ODE and the approximate SDE. To the authors’ knowledge, this is the first such bound that does not relate each approximated density to the true one. The distance is key when querying both the ODE and the SDE of the same model, for example when stochastic sampling is combined with likelihood evaluations.

—Our bound is in Wasserstein-distance, in contrast to previous works that mostly focus on weaker distances such as Kullback–Leibler divergence or total variation.
—Our bound relates the distributions through the Fokker–Planck equation of the generative process. This distinguishes our work from prior works in the area and is the pertinent object to study since ODE-based sampling arises through a reformulation of the Fokker–Planck equation. We show that the ODE–SDE gap increases when the neural approximation fails to satisfy a Fokker–Planck equation.
—We prove our result for both the potential parameterization and the score parameterization more commonly used in practice, by considering the analogous Fokker–Planck equation for the score function.
—We support our theory by providing numerical experiments using the residual of the Fokker–Planck equation in a regularization term. We visualize the densities and calculate the relevant Wasserstein distances to explicitly demonstrate that a lower residual error in the Fokker–Planck equation is associated with a lower ODE–SDE gap.

(c) Outline

In §2, we describe the broad range of dynamics and approximations that arise when training a score-based diffusion model. Our main theoretical result on the ODE–SDE gap in score-based diffusion models is proven in §3, where we derive an upper bound on the Wasserstein 2-distance between the ODE- and SDE-induced distributions in terms of a Fokker–Planck residual. In §4, we provide numerical evidence showing explicitly that conventional score-based diffusion models can exhibit significant differences between SDE- and ODE-induced distributions. Moreover, we show that reducing the Fokker–Planck residual by adding it as an additional regularization term indeed leads to closing the gap between SDE and ODE distributions.

Score-based diffusion models

(a) Assumptions and notation

We will work in the time domain $[eqn]$ for some $[eqn]$ and spatial domain $[eqn]$ . For two vectors $[eqn]$ , we denote their inner product $[eqn]$ , with associated norm $[eqn]$ . For a function $[eqn]$ , we denote the $[eqn]$ -norm over some domain $[eqn]$ as $[eqn]$ . We denote by $[eqn]$ the (spatial) gradient and by $[eqn]$ the Laplacian. For two probability measures, $[eqn]$ on $[eqn]$ , we denote their Wasserstein 2-distance by $[eqn]$ .

Let $[eqn]$ be a probability space and let $[eqn]$ be the natural filtration (the increasing family of sub- $[eqn]$ -algebras containing information at times $[eqn]$ ). As is convention, we denote by $[eqn]$ a Brownian motion at time $[eqn]$ with values in $[eqn]$ adapted to the filtration $[eqn]$ . Conversely, let $[eqn]$ denote a reverse filtration (the decreasing family of sub- $[eqn]$ -algebras containing information at times $[eqn]$ ). We denote by $[eqn]$ a Brownian motion at time $[eqn]$ with values in $[eqn]$ adapted to $[eqn]$ . Under suitable assumptions, SDEs driven by $[eqn]$ are adapted to $[eqn]$ , and SDEs driven by $[eqn]$ are adapted to $[eqn]$ . Throughout we will refer to the former as forward SDEs, and the latter as reverse SDEs even though both SDEs will initially be formulated using the forward time variable $[eqn]$ . When dealing with SDEs and the associated Fokker–Planck equations, we will distinguish between evolution equations running forward and backwards in time by introducing the reverse time variable $[eqn]$ to specify that the corresponding dynamics are in reverse time. We will denote the SDE dynamics parameterized with the forward and reverse time variables $[eqn]$ and $[eqn]$ by $[eqn]$ and $[eqn]$ , respectively, with $[eqn]$ . For any function $[eqn]$ , we introduce $[eqn]$ by $[eqn]$ for all $[eqn]$ . Further, let probability densities $[eqn]$ on $[eqn]$ be given, and we denote the associated log-densities by $[eqn]$ , $[eqn]$ on $[eqn]$ . Throughout the article, we make the following regularity assumptions:

Assumptions 2.1. Let $[eqn]$ and let $[eqn]$ such that $[eqn]$ for some $[eqn]$ . Assume that $[eqn]$ and there is $[eqn]$ such that $[eqn]$ for all $[eqn]$ . We assume that $[eqn]$ is a bounded domain with $[eqn]$ . For neural approximations, we assume smooth activation functions, so that neural potential models $[eqn]$ are in $[eqn]$ throughout, and neural score models $[eqn]$ are in $[eqn]$ . Moreover, we assume that there are $[eqn]$ such that $[eqn]$ and $[eqn]$ . Finally, we assume that the second moments of $[eqn]$ and $[eqn]$ are finite, and that $[eqn]$ .

(b) Particle dynamics

We introduce the forward SDE as

[eqn]

equipped with some initial distribution $[eqn]$ for $[eqn]$ . In generative modelling settings, this initial distribution $[eqn]$ represents the underlying target distribution from which the data were sampled. In equation (2.1), $[eqn]$ denotes the value of a Brownian motion adapted to $[eqn]$ , and therefore $[eqn]$ is also adapted to $[eqn]$ . We denote the associated marginal density of samples from equation (2.1) at time $[eqn]$ by $[eqn]$ with $[eqn]$ . Note that equation (2.1) has a unique $[eqn]$ -continuous solution by assumptions 2.1.

In [19], the author shows that the process in equation (2.1) can be written as an SDE measurable with respect to the reverse filtration $[eqn]$ . We refer to this SDE as the reverse SDE, and it is given by

[eqn]

where $[eqn]$ is a Brownian motion adapted to $[eqn]$ at time $[eqn]$ . Intuitively, one can think of $[eqn]$ as the backwards evolution of Brownian motion with known terminal state, and equation (2.2) as the backwards evolution of equation (2.1). Accordingly, if the terminal distribution for $[eqn]$ is set to $[eqn]$ , then the trajectories of equation (2.2) share the same distribution as equation (2.1) for any time $[eqn]$ . As shown in [5], a reformulation of the Fokker–Planck equations allows us to derive the probability flow ODE of the forward SDE equation (2.1). This is given by

[eqn]

equipped with initial distribution $[eqn]$ for $[eqn]$ or, equivalently, terminal distribution $[eqn]$ for $[eqn]$. The trajectories initialized from $[eqn]$ evolve forward in time according to equation (2.3) and also have marginal distribution $[eqn]$ at time $[eqn]$. Similarly, the trajectories with terminal condition $[eqn]$ sampled from $[eqn]$ have marginal distribution $[eqn]$ at time $[eqn]$. Therefore, the densities associated with equations (2.1), (2.2) and (2.3) are all given by $[eqn]$ at any time $[eqn]$.
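As a rough illustration of the forward dynamics, the sketch below simulates a forward SDE with the Euler–Maruyama scheme. The specific choices are assumptions made only for this example and are not taken from the article: a one-dimensional Ornstein–Uhlenbeck process with drift `-x/2` and unit diffusion, and a toy data distribution N(0, 0.25). For this toy case the marginal variance at time t is known in closed form, which gives a simple sanity check.

```python
import math
import random


def simulate_forward_sde(x0, drift, diffusion, t_end, n_steps, rng):
    """Euler-Maruyama path of dX = drift(X, t) dt + diffusion(t) dW, started at X(0) = x0."""
    dt = t_end / n_steps
    x, t = x0, 0.0
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment over one step
        x += drift(x, t) * dt + diffusion(t) * dw
        t += dt
    return x


# Hypothetical toy choices (not from the article): an Ornstein-Uhlenbeck forward SDE.
def ou_drift(x, t):
    return -0.5 * x


def ou_diffusion(t):
    return 1.0


rng = random.Random(0)
data = [rng.gauss(0.0, 0.5) for _ in range(2000)]  # toy data distribution N(0, 0.25)
t_end = 1.0
noised = [simulate_forward_sde(x, ou_drift, ou_diffusion, t_end, 100, rng) for x in data]

# The marginal is centred at zero, so the mean square estimates the variance.
empirical_var = sum(x * x for x in noised) / len(noised)
# Closed-form marginal variance of this toy process at time t_end.
exact_var = 0.25 * math.exp(-t_end) + 1.0 - math.exp(-t_end)
```

The empirical variance should agree with the closed-form value up to Monte Carlo and discretization error.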

(c) Neural approximation

For generative tasks, practitioners assume $[eqn]$ to be equal to a given prior distribution $[eqn]$ and simulate equation (2.2) or (2.3) to generate samples from $[eqn]$ . Typically, $[eqn]$ approximates $[eqn]$ and is an easy to sample from distribution that contains no information of $[eqn]$ , such as a Gaussian distribution with fixed mean and variance. However, solving equation (2.2) or (2.3) requires knowledge of the (Stein) score function $[eqn]$ for any $[eqn]$ , which is not known in general and must be approximated from data. Therefore, a neural network $[eqn]$ with model parameters $[eqn]$ is trained to approximate the score function from the data by minimizing the weighted score matching objective:

[eqn]

where $[eqn]$ is a positive weighting function.

$[eqn]$ in equation (2.4) cannot be minimized directly since we do not have access to the ground truth score $[eqn]$ . Therefore, in practice, a different objective has to be used [3,5,20]. In [5], the weighted denoising score-matching objective is considered, which is defined as

[eqn]

The difference between equations (2.4) and (2.5) is the replacement of the unknown ground truth score $[eqn]$ by the score of the perturbation kernel $[eqn]$, which can be determined analytically for many choices of forward SDEs. Note that for a fixed function $[eqn]$, objective (2.5) is equal to objective (2.4) up to an additive constant, which does not depend on the model parameters $[eqn]$. The reader can refer to [20] for the proof.

The choice of the weighting function $[eqn]$ determines the importance of score-matching at different noise scales. A principled choice is $[eqn]$ , known as the likelihood weighting due to its relation to likelihood-based training (see discussion in appendix E).
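The relationship between the two objectives can be checked numerically in a toy setting. The following sketch is an illustration under stated assumptions rather than the article's training code: it uses a one-dimensional Ornstein–Uhlenbeck forward SDE (drift `-x/2`, unit diffusion), whose Gaussian perturbation kernel is known analytically, data drawn from N(0, 1) so that the true score is `-x` at every time, a single fixed time, and a unit weighting. The names `kernel_mean_var` and `dsm_loss` are hypothetical.

```python
import math
import random


def kernel_mean_var(t):
    """Mean factor and variance of the OU perturbation kernel p(x_t | x_0) = N(m * x_0, v)."""
    return math.exp(-0.5 * t), 1.0 - math.exp(-t)


def dsm_loss(score_model, data, t, rng):
    """Monte Carlo estimate of the denoising score-matching loss at a single time t."""
    m, v = kernel_mean_var(t)
    total = 0.0
    for x0 in data:
        xt = m * x0 + math.sqrt(v) * rng.gauss(0.0, 1.0)  # sample from the kernel
        kernel_score = -(xt - m * x0) / v  # analytic score of the Gaussian kernel
        total += (score_model(xt, t) - kernel_score) ** 2
    return total / len(data)


data_rng = random.Random(0)
data = [data_rng.gauss(0.0, 1.0) for _ in range(20000)]  # toy data: N(0, 1)
t = 0.5

# With N(0, 1) data this toy forward process is stationary, so the true score is -x.
loss_true = dsm_loss(lambda x, t: -x, data, t, random.Random(1))
loss_zero = dsm_loss(lambda x, t: 0.0, data, t, random.Random(1))
gap = loss_zero - loss_true
```

Because the two objectives differ only by a constant, the gap between the two denoising losses should match the gap between the corresponding explicit score-matching losses, which in this toy case is E[x^2] = 1.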

Most implementations of neural score approximations parameterize the time-dependent score vector field directly with a neural network $[eqn]$ on some bounded domain $[eqn]$ . Such approximations generally result in a non-conservative vector, which therefore cannot be a gradient field of any scalar field (see, for example, figure 5 in appendix D). Since we know a priori that the target vector field $[eqn]$ is a gradient field, instead of learning $[eqn]$ , we consider a neural network $[eqn]$ such that $[eqn]$ approximates the log-density $[eqn]$ for any $[eqn]$ up to some normalizing constant. In other words, there exists a (time-dependent) normalizing constant $[eqn]$ such that

[eqn]

is a probability distribution. We write $[eqn]$ for the induced log-density, and we call the function $[eqn]$ a potential model. During training, the induced approximate score $[eqn]$ is computed by back-propagation through $[eqn]$ with respect to the input $[eqn]$ . This results in a score approximation that is provably a conservative vector field. Moreover, it enables us to calculate the time derivative of the approximate log-density (up to normalization) as $[eqn]$ by back-propagation through $[eqn]$ with respect to $[eqn]$ , which will be convenient when we introduce and evaluate a log-Fokker–Planck residual for $[eqn]$ in §2f.
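The difference between a potential parameterization and a direct score parameterization can be seen in a small two-dimensional check. The sketch below is only illustrative: `potential` and `direct_score` are hypothetical stand-ins for neural networks, and central finite differences replace back-propagation. It evaluates the scalar curl of each vector field, which vanishes for a gradient field.

```python
import math


def potential(x, y, t):
    """Hypothetical smooth potential model (a stand-in for a neural network)."""
    return -0.5 * (x * x + y * y) / (1.0 + t) + 0.1 * math.sin(x * y)


def score_from_potential(x, y, t, h=1e-4):
    """Spatial gradient of the potential, by central differences instead of back-propagation."""
    sx = (potential(x + h, y, t) - potential(x - h, y, t)) / (2 * h)
    sy = (potential(x, y + h, t) - potential(x, y - h, t)) / (2 * h)
    return sx, sy


def direct_score(x, y, t):
    """Hypothetical directly parameterized vector field containing a rotational part."""
    return -x / (1.0 + t) + 0.3 * y, -y / (1.0 + t) - 0.3 * x


def curl(field, x, y, t, h=1e-3):
    """Scalar curl d(s_y)/dx - d(s_x)/dy of a 2D field; zero for any gradient field."""
    d_sy_dx = (field(x + h, y, t)[1] - field(x - h, y, t)[1]) / (2 * h)
    d_sx_dy = (field(x, y + h, t)[0] - field(x, y - h, t)[0]) / (2 * h)
    return d_sy_dx - d_sx_dy


curl_potential = curl(score_from_potential, 0.7, -0.4, 0.5)
curl_direct = curl(direct_score, 0.7, -0.4, 0.5)
```

Here `curl_potential` is zero up to finite-difference error, while `curl_direct` is not, so the second field cannot be the gradient of any scalar function.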

(d) Approximate particle dynamics

The above neural approximations $[eqn]$ induce approximate versions of equation (2.2) and its deterministic flow equation (2.3). For ease of notation, we introduce the approximated reverse drift:

[eqn]

obtained by substituting the potential model into the drift of equation (2.2). Note that by the assumed properties of $[eqn]$ in assumptions 2.1, it follows that $[eqn]$ and $[eqn]$. Using the approximated reverse drift in equation (2.6), we obtain the reverse approximate SDE:

[eqn]

which can be regarded as an approximation of equation (2.2). Here, $[eqn]$ is adapted to the reverse time filtration $[eqn]$ and by assumptions 2.1, equation (2.7) has a unique $[eqn]$ -continuous solution. We denote the marginal density of $[eqn]$ satisfying equation (2.7) by $[eqn]$ at time $[eqn]$ and equip it with some terminal distribution $[eqn]$ of $[eqn]$ at time $[eqn]$ , i.e. $[eqn]$ , where $[eqn]$ is chosen to be a Gaussian approximation of $[eqn]$ . Thus the accuracy of the reverse flow of probability $[eqn]$ induced by equation (2.7) depends on the accuracy of the potential model. Applying the result of [19] to write equation (2.7) as a process measurable with respect to $[eqn]$ , we arrive at the forward approximate SDE, given by

[eqn]

where $[eqn]$ is drawn from $[eqn]$ . The associated probability flow ODE of the approximate SDE (in forward time) is

[eqn]

where $[eqn]$ is drawn from $[eqn]$ . Note that the associated densities to equations (2.7), (2.8) and (2.9) are all given by $[eqn]$ for $[eqn]$ .

Finally, an approximation of the probability flow ODE equation (2.3) arises by approximating $[eqn]$ in equation (2.3) by a neural network $[eqn]$ . This yields the approximate probability flow ODE (in forward time):

[eqn]

using the approximate forward drift

[eqn]

Here, $[eqn]$ is distributed according to $[eqn]$ . We denote the associated density $[eqn]$ for $[eqn]$ .

In summary, the original formulations, equations (2.1), (2.2) and (2.3), all have density $[eqn]$; the approximations obtained by approximating the reverse SDE equation (2.2), namely equations (2.7), (2.8) and (2.9), all have density $[eqn]$; and the approximation of the probability flow ODE equation (2.3) has density $[eqn]$. Moreover, there is a density $[eqn]$ implied directly by the neural approximation to the log-density. In general, we have that $[eqn]$.

For the majority of our calculations and numerics, it is more convenient to work with logarithms of densities rather than the densities themselves. For each density $[eqn]$ , we denote the associated log-density by $[eqn]$ and refer to $[eqn]$ as log-density or potential. That is, $[eqn]$ , $[eqn]$ , $[eqn]$ and $[eqn]$ for all $[eqn]$ .

In addition to considering the dynamics in forward time, we can also introduce the dynamics in reverse time. We denote the reverse time dynamics by $[eqn]$ for $[eqn]$ satisfying $[eqn]$ which implies that $[eqn]$ and $[eqn]$ for the initial and terminal conditions, respectively.

As we have a terminal condition $[eqn]$ for equation (2.7) and equation (2.7) is stated in forward time, the corresponding parameterization in reverse time can be useful for obtaining samples satisfying equation (2.7). It is given by

[eqn]

where we use the notation from §2a and the reverse time variable $[eqn]$ . We equip equation (2.12) with initial condition $[eqn]$ which is sampled from $[eqn]$ , and we denote the distribution of $[eqn]$ at time $[eqn]$ by $[eqn]$ . Note that $[eqn]$ for $[eqn]$ with $[eqn]$ . This implies that we can sample a particle from the target distribution $[eqn]$ by sampling $[eqn]$ from $[eqn]$ and solving equation (2.12) until time $[eqn]$ , for instance with the Euler–Maruyama scheme.

Similarly the ODE dynamics equation (2.10) can be written using the reverse time variable $[eqn]$ as

[eqn]

To sample from the approximate target distribution $[eqn]$ , sample an initial condition $[eqn]$ from $[eqn]$ and simulate equation (2.13) forward in time.
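The two sampling procedures described above can be sketched side by side. This is a minimal illustration under assumptions that are not taken from the article: the forward SDE is a one-dimensional Ornstein–Uhlenbeck process (drift `f = -x/2`, diffusion `g = 1`), the data distribution is N(0, 0.25), and the exact score is used in place of a trained model. The reverse-time drifts follow the standard forms from [5], `-f + g^2 * score` for the SDE and `-f + 0.5 * g^2 * score` for the probability flow ODE.

```python
import math
import random

SIGMA0_SQ = 0.25  # toy data distribution N(0, 0.25); an assumption of this sketch
T = 4.0  # time horizon


def marginal_var(t):
    """Variance of the forward marginal at time t for the toy OU process."""
    return SIGMA0_SQ * math.exp(-t) + 1.0 - math.exp(-t)


def score(x, t):
    """Exact score of the Gaussian marginal (stands in for a trained score model)."""
    return -x / marginal_var(t)


def forward_drift(x, t):
    return -0.5 * x


def sample(n_samples, n_steps, stochastic, rng):
    """Integrate the reverse-time SDE (Euler-Maruyama) or probability flow ODE (Euler)."""
    dt = T / n_steps
    out = []
    for _ in range(n_samples):
        y = rng.gauss(0.0, math.sqrt(marginal_var(T)))  # draw from the prior at time T
        for k in range(n_steps):
            t = T - k * dt  # forward time corresponding to reverse time k * dt
            if stochastic:
                y += (-forward_drift(y, t) + score(y, t)) * dt
                y += math.sqrt(dt) * rng.gauss(0.0, 1.0)
            else:
                y += (-forward_drift(y, t) + 0.5 * score(y, t)) * dt
        out.append(y)
    return out


sde_samples = sample(2000, 400, True, random.Random(0))
ode_samples = sample(2000, 400, False, random.Random(1))
var_sde = sum(y * y for y in sde_samples) / len(sde_samples)
var_ode = sum(y * y for y in ode_samples) / len(ode_samples)
```

With the exact score, both samplers recover the data variance up to discretization and Monte Carlo error. A gap between them can only appear once the score is replaced by an imperfect approximation, which is the situation the article studies.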

(e) Fokker–Planck equations

The evolution of the densities subject to some initial or terminal condition is described by Fokker–Planck equations. For the forward SDE equation (2.1), the density $[eqn]$ obeys the forward Fokker–Planck equation:

[eqn]

on $[eqn]$, equipped with the initial data $[eqn]$ on the full space $[eqn]$. For our analysis, we restrict ourselves to a bounded domain $[eqn]$ with $[eqn]$. To consider equation (2.14) on $[eqn]$, we equip it with positive Dirichlet boundary conditions. Let $[eqn]$ denote a positive function that is equal to $[eqn]$ on $[eqn]$. Note that we can assume without loss of generality that $[eqn]$ is positive on $[eqn]$ for $[eqn]$. This yields the forward Fokker–Planck equation (2.14) on the domain $[eqn]$ with initial data $[eqn]$ restricted to $[eqn]$ and Dirichlet boundary conditions $[eqn]$ on $[eqn]$. In addition, we set $[eqn]$ on $[eqn]$.

The density $[eqn]$ of the approximate SDE equation (2.7) also satisfies a forward Fokker–Planck equation, which can be derived by writing the Fokker–Planck equation for the forward dynamics equation (2.8) of $[eqn]$ with appropriate terminal distribution. This gives the approximate Fokker–Planck equation (in forward time):

[eqn]

on $[eqn]$, equipped with the terminal condition $[eqn]$ from assumptions 2.1 on the full space $[eqn]$, i.e. $[eqn]$ for all $[eqn]$, where $[eqn]$ is typically specified as a Gaussian approximation of $[eqn]$. To consider equation (2.15) on a bounded domain $[eqn]$, we introduce positive Dirichlet boundary conditions. Let $[eqn]$ denote a positive function that is equal to $[eqn]$ on $[eqn]$. We obtain the approximate Fokker–Planck equation (2.15) on the domain $[eqn]$ with terminal data $[eqn]$ restricted to $[eqn]$ and Dirichlet boundary conditions $[eqn]$ on $[eqn]$. We set $[eqn]$ on $[eqn]$.

Note that for $[eqn]$ , we always assume a fixed terminal condition at time $[eqn]$ when considering equation (2.15) in $[eqn]$ (or an initial condition when considering the evolution in reverse time $[eqn]$ ) as $[eqn]$ describes the flow of probability backwards from a Gaussian approximation $[eqn]$ of $[eqn]$ to some approximation of $[eqn]$ . Notice that equations (2.14) and (2.15) are of a similar form, apart from the different signs of the diffusion terms.

In addition to considering Fokker–Planck equations for the densities, one can also introduce log-Fokker–Planck equations for the potential. For the density $[eqn]$ satisfying the forward Fokker–Planck equation (2.14) for the forward SDE equation (2.1) and the associated potential $[eqn]$ we introduce the forward log-Fokker–Planck equation (in forward time) as

[eqn]

on $[eqn]$. On the domain $[eqn]$, we equip equation (2.16) with initial data $[eqn]$ restricted to $[eqn]$ and boundary conditions $[eqn]$ on $[eqn]$.
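The passage from a Fokker–Planck equation for a density to a log-Fokker–Planck equation for its potential is a short calculation. The worked step below is a sketch under an assumption: the forward SDE is taken to have the common form $dx = f(x,t)\,dt + g(t)\,dW_t$, and the symbols $p$, $u$, $f$ and $g$ are chosen for this illustration and need not match the article's notation.

```latex
\begin{align*}
\partial_t p &= -\nabla\cdot(f\,p) + \tfrac{1}{2}g^2\,\Delta p, \qquad p = e^{u},\\
\nabla p &= p\,\nabla u, \qquad \Delta p = p\left(\Delta u + \|\nabla u\|^2\right),\\
\partial_t u &= -\nabla\cdot f - \langle f, \nabla u\rangle + \tfrac{1}{2}g^2\left(\Delta u + \|\nabla u\|^2\right).
\end{align*}
```

The last line follows by substituting the second line into the first and dividing by $p > 0$. The quadratic term $\|\nabla u\|^2$ makes the log-density equation nonlinear even though the density equation is linear.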

For $[eqn]$ solving equation (2.15) in forward time, the log-density $[eqn]$ satisfies the approximate log-Fokker–Planck equation (in forward time) given by

[eqn]

on $[eqn]$ . On the domain $[eqn]$ , we again equip equation (2.17) with terminal data $[eqn]$ restricted to $[eqn]$ and boundary conditions $[eqn]$ on $[eqn]$ .

(f) Fokker–Planck residuals

Our analysis in §3 is concerned with quantifying how the consistency of the neural approximation with the underlying Fokker–Planck equations is related to the ODE–SDE gap observed in practice. To measure this consistency, we derive a residual for the log-Fokker–Planck equation governing the evolution of the potential. In our theory, we use the residual as an error measure, and in our numerics, we add it as a regularization term to couple with the denoising score-matching objective $[eqn]$ in equation (2.5). Details on the implementation are given in §4.

Restricting ourselves to a bounded domain $[eqn]$ with $[eqn]$ , we consider a neural approximation $[eqn]$ with parameters $[eqn]$ , which learns the solution $[eqn]$ of the approximate log-Fokker–Planck equation (2.17) such that $[eqn]$ satisfies appropriate Dirichlet boundary conditions $[eqn]$ and terminal condition $[eqn]$ . This boundary condition ensures that the solution to equation (2.17) on Ω is everywhere equal to its corresponding solution on an unbounded domain.

Note that setting $[eqn]$ in equation (2.6) makes the probability flow ODE of the approximate SDE, equation (2.9) (with density $[eqn]$), coincide with the approximate probability flow ODE, equation (2.10) (with density $[eqn]$). Thus, the gap between these two generative processes relates to the consistency of our neural approximation with the log-Fokker–Planck equation (2.17) associated with the approximate SDE. To measure the Fokker–Planck consistency, we substitute our model $[eqn]$ into equation (2.17) and measure the differential operator residual in the $[eqn]$-norm. Manipulating this residual, we obtain

[eqn]

on $[eqn]$. This demonstrates that the residual for the forward log-Fokker–Planck equation (2.16) is equivalent to the residual for the approximate log-Fokker–Planck equation (2.17), so it is sufficient to consider only the residual of the forward equation. Hence, we define the residual of the log-Fokker–Planck equation for the approximate reverse SDE equation (2.7) for any $[eqn]$ as

[eqn]

where $[eqn]$ is the volume of $[eqn]$ . We refer to $[eqn]$ as the log-Fokker–Planck residual. This residual quantifies how well our approximation $[eqn]$ agrees with the true solution $[eqn]$ to the approximate log-Fokker–Planck equation. In §3, we show how the values attained by this residual define an upper bound on the ODE–SDE discrepancy.
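To make the idea of a log-Fokker–Planck residual concrete, the sketch below evaluates one by finite differences on a one-dimensional grid. It is an illustration under assumptions, not the article's implementation: the forward SDE is a toy Ornstein–Uhlenbeck process (drift `-x/2`, unit diffusion), the residual is a plain mean of squares over grid points at a single time, so its form and normalization may differ from equation (2.19), and `perturbed_log_density` is a hypothetical stand-in for an imperfect potential model.

```python
import math


def marginal_var(t, sigma0_sq=0.25):
    """Variance of the toy OU marginal started from N(0, sigma0_sq)."""
    return sigma0_sq * math.exp(-t) + 1.0 - math.exp(-t)


def exact_log_density(x, t):
    v = marginal_var(t)
    return -0.5 * x * x / v - 0.5 * math.log(2.0 * math.pi * v)


def perturbed_log_density(x, t):
    """Hypothetical imperfect potential: the exact log-density plus a small error."""
    return exact_log_density(x, t) + 0.05 * t * math.sin(3.0 * x)


def log_fp_residual(u, x, t, h=1e-3):
    """Pointwise residual of u_t = -div(f) - f u_x + 0.5 g^2 (u_xx + u_x^2), via central differences."""
    u_t = (u(x, t + h) - u(x, t - h)) / (2 * h)
    u_x = (u(x + h, t) - u(x - h, t)) / (2 * h)
    u_xx = (u(x + h, t) - 2.0 * u(x, t) + u(x - h, t)) / (h * h)
    f, div_f, g_sq = -0.5 * x, -0.5, 1.0  # toy OU drift, its divergence, squared diffusion
    return u_t + div_f + f * u_x - 0.5 * g_sq * (u_xx + u_x * u_x)


def mean_squared_residual(u, t, lo=-3.0, hi=3.0, n=121):
    """Average squared residual over a uniform grid on [lo, hi] at a fixed time t."""
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return sum(log_fp_residual(u, x, t) ** 2 for x in xs) / n


res_exact = mean_squared_residual(exact_log_density, 1.0)
res_perturbed = mean_squared_residual(perturbed_log_density, 1.0)
```

The exact log-density gives a residual that is zero up to finite-difference error, while the perturbed potential gives a clearly positive value. A quantity of this kind is what a regularization term would penalize during training.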

Note that analogous calculations hold for the residual in the standard Fokker–Planck equation (2.15) and the score-Fokker–Planck equation discussed in §3c. In all cases, the residual corresponding to their forward Fokker–Planck equations is equal to the residual of the Fokker–Planck equation of the generative process, and these residuals relate to the ODE–SDE gap.

We remark that the bounded domain assumption made at the beginning of this subsection arises from the PDE analysis we apply to relate the neural approximation to the Fokker–Planck equation. This contrasts with other works deriving bounds for diffusion models, since they do not investigate this relation and typically only consider generation under an $[eqn]$-score approximation. In practice, this assumption is not restrictive, since for any data distribution $[eqn]$ and $[eqn]$ it is possible to choose a bounded domain $[eqn]$ such that $[eqn]$ for all times $[eqn]$. Therefore, the probability that a trajectory escapes this domain is small in practice.

Theoretical results on the ODE–SDE gap

In this section, we investigate the gap between the ODE- and SDE-induced distributions in terms of Fokker–Planck equations. More precisely, we derive bounds related to the approximate log-Fokker–Planck equation (2.17) in §3a, and in §3b we show that this theory applies to the associated potential model. In §3c, we provide a sketch of how to derive analogous bounds in terms of the approximate score-Fokker–Planck equation, thus addressing the common score parameterization. All these results are based on the following:

Assumptions 3.1. Let $[eqn]$ and $[eqn]$ be given, let $[eqn]$ be a bounded domain with $[eqn]$ . Assume that $[eqn]$ satisfies equation (2.15) on $[eqn]$ with terminal condition $[eqn]$ restricted to $[eqn]$ and Dirichlet boundary conditions $[eqn]$ on $[eqn]$ , with $[eqn]$ . Further, let $[eqn]$ be the probability density associated with equation (2.10) with terminal condition $[eqn]$ restricted to $[eqn]$ .

(a) The ODE–SDE gap for the approximate Fokker–Planck equation

We show that, at a fixed time $[eqn]$, $[eqn]$ satisfying the approximate Fokker–Planck equation (2.15) converges to the density $[eqn]$ of the approximate probability flow ODE equation (2.10) with respect to the Wasserstein 2-distance $[eqn]$ as the log-Fokker–Planck residual $[eqn]$ in equation (2.19) goes to zero.

Theorem 3.1. Assume that assumptions 2.1 and 3.1 hold. Further, assume that the neural network $[eqn]$ is determined such that $[eqn]$ obeys the terminal condition $[eqn]$ restricted to $[eqn]$ and Dirichlet boundary conditions $[eqn]$ on $[eqn]$ , and $[eqn]$ in (2.19) satisfies $[eqn]$ . Then, $[eqn]$ for some constant $[eqn]$ independent of $[eqn]$ .

We provide the proof of theorem 3.1 in appendix A. The constant $[eqn]$ in theorem 3.1 depends on the time horizon $[eqn]$ as well as the Lipschitz constants of $[eqn]$ and $[eqn]$ . Details of these dependencies are specified in the proof.

(b) The ODE–SDE gap for the potential model associated with the approximate Fokker–Planck equation

A key benefit of score-based models is that scores are agnostic to multiplicative scaling of the underlying density, implying that known normalizing constants are not required for their implementation. So far, we have implicitly assumed that $[eqn]$ is an approximation of $[eqn]$ for a density $[eqn]$, and hence that the integral of $[eqn]$ is normalized, which is non-trivial in practice. To overcome this issue, we introduce an unnormalized network $[eqn]$ as a potential model and relate it to $[eqn]$ by introducing a (potentially time-varying) normalizing constant $[eqn]$ for $[eqn]$. This gives the relation

[eqn]

which we also use to obtain the terminal data $[eqn]$ and boundary conditions $[eqn]$ on $[eqn]$ . The bounds on the ODE–SDE gap in theorem 3.1 also hold when considering $[eqn]$ instead of $[eqn]$ in the Fokker–Planck residual equation (2.19), as shown in appendix B.

(c) The ODE–SDE gap for the approximate score-Fokker–Planck equation

In this work, we primarily focus on the underlying connection between the ODE, the SDE and Fokker–Planck equations observed in score-based diffusion models, and thus our focus has been on the potential parameterization discussed so far. However, in most practical implementations, a score parameterization is adopted due to computational efficiency, given by $[eqn]$ for the density $[eqn]$ and the log-density $[eqn]$ of the forward SDE equation (2.1). The score parameterization is linked to the score-Fokker–Planck equation, see e.g. [12]. To ensure applicability of our results to this case, we argue in this section that the bounds on the ODE–SDE gap in theorem 3.1 also hold for the score parameterization.

A score-Fokker–Planck equation can be derived by taking the gradient of the associated log-Fokker–Planck equation. To derive an analogous result to theorem 3.1, we are interested in the score-Fokker–Planck equation of the approximate SDE with log-Fokker–Planck equation (2.17). Taking the gradient of equation (2.17) and setting $[eqn]$ yields the approximate score-Fokker–Planck equation:

[eqn]

Here, we use analogous notation to the potential case so that $[eqn]$ is the true score associated with the approximate reverse SDE equation (2.7) and is linked with the density $[eqn]$ via $[eqn]$ . Similarly, we also introduce the score $[eqn]$ associated with the approximate probability flow ODE equation (2.10). As before, we consider appropriate terminal data and Dirichlet boundary conditions.

Let $[eqn]$ denote a score model approximating $[eqn]$ . Following an analogous calculation to equation (2.18), the residual corresponding to the approximate score-Fokker–Planck equation (3.2) can be written as the residual of a score-Fokker–Planck equation, and we define the score-Fokker–Planck residual by

[eqn]

We can now state an analogous result to theorem 3.1 for the score-Fokker–Planck equation:

Theorem 3.2. Assume that assumptions 2.1 and 3.1 hold. Further, assume that $[eqn]$ is determined such that $[eqn]$ and equipped with appropriate terminal and Dirichlet boundary conditions. Then, $[eqn]$ for some $[eqn]$ independent of $[eqn]$ .

Theorem 3.2 can be derived by applying Steps I and II of the proof of theorem 3.1 analogously, generalizing them to vector-valued functions as appropriate. More precisely, Step I of the proof has to be generalized to vector-valued solutions $[eqn]$ of equation (3.2) as opposed to the scalar solution $[eqn]$ of equation (2.17), but due to the similarity of the equations, this step can be done analogously. Step II follows as in the proof of theorem 3.1. Due to the similarity of the proofs, the detailed proof is omitted here.

Numerical experiments

To demonstrate our analytical results numerically and ensure visual interpretability, we implement several diffusion models in $[eqn]$ that attain a range of log-Fokker–Planck residual values (equation (2.19)) using various toy datasets. For the forward SDE, we choose $[eqn]$ and $[eqn]$, resulting in the simple Ornstein–Uhlenbeck process $[eqn]$. Following equation (2.16), the associated log-Fokker–Planck equation is given by $[eqn]$. In our experiments, we take three different data distributions and train a neural network to minimize the loss function $[eqn]$ for differing values of $[eqn]$, where $[eqn]$ and $[eqn]$ are defined in equations (2.5) and (2.19), respectively, and $[eqn]$ is set according to the likelihood weighting. Note that for our specific setting, we have

[eqn]

Therefore, both the denoising score matching objective $[eqn]$ and the log-Fokker–Planck residual $[eqn]$ are approximated using Monte Carlo estimation. We set $[eqn]$ and the likelihood weighting implies $[eqn]$ for $[eqn]$. Note that we do not add terms that enforce boundary conditions in space or time, since the denoising score matching objective (2.5) already encourages consistency with these conditions (up to a multiplicative constant proportional to the underlying density). We generate samples from $[eqn]$ and $[eqn]$ using Euler–Maruyama and Euler discretizations of the reverse approximate SDE equation (2.7) and the reverse approximate probability flow ODE equation (2.10), respectively. To validate our results, we generate three million samples from each distribution. Due to computational constraints, these samples are then discretized onto a $[eqn]$ grid. When computing the Wasserstein distances to the target distribution and producing visualizations, we use the discretized distributions. Figure 1 shows the target distributions for our experiments.

Figure 1. Left to right, our three examples are a Gaussian mixture, a concentric circles distribution and a checkerboard distribution.

We choose two-dimensional examples to ensure we can explicitly see the behaviour of the distributions and measure $[eqn]$ distances. Our examples cover the analytically solvable Gaussian mixture case, a smooth concentric circles distribution and a discontinuous checkerboard distribution to cover a range of scenarios of potential interest.
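The two discretizations used for sampling can be illustrated with a minimal sketch, given here under loudly stated assumptions and not as the experimental code of this article. It assumes the forward SDE $dx = -\tfrac{1}{2}x\,dt + dW$ in $\mathbb{R}^2$, Gaussian data $N(0, S_0^2 I)$ so that the true score is available in closed form, and a hypothetical time horizon `T` and step count `N_STEPS`. A mis-scaled score (`scale=0.8`) stands in for an imperfect score model. Because all distributions involved are isotropic zero-mean Gaussians in this toy case, the Wasserstein 2-distance between two of them reduces to $\sqrt{2}\,|\sigma_a - \sigma_b|$, which `w2_gaussian_fit` estimates from samples.

```python
import numpy as np

S0 = 0.5        # hypothetical standard deviation of the Gaussian data
T = 5.0         # hypothetical time horizon
N_STEPS = 500   # hypothetical number of discretization steps


def sigma2(t):
    """Variance of the forward marginal for dx = -x/2 dt + dW and N(0, S0^2 I) data."""
    return S0**2 * np.exp(-t) + 1.0 - np.exp(-t)


def score(x, t, scale=1.0):
    """Score of the Gaussian forward marginal; scale != 1 mimics an imperfect model."""
    return -scale * x / sigma2(t)


def sample(n, stochastic, scale, seed):
    """Integrate from t = T down to t = 0 with step h = T / N_STEPS."""
    rng = np.random.default_rng(seed)
    h = T / N_STEPS
    x = rng.standard_normal((n, 2))          # prior N(0, I) at time T
    for k in range(N_STEPS):
        t = T - k * h
        f = -0.5 * x                          # drift of the forward SDE (g = 1)
        s = score(x, t, scale)
        if stochastic:
            # Euler-Maruyama step of the reverse SDE dx = [f - g^2 s] dt + g dW
            x = x - h * (f - s) + np.sqrt(h) * rng.standard_normal(x.shape)
        else:
            # Euler step of the probability flow ODE dx/dt = f - (g^2 / 2) s
            x = x - h * (f - 0.5 * s)
    return x


def w2_gaussian_fit(a, b):
    """W2 between isotropic zero-mean Gaussians fitted to two 2-D sample sets."""
    return np.sqrt(2.0) * abs(a.std() - b.std())


for scale in (1.0, 0.8):
    x_sde = sample(20000, True, scale, seed=0)
    x_ode = sample(20000, False, scale, seed=1)
    print(scale, x_sde.std(), x_ode.std(), w2_gaussian_fit(x_sde, x_ode))
```

In this linear Gaussian setting the two samplers agree closely when the score is exact and separate once the score is mis-scaled. The same score model is used for both samplers in each run, so any gap comes from the dynamics and not from the model.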

We parameterize our potential model $[eqn]$ by a fully connected neural network with two hidden layers of 80 nodes. We apply softplus activation functions, which have well defined first and second derivatives as required to evaluate $[eqn]$ . Each model is trained for 100 000 iterations using Adam with learning rate decaying from $[eqn]$ down to $[eqn]$ . Figure 2 shows the samples obtained from $[eqn]$ and $[eqn]$ for different weighting parameters $[eqn]$ .
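The remark on softplus activations can be made concrete with a short sketch. It is a hypothetical NumPy illustration, not the training code used here: it assumes the network takes $(x_1, x_2, t)$ as input, uses random untrained weights from the illustrative helper `init_params`, and computes the spatial gradient (the score of the potential) and the Laplacian by hand via the chain rule. Both rely on the facts that the derivative of softplus is the sigmoid and that its second derivative is sigmoid times one minus sigmoid.

```python
import numpy as np


def softplus(z):
    return np.logaddexp(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(rng, widths=(3, 80, 80, 1)):
    """Random weights for a fully connected net mapping (x1, x2, t) to a scalar."""
    return [(rng.standard_normal((m, n)) / np.sqrt(n), np.zeros(m))
            for n, m in zip(widths[:-1], widths[1:])]


def potential(params, x, t):
    """Scalar potential at a single point x in R^2 and time t."""
    (W1, b1), (W2, b2), (W3, b3) = params
    a1 = softplus(W1 @ np.append(x, t) + b1)
    a2 = softplus(W2 @ a1 + b2)
    return (W3 @ a2 + b3)[0]


def score_and_laplacian(params, x, t):
    """Exact spatial gradient and Laplacian of the potential via the chain rule."""
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = W1 @ np.append(x, t) + b1
    s1, a1 = sigmoid(h1), softplus(h1)
    h2 = W2 @ a1 + b2
    s2 = sigmoid(h2)
    J1 = s1[:, None] * W1[:, :2]                       # d a1 / d x, shape (80, 2)
    Jh2 = W2 @ J1                                      # d h2 / d x, shape (80, 2)
    grad = W3[0] @ (s2[:, None] * Jh2)                 # d u / d x, shape (2,)
    H1 = (s1 * (1.0 - s1))[:, None] * W1[:, :2] ** 2   # d^2 a1 / d x_i^2
    Hh2 = W2 @ H1                                      # d^2 h2 / d x_i^2
    H2 = (s2 * (1.0 - s2))[:, None] * Jh2 ** 2 + s2[:, None] * Hh2
    lap = np.sum(W3[0] @ H2)                           # sum of d^2 u / d x_i^2
    return grad, lap


rng = np.random.default_rng(0)
params = init_params(rng)
grad, lap = score_and_laplacian(params, np.array([0.3, -0.7]), 0.4)
print(grad, lap)
```

With a piecewise-linear activation such as ReLU, the second derivative would vanish almost everywhere and the Laplacian term would carry no information, which is one way to read the choice of a smooth activation. In practice an automatic differentiation framework would replace the hand-written derivatives.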

Figure 2. Distributions of $p_\theta^{\mathrm{SDE}}(\cdot, 0)$ and $p_\theta^{\mathrm{ODE}}(\cdot, 0)$ for weighting parameters $w_R$ taking values in $(0, 0.1, 1, 10)$. The rows indicate which weighting parameter was used, while the columns indicate whether the displayed distribution is of $p_\theta^{\mathrm{SDE}}(\cdot, 0)$ or of $p_\theta^{\mathrm{ODE}}(\cdot, 0)$ in the corresponding experiment. Samples displayed from the ODE and SDE samplers were obtained using the same score model.

We see from figure 2 that if we only optimize $[eqn]$ (i.e. for $[eqn]$) the resulting $[eqn]$ is quite different from the true distribution. Notably, areas of high probability in $[eqn]$ do coincide with high probability regions of $[eqn]$. Therefore, in typical generative modelling scenarios, it may be difficult to identify this mischaracterization of the data distribution, given that individual samples generated from $[eqn]$ are generally plausible. Visually, we see that adding a factor of $[eqn]$ to the loss function initially results in an improvement in $[eqn]$. The tables and figures in appendix C further detail these results. In table 2, we see that the distance between $[eqn]$ and $[eqn]$ consistently reduces for $[eqn]$ and $[eqn]$ when compared with $[eqn]$, indicating an improvement in ODE samples. Increasing $[eqn]$ beyond this further reduces the gap between $[eqn]$ and $[eqn]$ as listed in table 1; however, this comes at the cost of increasing the distance from both $[eqn]$ and $[eqn]$ to $[eqn]$, which can be observed in tables 2 and 3 for $[eqn]$. This can clearly be seen in figure 2 in the overly smoothed distributions that are attained with higher $[eqn]$. Table 3 shows that the quality of $[eqn]$ degrades monotonically with increasing $[eqn]$, which results in the negative correlation between $[eqn]$ and $[eqn]$ observed in figure 4. From this, we conclude that the cost of improving $[eqn]$ is a reduction in the quality of $[eqn]$. Finally, in figure 3, we evaluate $[eqn]$ for each of our trained models and visualize the relation between $[eqn]$ and the associated $[eqn]$ values. This demonstrates a clear positive correlation supporting our theoretical analysis, where we proved an upper bound on the Wasserstein 2-distance between the ODE- and SDE-induced distributions in terms of a log-Fokker–Planck residual $[eqn]$.

Conclusions

In this work, we conducted a systematic investigation into the dynamics that arise in score-based diffusion models. We mainly focused on the differences between the generative densities $[eqn]$ and $[eqn]$ defined by the reverse approximate SDE and the approximate probability flow ODE, respectively. Analytically, we proved that the discrepancy between $[eqn]$ and $[eqn]$ can be bounded by a log-Fokker–Planck residual in the Wasserstein 2-distance, thus giving a deeper insight into the connection between the two generative distributions in terms of the Fokker–Planck dynamics underlying the diffusion process. Numerically, we showed that $[eqn]$ and $[eqn]$ can differ substantially when the neural network is trained using the standard score-matching objective. Our numerical experiments also demonstrate that penalizing the loss function by the log-Fokker–Planck residual indeed leads to closing the gap between the ODE and the SDE distributions in the Wasserstein 2-distance. Our findings revealed that imposing this additional constraint within our loss function could improve the quality of $[eqn]$ when compared with the ground truth, though in exchange for this we observed concurrent degradation in the quality of $[eqn]$ . The practical implication of these findings is that enforcing self-consistency through penalization by the log-Fokker–Planck residual is unlikely to improve state-of-the-art generation using stochastic samplers. However, for downstream tasks where deterministic generation is required, such penalization could provide a potential avenue to improve sample quality and likelihood accuracy.

Bibliography

1. Ho J, Jain A, Abbeel P. 2020 Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 33, 6840–6851.
2. Sohl-Dickstein J, Weiss E, Maheswaranathan N, Ganguli S. 2015 Deep unsupervised learning using nonequilibrium thermodynamics. In Proc. of the 32nd Int. Conf. on Machine Learning (eds F Bach, D Blei). Proc. of Machine Learning Research, vol. 37, pp. 2256–2265. Lille, France: PMLR.
3. Hyvärinen A. 2005 Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res. 6, 695–709.
4. Song Y, Ermon S. 2019 Generative modeling by estimating gradients of the data distribution. In Proc. of the 33rd Int. Conf. on Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates Inc.
5. Song Y, Sohl-Dickstein J, Kingma DP, Kumar A, Ermon S, Poole B. 2021 Score-based generative modeling through stochastic differential equations. In 9th international conference on learning representations. OpenReview.net. See https://openreview.net/forum?id=PxTIG12RRHS.
6. Dhariwal P, Nichol AQ. 2021 Diffusion models beat GANs on image synthesis. In Advances in neural information processing systems (eds M Ranzato, A Beygelzimer, Y Dauphin, PS Liang, J Wortman Vaughan), pp. 8780–8794. Curran Associates, Inc. See https://proceedings.neurips.cc/paper_files/paper/2021/file/49ad23d1ec9fa4bd8d77d02681df5cfa-Paper.pdf.
7. Benton J, De Bortoli V, Doucet A, Deligiannidis G. 2024 Nearly d-linear convergence bounds for diffusion models via stochastic localization. In The twelfth international conference on learning representations. Vienna, Austria: OpenReview.net. See https://openreview.net/forum?id=r5njV3BsuD.
8. Chen S, Chewi S, Li J, Li Y, Salim A, Zhang A. 2023 Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. In The eleventh international conference on learning representations. OpenReview.net. See https://openreview.net/forum?id=zyLVMgsZ0U_.