Parametric Fokker-Planck equation

Wuchen Li; Shu Liu; Hongyuan Zha; Haomin Zhou

arXiv:1903.10076·math.OC·June 16, 2020·GSI


TL;DR

This paper derives a parametric version of the Fokker-Planck equation as a Wasserstein gradient flow on the statistical manifold, simplifying it to a finite-dimensional ODE with analytical and numerical examples.

Contribution

It introduces a novel derivation of the Fokker-Planck equation on parametric spaces, connecting PDEs with finite-dimensional ODEs on parameter manifolds.

Findings

01 Derived the parametric Fokker-Planck equation as a Wasserstein gradient flow.

02 Reduced the PDE to a finite-dimensional ODE on parameter space.

03 Provided analytical and numerical examples demonstrating the approach.

Abstract

We derive the Fokker-Planck equation on the parametric space. It is the Wasserstein gradient flow of relative entropy on the statistical manifold. We pull back the PDE to a finite-dimensional ODE on parameter space. Analytical and numerical examples are presented.


Equations

$$\frac{\partial\rho(t,x)}{\partial t}=\nabla\cdot(\rho(t,x)\nabla V(x))+\beta\Delta\rho(t,x),\quad\rho(0,x)=\rho_{0}(x).$$

$$d\boldsymbol{X}_{t}=-\nabla V(\boldsymbol{X}_{t})\,dt+\sqrt{2\beta}\,d\boldsymbol{B}_{t},\quad\boldsymbol{X}_{0}\sim\rho_{0}.$$

$$\mathcal{P}=\Big\{\rho\colon\int\rho(x)\,dx=1,\ \rho(x)\geq 0,\ \int|x|^{2}\rho(x)\,dx<\infty\Big\}$$

$$T_{\rho}\mathcal{P}=\Big\{\dot{\rho}\colon\int\dot{\rho}(x)\,dx=0\Big\}.$$

$$g^{W}(\rho)(\dot{\rho}_{1},\dot{\rho}_{2})=\int\nabla\psi_{1}(x)\cdot\nabla\psi_{2}(x)\,\rho(x)\,dx,$$

$$\mathrm{grad}_{W}\mathcal{F}(\rho)=g^{W}(\rho)^{-1}\Big(\frac{\delta\mathcal{F}}{\delta\rho}\Big)(x)=-\nabla\cdot\Big(\rho(x)\nabla\frac{\delta}{\delta\rho(x)}\mathcal{F}(\rho)\Big),$$

$$\mathcal{F}(\rho)=\beta\int\rho(x)\log\frac{\rho(x)}{e^{-V(x)/\beta}}\,dx=\int V(x)\rho(x)\,dx+\beta\int\rho(x)\log\rho(x)\,dx.$$

$$\frac{\partial\rho}{\partial t}=-\mathrm{grad}_{W}\mathcal{F}(\rho)=\nabla\cdot(\rho\nabla V)+\beta\nabla\cdot(\rho\nabla\log\rho).$$

$$G(\theta)=\int\nabla\boldsymbol{\Psi}(T_{\theta}(x))\nabla\boldsymbol{\Psi}(T_{\theta}(x))^{T}\,dp(x),$$

$$G_{ij}(\theta)=\int\nabla\psi_{i}(T_{\theta}(x))\cdot\nabla\psi_{j}(T_{\theta}(x))\,dp(x),\quad 1\leq i,j\leq m.$$

$$\nabla\cdot(\rho_{\theta}\nabla\psi_{k}(x))=\nabla\cdot(\rho_{\theta}\,\partial_{\theta_{k}}T_{\theta}(T_{\theta}^{-1}(x))).$$

$$\begin{aligned}\int\phi(y)\frac{\partial\rho_{\theta_{t}}}{\partial t}(y)\,dy&=\int\dot{\theta}_{t}^{T}\,\partial_{\theta}T_{\theta_{t}}(T_{\theta_{t}}^{-1}(x))\,\nabla\phi(x)\,\rho_{\theta_{t}}(x)\,dx\\&=\int\phi(x)\Big(-\nabla\cdot\big(\rho_{\theta_{t}}\,\partial_{\theta}T_{\theta_{t}}(T_{\theta_{t}}^{-1}(x))^{T}\dot{\theta}_{t}\big)\Big)\,dx\end{aligned}$$

$$(T_{\#}|_{\theta})_{*}\xi(\theta)=\frac{\partial\rho_{\theta_{t}}}{\partial t}\Big|_{t=0}=-\nabla\cdot\big(\rho_{\theta}\,\partial_{\theta}T_{\theta}(T_{\theta}^{-1}(x))^{T}\,\xi(\theta)\big)$$

$$G(\theta)(\xi(\theta),\xi(\theta))=g^{W}(\rho_{\theta})\big((T_{\#}|_{\theta})_{*}\xi(\theta),(T_{\#}|_{\theta})_{*}\xi(\theta)\big)$$

$$\frac{\partial\rho_{\theta_{t}}}{\partial t}\Big|_{t=0}=-\nabla\cdot(\rho_{\theta}\nabla\varphi(x))$$

$$\nabla\cdot(\rho_{\theta}\nabla\varphi(x))=\nabla\cdot\big(\rho_{\theta}\,\partial_{\theta}T_{\theta}(T_{\theta}^{-1}(\cdot))^{T}\xi(\theta)\big)$$

$$G(\theta)(\xi,\xi)=\int|\nabla\boldsymbol{\Psi}(T_{\theta}(x))^{T}\xi|^{2}\,dp(x)=\xi^{T}\Big(\int\nabla\boldsymbol{\Psi}(T_{\theta}(x))\nabla\boldsymbol{\Psi}(T_{\theta}(x))^{T}\,dp(x)\Big)\xi$$

$$G(\theta)=\int\partial_{\theta}T_{\theta}(x)^{T}\,\partial_{\theta}T_{\theta}(x)\,dp(x).$$

$$F(\theta)=\mathcal{F}(\rho_{\theta})=\int V(x)\rho_{\theta}(x)\,dx+\beta\int\rho_{\theta}(x)\log\rho_{\theta}(x)\,dx.$$

$$\dot{\theta}=-G(\theta)^{-1}\nabla_{\theta}F(\theta).$$

$$V(x)=\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\quad\text{and}\quad\rho_{0}\sim\mathcal{N}(\mu_{0},\Sigma_{0}).$$

$$\rho_{\theta}(x)={T_{\theta}}_{\#}p(x)=\frac{f(T_{\theta}^{-1}(x))}{|\det(\Gamma)|}=\frac{f(\Gamma^{-1}(x-b))}{|\det(\Gamma)|},\quad f(x)=\frac{\exp(-\frac{1}{2}|x|^{2})}{(2\pi)^{d/2}}.$$

$$\nabla\Big(\frac{\delta\mathcal{F}(\rho_{\theta})}{\delta\rho}\Big)\circ T_{\theta}(x)=\nabla(V+\beta\log\rho_{\theta})\circ T_{\theta}(x)=\Sigma^{-1}(\Gamma x+b-\mu)-\beta\Gamma^{-T}x$$

$$\dot{\Gamma}=-\Sigma^{-1}\Gamma+\beta\Gamma^{-T},$$

$$\dot{b}=\Sigma^{-1}(\mu-b).$$

$$G_{ij}(\theta)=\mathbb{E}_{\boldsymbol{X}\sim p}\big[\varphi_{i}(\boldsymbol{X})\varphi_{j}(\boldsymbol{X})\big],\quad 1\leq i,j\leq m$$

$$\int\rho_{\theta}(x)\log\rho_{\theta}(x)\,dx=\sup_{h}\Big\{\int h(x)\rho_{\theta}(x)\,dx-\int e^{h(x)}\,dx\Big\}+1$$

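$$\nabla_{\theta}F(\theta)=\mathbb{E}_{\boldsymbol{X}\sim p}\big[\partial_{\theta}T_{\theta}(\boldsymbol{X})^{T}\,(\nabla V+\beta\nabla h^{*})(T_{\theta}(\boldsymbol{X}))\big]$$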


Full text

Parametric Fokker-Planck equation

Wuchen Li (1), Shu Liu (2), Hongyuan Zha (2), Haomin Zhou (2)

(1) University of California, Los Angeles
(2) Georgia Institute of Technology


Keywords: Optimal transport · Information geometry · Statistical manifold · Fokker-Planck equation · Gradient flow

1 Introduction

The Fokker-Planck equation, a linear evolution partial differential equation (PDE), plays a crucial role in stochastic calculus, statistical physics and modeling [14, 17, 19]. Recently, its importance in statistics and machine learning has also been recognized [11, 16, 18]. The Fokker-Planck equation describes the evolution of the density function of a stochastic process driven by a stochastic differential equation (SDE).

There is another viewpoint on the Fokker-Planck equation, based on optimal transport theory. It treats the equation as the gradient flow of the relative entropy on the probability manifold equipped with the Wasserstein metric [5, 15]. Recently, these studies have been extended to information geometry [1, 2, 3], creating a new area known as Wasserstein information geometry [7, 9, 10]. Inspired by those studies, in this paper we derive the metric tensor on parameter space by pulling back the Wasserstein metric via the parameterized pushforward map. We then compute the Wasserstein gradient flow (an ODE system) of the relative entropy defined on parameter space. This leads to a statistical manifold version of the Fokker-Planck equation, which can be viewed as an approximation of the original PDE.

Our work has two motivations: (1) reducing the evolution PDE to a finite-dimensional ODE system on parameter space; (2) applying the parameterized pushforward map to obtain an efficient method for generating samples from the SDE. This differs from Markov chain Monte Carlo (MCMC) methods [12] and moment methods [17]. In this brief presentation, we sketch the theoretical framework and illustrate it on several examples. The complete results will be reported in an extended version [13].

2 Parametric Fokker-Planck equation

In this section, we briefly review the fact that the Fokker-Planck equation is the Wasserstein gradient flow of the relative entropy. We then introduce a Wasserstein statistical manifold generated by a parameterized mapping function. Based on it, we derive the parametric Fokker-Planck equation as the gradient flow of the parameterized relative entropy.

2.1 Fokker-Planck equation

Consider the Fokker-Planck equation:

$$\frac{\partial\rho(t,x)}{\partial t}=\nabla\cdot(\rho(t,x)\nabla V(x))+\beta\Delta\rho(t,x),\quad\rho(0,x)=\rho_{0}(x).\tag{1}$$

Here $\nabla\cdot$ and $\nabla$ are the divergence and gradient operators in $\mathbb{R}^{d}$, $\nabla V$ is the drift function, and $\beta>0$ is a diffusion constant. Equation (1) admits several interpretations.

On the one hand, consider the stochastic differential equation:

$$d\boldsymbol{X}_{t}=-\nabla V(\boldsymbol{X}_{t})\,dt+\sqrt{2\beta}\,d\boldsymbol{B}_{t},\quad\boldsymbol{X}_{0}\sim\rho_{0}.$$

Here $\{\boldsymbol{B}_{t}\}_{t\geq 0}$ is the standard Brownian motion. It is well known that the density function $\rho(t,x)$ of the stochastic process $\boldsymbol{X}_{t}$, i.e. $\boldsymbol{X}_{t}\sim\rho(t,\cdot)$, satisfies the Fokker-Planck equation (1).
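The correspondence between the SDE and equation (1) can be illustrated numerically. The following is a minimal sketch, not taken from the paper: it assumes the one-dimensional quadratic potential $V(x)=x^{2}/2$, for which the stationary density of (1) is proportional to $e^{-V/\beta}$, i.e. $\mathcal{N}(0,\beta)$, and it uses the Euler-Maruyama scheme. The function name `euler_maruyama`, the step size and the particle count are illustrative choices.

```python
import numpy as np

def euler_maruyama(grad_V, x0, beta, dt, n_steps, rng):
    """Simulate dX = -grad_V(X) dt + sqrt(2*beta) dB for a batch of particles."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - grad_V(x) * dt + np.sqrt(2.0 * beta * dt) * noise
    return x

rng = np.random.default_rng(0)
beta = 0.5
# Assumed initial law rho_0 = N(3, 4); any initial law with finite variance would do.
x0 = 3.0 + 2.0 * rng.standard_normal(20000)
# V(x) = x^2 / 2, so grad V(x) = x; integrate up to time t = 10.
samples = euler_maruyama(lambda x: x, x0, beta, dt=0.01, n_steps=1000, rng=rng)
sample_mean = samples.mean()
sample_var = samples.var()
```

Under these assumptions the empirical mean and variance of `samples` should be close to $0$ and $\beta$, up to time-discretization and sampling error.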

On the other hand, equation (1) is the Wasserstein gradient flow of the relative entropy. Denote the space of probability densities on $\mathbb{R}^{d}$ with finite second moment by:

$$\mathcal{P}=\Big\{\rho\colon\int\rho(x)\,dx=1,\ \rho(x)\geq 0,\ \int|x|^{2}\rho(x)\,dx<\infty\Big\}.$$

Equipped with the Wasserstein metric [6, 15], $\mathcal{P}$ is an infinite-dimensional Riemannian manifold. Denote its tangent space at $\rho$ by

$$T_{\rho}\mathcal{P}=\Big\{\dot{\rho}\colon\int\dot{\rho}(x)\,dx=0\Big\}.$$

Consider a specific $\rho\in\mathcal{P}$ and $\dot{\rho}_{i}\in T_{\rho}\mathcal{P}$, $i=1,2$. The Wasserstein metric tensor $g^{W}$ is defined as:

$$g^{W}(\rho)(\dot{\rho}_{1},\dot{\rho}_{2})=\int\nabla\psi_{1}(x)\cdot\nabla\psi_{2}(x)\,\rho(x)\,dx,$$

where $\dot{\rho}_{i}=-\nabla\cdot(\rho\nabla\psi_{i})$ for $i=1,2$. Here $g^{W}$ is a metric tensor, i.e. a positive definite bilinear form defined on the tangent bundle $T\mathcal{P}=\{(\rho,\dot{\rho})\colon\rho\in\mathcal{P},\ \dot{\rho}\in T_{\rho}\mathcal{P}\}$.

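The Riemannian gradient in $(\mathcal{P},g^{W})$ is given as follows. Consider a smooth functional $\mathcal{F}\colon\mathcal{P}\rightarrow\mathbb{R}$, then

$$\mathrm{grad}_{W}\mathcal{F}(\rho)=g^{W}(\rho)^{-1}\Big(\frac{\delta\mathcal{F}}{\delta\rho}\Big)(x)=-\nabla\cdot\Big(\rho(x)\nabla\frac{\delta}{\delta\rho(x)}\mathcal{F}(\rho)\Big),\tag{3}$$

where $\frac{\delta}{\delta\rho(x)}$ is the $L^{2}$ first variation at variable $x\in\mathbb{R}^{d}$. In particular, consider the relative entropy

$$\mathcal{F}(\rho)=\beta\int\rho(x)\log\frac{\rho(x)}{e^{-V(x)/\beta}}\,dx=\int V(x)\rho(x)\,dx+\beta\int\rho(x)\log\rho(x)\,dx.\tag{4}$$

Then $\nabla\left(\frac{\delta\mathcal{F}}{\delta\rho}\right)=\nabla V+\beta\nabla\log\rho$, and the gradient flow associated with (3) reads

$$\frac{\partial\rho}{\partial t}=-\mathrm{grad}_{W}\mathcal{F}(\rho)=\nabla\cdot(\rho\nabla V)+\beta\nabla\cdot(\rho\nabla\log\rho).$$

Since $\nabla\log\rho=\frac{\nabla\rho}{\rho}$, we have $\nabla\cdot(\rho\nabla\log\rho)=\nabla\cdot(\nabla\rho)=\Delta\rho$. The above equation is therefore exactly the Fokker-Planck equation (1).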
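The identity $\nabla\cdot(\rho\nabla\log\rho)=\Delta\rho$ used in the last step can be checked on a grid. The sketch below is only an illustration under assumptions that are not in the paper: a one-dimensional Gaussian test density with arbitrarily chosen mean 0.3 and variance 0.8, a uniform grid on $[-6,6]$, and central finite differences via `numpy.gradient`.

```python
import numpy as np

# Uniform 1D grid and an assumed Gaussian test density (mean 0.3, variance 0.8).
x = np.linspace(-6.0, 6.0, 2401)
dx = x[1] - x[0]
rho = np.exp(-0.5 * (x - 0.3) ** 2 / 0.8) / np.sqrt(2.0 * np.pi * 0.8)

def ddx(f):
    """First derivative by finite differences (central in the interior)."""
    return np.gradient(f, dx)

lhs = ddx(rho * ddx(np.log(rho)))  # div(rho * grad log rho) in 1D
rhs = ddx(ddx(rho))                # Laplacian of rho in 1D

# Skip a few boundary points, where np.gradient falls back to one-sided differences.
interior = slice(5, -5)
max_gap = np.max(np.abs(lhs[interior] - rhs[interior]))
```

The two discretized expressions should agree up to the finite-difference truncation error, so `max_gap` is expected to be small compared with the size of $\Delta\rho$.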

From now on, we apply the above geometric gradient flow formulation and derive the Fokker-Planck equation (1) on parameter space.

2.2 Parameter space equipped with Wasserstein metric

We consider a parameter space $\Theta$ as an open set in $\mathbb{R}^{m}$. Denote the sample space $M=\mathbb{R}^{d}$. Suppose $T_{\theta}$ is a pushforward map from $M$ to $M$, which is parametrized by $\theta$. For example, we can set $T_{\theta}(x)=Ux+b$, with $\theta=(U,b)$, $U\in GL_{d}(\mathbb{R})$, $b\in\mathbb{R}^{d}$; we can also let $T_{\theta}$ be a neural network with parameter $\theta$. We further assume that $T_{\theta}$ is invertible and smooth with respect to the parameter $\theta$ and the variable $x$.

Denote $p\in\mathcal{P}$ as a reference probability measure with positive density defined on $M$. For example, we can choose $p$ as the standard Gaussian. We denote $\rho_{\theta}$ as the density of ${T_{\theta}}_{\#}p$. (Footnote 1: Let $X,Y$ be two measurable spaces and $\lambda$ a probability measure defined on $X$; let $T:X\rightarrow Y$ be a measurable map. Then $T_{\#}\lambda$ is defined by $T_{\#}\lambda(E)=\lambda(T^{-1}(E))$ for all measurable $E\subset Y$. We call $T_{\#}p$ the pushforward of the measure $p$ by the map $T$.) We further require that $\int|T_{\theta}(x)|^{2}\,dp(x)<\infty$ holds for all $\theta\in\Theta$. Then $\rho_{\theta}\in\mathcal{P}$ for each $\theta\in\Theta$. Denote $\mathcal{P}_{\Theta}=\{\rho_{\theta}=\rho(\theta,x)\,|\,\theta\in\Theta\}$, then $\mathcal{P}_{\Theta}\subset\mathcal{P}$.
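In practice the pushforward is used through samples: if $\boldsymbol{X}\sim p$, then $T_{\theta}(\boldsymbol{X})\sim\rho_{\theta}$. The following sketch illustrates this for the affine map $T_{\theta}(x)=Ux+b$ with $p=\mathcal{N}(0,I)$, in which case $\rho_{\theta}=\mathcal{N}(b,UU^{T})$. The particular values of `U` and `b`, the dimension and the sample size are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 2, 50000
U = np.array([[2.0, 0.5],
              [0.0, 1.0]])        # assumed invertible matrix
b = np.array([1.0, -2.0])         # assumed shift

def T(x, U, b):
    """Affine pushforward map applied row-wise to samples of shape (n, d)."""
    return x @ U.T + b

x = rng.standard_normal((n, d))   # samples from the reference measure p = N(0, I)
y = T(x, U, b)                    # samples from rho_theta = T_theta # p

emp_mean = y.mean(axis=0)
emp_cov = np.cov(y, rowvar=False)
```

Here `emp_mean` and `emp_cov` should approximate $b$ and $UU^{T}$; no density evaluation of $\rho_{\theta}$ is needed to draw the samples.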

Now the connection between $\mathcal{P}$ and $\Theta$ is the pushforward operation $T_{\#}:\Theta\rightarrow\mathcal{P}_{\Theta}\subset\mathcal{P},\theta\mapsto\rho_{\theta}$ . In order to introduce the Wasserstein metric to parameter space $\Theta$ , we assume that $T_{\#}$ is an isometric immersion from $\Theta$ to $\mathcal{P}$ . Under this assumption, the pullback $(T_{\#})^{*}g^{W}$ of the Wasserstein metric $g^{W}$ by $T_{\#}$ is the metric tensor on $\Theta$ . Let us denote $G=(T_{\#})^{*}g^{W}$ . Then for each $\theta$ , $G(\theta)$ is a bilinear form on $T_{\theta}\Theta\simeq\mathbb{R}^{m}$ , thus $G(\theta)$ can be treated as an $m\times m$ matrix. Computation of $G(\theta)$ is illustrated in the following theorem:

Theorem 2.1

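Suppose $T_{\#}:\Theta\rightarrow\mathcal{P}$ is an isometric immersion from $\Theta$ to $\mathcal{P}$. Then the metric tensor $G(\theta)$ at $\theta\in\Theta$ is an $m\times m$ non-negative definite symmetric matrix and can be computed as:

$$G(\theta)=\int\nabla\boldsymbol{\Psi}(T_{\theta}(x))\nabla\boldsymbol{\Psi}(T_{\theta}(x))^{T}\,dp(x),$$

or in entrywise form:

$$G_{ij}(\theta)=\int\nabla\psi_{i}(T_{\theta}(x))\cdot\nabla\psi_{j}(T_{\theta}(x))\,dp(x),\quad 1\leq i,j\leq m.$$

Here $\boldsymbol{\Psi}=(\psi_{1},\dots,\psi_{m})^{T}$ and $\nabla\boldsymbol{\Psi}$ is the $m\times d$ Jacobian matrix of $\boldsymbol{\Psi}$. For each $k=1,2,\dots,m$, $\psi_{k}$ solves the following equation:

$$\nabla\cdot(\rho_{\theta}\nabla\psi_{k}(x))=\nabla\cdot(\rho_{\theta}\,\partial_{\theta_{k}}T_{\theta}(T_{\theta}^{-1}(x))).$$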

Proof

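Suppose $\xi\in T\Theta$ is a vector field on $\Theta$. For a fixed $\theta\in\Theta$, we first compute the pushforward $(T_{\#}|_{\theta})_{*}\xi(\theta)$ of $\xi$ at the point $\theta$. Choose any differentiable curve $\{\theta_{t}\}_{t\geq 0}$ on $\Theta$ with $\theta_{0}=\theta$ and $\dot{\theta}_{0}=\xi(\theta)$. If we denote $\rho_{\theta_{t}}={T_{\theta_{t}}}_{\#}p$, then $(T_{\#}|_{\theta})_{*}\xi(\theta)=\frac{\partial\rho_{\theta_{t}}}{\partial t}\Big|_{t=0}$. To compute this derivative, we consider for any $\phi\in C^{\infty}_{0}(M)$:

$$\begin{aligned}\int\phi(y)\frac{\partial\rho_{\theta_{t}}}{\partial t}(y)\,dy&=\int\dot{\theta}_{t}^{T}\,\partial_{\theta}T_{\theta_{t}}(T_{\theta_{t}}^{-1}(x))\,\nabla\phi(x)\,\rho_{\theta_{t}}(x)\,dx\\&=\int\phi(x)\Big(-\nabla\cdot\big(\rho_{\theta_{t}}\,\partial_{\theta}T_{\theta_{t}}(T_{\theta_{t}}^{-1}(x))^{T}\dot{\theta}_{t}\big)\Big)\,dx.\end{aligned}$$

The first equality follows from differentiating $\int\phi(T_{\theta_{t}}(x))\,dp(x)$ in $t$ and changing variables; the second is integration by parts. This weak formulation reveals that

$$(T_{\#}|_{\theta})_{*}\xi(\theta)=\frac{\partial\rho_{\theta_{t}}}{\partial t}\Big|_{t=0}=-\nabla\cdot\big(\rho_{\theta}\,\partial_{\theta}T_{\theta}(T_{\theta}^{-1}(x))^{T}\,\xi(\theta)\big).\tag{7}$$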

Now let us compute the metric tensor $G$. Since $T_{\#}$ is an isometric immersion from $\Theta$ to $\mathcal{P}$, the pullback of $g^{W}$ by $T_{\#}$ gives $G$, i.e. $(T_{\#})^{*}g^{W}=G$. By the definition of the pullback map, for any $\xi\in T\Theta$ and for any $\theta\in\Theta$, we have:

$$G(\theta)(\xi(\theta),\xi(\theta))=g^{W}(\rho_{\theta})\big((T_{\#}|_{\theta})_{*}\xi(\theta),(T_{\#}|_{\theta})_{*}\xi(\theta)\big).\tag{8}$$

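To compute the right-hand side of (8), recall the definition of $g^{W}$ in Section 2.1: we need to solve for $\varphi$ from

$$\frac{\partial\rho_{\theta_{t}}}{\partial t}\Big|_{t=0}=-\nabla\cdot(\rho_{\theta}\nabla\varphi(x)).\tag{9}$$

By (7), (9) is:

$$\nabla\cdot(\rho_{\theta}\nabla\varphi(x))=\nabla\cdot\big(\rho_{\theta}\,\partial_{\theta}T_{\theta}(T_{\theta}^{-1}(\cdot))^{T}\xi(\theta)\big).\tag{10}$$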

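We can straightforwardly check that $\varphi(x)=\boldsymbol{\Psi}^{T}(x)\xi(\theta)$ is the solution of (10). Then $G(\theta)$ is computed as:

$$G(\theta)(\xi,\xi)=\int|\nabla\boldsymbol{\Psi}(T_{\theta}(x))^{T}\xi|^{2}\,dp(x)=\xi^{T}\Big(\int\nabla\boldsymbol{\Psi}(T_{\theta}(x))\nabla\boldsymbol{\Psi}(T_{\theta}(x))^{T}\,dp(x)\Big)\xi.$$

Thus we can verify that:

$$G(\theta)=\int\nabla\boldsymbol{\Psi}(T_{\theta}(x))\nabla\boldsymbol{\Psi}(T_{\theta}(x))^{T}\,dp(x).$$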

Generally speaking, the metric tensor $G$ doesn’t have an explicit form when $d\geq 2$ ; but for $d=1$ , $G$ has an explicit form and can be computed directly.

Corollary 1

Suppose the dimension $d$ of $M$ equals 1, and further assume that $\rho_{\theta}>0$ on $M$ and $\lim_{x\rightarrow\pm\infty}\rho_{\theta}(x)=0$. Then $G(\theta)$ has an explicit form:

$$G(\theta)=\int\partial_{\theta}T_{\theta}(x)^{T}\,\partial_{\theta}T_{\theta}(x)\,dp(x).\tag{11}$$
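Because the one-dimensional formula $G(\theta)=\int\partial_{\theta}T_{\theta}(x)^{T}\partial_{\theta}T_{\theta}(x)\,dp(x)$ is an expectation under $p$, it can be estimated from samples. The sketch below assumes the affine family $T_{\theta}(x)=\theta_{1}x+\theta_{2}$ and $p=\mathcal{N}(0,1)$, so that $\partial_{\theta}T_{\theta}(x)=(x,1)$ and the exact metric is the $2\times 2$ identity matrix. The helper names `metric_tensor_1d` and `jac_affine` are hypothetical.

```python
import numpy as np

def metric_tensor_1d(jac_T, x):
    """Monte Carlo estimate of G = E_p[J^T J], where each row of J is dT/dtheta at a sample."""
    J = jac_T(x)                      # shape (n_samples, m)
    return J.T @ J / x.shape[0]

def jac_affine(x):
    """Jacobian of T_theta(x) = theta_1 * x + theta_2 with respect to (theta_1, theta_2)."""
    return np.stack([x, np.ones_like(x)], axis=1)

rng = np.random.default_rng(2)
x = rng.standard_normal(200000)       # samples from p = N(0, 1)
G = metric_tensor_1d(jac_affine, x)
```

For this family the Jacobian does not depend on $\theta$, so `G` is constant; for a general $T_{\theta}$ the function passed as `jac_T` would also take the current $\theta$ into account.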

The following theorem ensures the positive definiteness of the metric tensor $G$ :

Theorem 2.2

We follow the notation and conditions of Sections 2.2 and 2.3. Then $G$ is a Riemannian metric on $T\Theta$ if and only if for each $\theta\in\Theta$ and any $\xi\in T_{\theta}\Theta$ with $\xi\neq 0$, we can find $x\in M$ such that $\nabla\cdot(\rho_{\theta}\,\partial_{\theta}T_{\theta}(T_{\theta}^{-1}(x))\xi)\neq 0$.

From now on, following [9, 10], we call $(\Theta,G)$ a Wasserstein statistical manifold.

2.3 Fokker-Planck equation on statistical manifold

Recall the relative entropy functional $\mathcal{F}$ defined in (4). We consider $F=\mathcal{F}\circ T_{\#}:\Theta\rightarrow\mathbb{R}$. Then:

$$F(\theta)=\mathcal{F}(\rho_{\theta})=\int V(x)\rho_{\theta}(x)\,dx+\beta\int\rho_{\theta}(x)\log\rho_{\theta}(x)\,dx.\tag{12}$$

As in [1], the gradient flow of $F$ on the Wasserstein statistical manifold $(\Theta,G)$ satisfies

$$\dot{\theta}=-G(\theta)^{-1}\nabla_{\theta}F(\theta).\tag{13}$$

We call (13) the parametric Fokker-Planck equation. The ODE (13), as the Wasserstein gradient flow on parameter space $(\Theta,G)$, is closely related to the Fokker-Planck equation on the probability submanifold $\mathcal{P}_{\Theta}$. We have the following theorem, which is a natural result derived from submanifold geometry:

Theorem 2.3

Suppose $\{\theta_{t}\}_{t\geq 0}$ solves (13). Then $\{\rho_{\theta_{t}}\}$ is the gradient flow of $\mathcal{F}$ on probability submanifold $\mathcal{P}_{\Theta}$ .

3 Example on Fokker-Planck equations with quadratic potential

The solution of Fokker-Planck equation on statistical manifold (13) can serve as an approximation to the solution of the original equation (1). However, in some special cases, $\rho_{\theta_{t}}$ exactly solves (1). In this section, we demonstrate such examples.

Let us consider Fokker-Planck equations with quadratic potentials whose initial conditions are Gaussian, i.e.

$$V(x)=\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\quad\text{and}\quad\rho_{0}\sim\mathcal{N}(\mu_{0},\Sigma_{0}).\tag{14}$$

Consider the parameter space $\Theta=\{(\Gamma,b)\}\subset\mathbb{R}^{m}$ ($m=d(d+1)$), where $\Gamma$ is a $d\times d$ invertible matrix with $\det(\Gamma)>0$ and $b\in\mathbb{R}^{d}$. We define the parametric map as $T_{\theta}(x)=\Gamma x+b$. We choose the reference measure $p=\mathcal{N}(0,I)$. We will use the following lemma:

Lemma 1

Let $\mathcal{F}$ be the relative entropy defined in (4) and $F$ be defined in (12). For $\theta\in\Theta$, suppose the vector function $\nabla\left(\frac{\delta\mathcal{F}}{\delta\rho}\right)\circ T_{\theta}$ can be written as a linear combination of $\{\frac{\partial T_{\theta}}{\partial\theta_{1}},\dots,\frac{\partial T_{\theta}}{\partial\theta_{m}}\}$, i.e. there exists $\zeta\in\mathbb{R}^{m}$ such that $\nabla\left(\frac{\delta\mathcal{F}}{\delta\rho}\right)\circ T_{\theta}(x)=\partial_{\theta}T_{\theta}(x)\zeta$. Then:

1) $\zeta=G(\theta)^{-1}\nabla_{\theta}F(\theta)$, which is the Wasserstein gradient of $F$ at $\theta$.
2) If we denote the gradient of $\mathcal{F}$ on $\mathcal{P}$ as $\mathrm{grad}\mathcal{F}(\rho_{\theta})$ and the gradient of $\mathcal{F}$ on the submanifold $\mathcal{P}_{\Theta}$ as $\mathrm{grad}\mathcal{F}(\rho_{\theta})|_{\mathcal{P}_{\Theta}}$, then $\mathrm{grad}\mathcal{F}(\rho_{\theta})|_{\mathcal{P}_{\Theta}}=\mathrm{grad}\mathcal{F}(\rho_{\theta})$.

Proof

The detailed proof is provided in [8]. Here is an intuitive explanation: $\nabla\left(\frac{\delta\mathcal{F}}{\delta\rho}\right)=\nabla V+\beta\nabla\log\rho_{\theta}$ is the true vector field that moves the particles in the Fokker-Planck equation, and $\partial_{\theta}T_{\theta}(T_{\theta}^{-1}(\cdot))\dot{\theta}$ is the approximate vector field induced by the pushforward map $T_{\theta}$. If this approximation is exact, i.e. there exists $\zeta$ such that $\nabla\left(\frac{\delta\mathcal{F}}{\delta\rho}\right)\circ T_{\theta}(x)=\partial_{\theta}T_{\theta}(x)\zeta$, then $\zeta=G(\theta)^{-1}\nabla_{\theta}F(\theta)$ and the submanifold gradient agrees with the gradient on the entire manifold.

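Now let us return to our example. We can compute

$$\rho_{\theta}(x)={T_{\theta}}_{\#}p(x)=\frac{f(T_{\theta}^{-1}(x))}{|\det(\Gamma)|}=\frac{f(\Gamma^{-1}(x-b))}{|\det(\Gamma)|},\quad f(x)=\frac{\exp(-\frac{1}{2}|x|^{2})}{(2\pi)^{d/2}}.$$

Then the vector field

$$\nabla\Big(\frac{\delta\mathcal{F}(\rho_{\theta})}{\delta\rho}\Big)\circ T_{\theta}(x)=\nabla(V+\beta\log\rho_{\theta})\circ T_{\theta}(x)=\Sigma^{-1}(\Gamma x+b-\mu)-\beta\Gamma^{-T}x$$

is affine with respect to $x$.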

Notice that $\partial_{\Gamma_{ij}}T_{\theta}(x)=(\dots,0,\underset{i\text{-th}}{x_{j}},0,\dots)^{T}$ and $\partial_{b_{i}}T_{\theta}=(\dots,0,\underset{i\text{-th}}{1},0,\dots)^{T}$. We can verify that $\zeta=(\Sigma^{-1}\Gamma-\beta\Gamma^{-T},\Sigma^{-1}(b-\mu))$ solves $\nabla\left(\frac{\delta\mathcal{F}(\rho_{\theta})}{\delta\rho}\right)\circ T_{\theta}(x)=\partial_{\theta}T_{\theta}(x)\zeta$. By 1) of Lemma 1, $\zeta=G(\theta)^{-1}\nabla_{\theta}F(\theta)$. Thus ODE (13), $\dot{\theta}=-\zeta$, for our example is:

$$\dot{\Gamma}=-\Sigma^{-1}\Gamma+\beta\Gamma^{-T},\tag{15}$$

$$\dot{b}=\Sigma^{-1}(\mu-b).\tag{16}$$

By 2) of Lemma 1, we know $\mathrm{grad}\mathcal{F}(\rho_{\theta})|_{\mathcal{P}_{\Theta}}=\mathrm{grad}\mathcal{F}(\rho_{\theta})$ for all $\theta\in\Theta$. This indicates that our approximation has no local error, and one can verify that the solution to the parametric Fokker-Planck equation also solves the original equation.
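The claim that the parametric flow reproduces the true solution can be checked in one dimension. With $\zeta=(\Sigma^{-1}\Gamma-\beta\Gamma^{-T},\Sigma^{-1}(b-\mu))$ and $\dot{\theta}=-\zeta$, the scalar case $\Sigma=\sigma^{2}$ gives $\dot{\Gamma}=-\Gamma/\sigma^{2}+\beta/\Gamma$ and $\dot{b}=(\mu-b)/\sigma^{2}$. The sketch below integrates this system by forward Euler and compares $\Gamma_{t}^{2}$ and $b_{t}$ with the variance and mean of the corresponding Ornstein-Uhlenbeck process, which follow from the standard moment equations of (1). All numerical values and the function name `gaussian_flow_1d` are assumptions for illustration.

```python
import numpy as np

def gaussian_flow_1d(gamma0, b0, mu, sigma2, beta, dt, n_steps):
    """Forward Euler for gamma' = -gamma/sigma2 + beta/gamma, b' = (mu - b)/sigma2."""
    gamma, b = gamma0, b0
    for _ in range(n_steps):
        dgamma = -gamma / sigma2 + beta / gamma
        db = (mu - b) / sigma2
        gamma, b = gamma + dt * dgamma, b + dt * db
    return gamma, b

# Assumed problem data: V(x) = (x - mu)^2 / (2 sigma2), rho_0 = N(b0, gamma0^2).
mu, sigma2, beta = 1.0, 2.0, 0.5
gamma0, b0 = 3.0, -1.0
t_end, dt = 2.0, 1e-4
gamma_t, b_t = gaussian_flow_1d(gamma0, b0, mu, sigma2, beta, dt, int(round(t_end / dt)))

# Exact mean and variance of the Fokker-Planck solution at time t_end.
var_exact = beta * sigma2 + (gamma0 ** 2 - beta * sigma2) * np.exp(-2.0 * t_end / sigma2)
mean_exact = mu + (b0 - mu) * np.exp(-t_end / sigma2)
```

Up to the forward Euler discretization error, `gamma_t ** 2` should match `var_exact` and `b_t` should match `mean_exact`, consistent with the statement that there is no approximation error in the Gaussian case.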

In addition to previous results, we have the following corollary:

Corollary 2

The solution of the Fokker-Planck equation (1) with condition (14) is a Gaussian distribution for all $t>0$.

Proof

Denote by $\{\Gamma_{t},b_{t}\}$ the solution to (15), (16) and set $\theta_{t}=(\Gamma_{t},b_{t})$. Then $\rho_{t}={T_{\theta_{t}}}_{\#}p$ solves the Fokker-Planck equation (1) with condition (14). Since the pushforward of the Gaussian distribution $p$ by an affine transform $T_{\theta}$ is still Gaussian, we conclude that for any $t>0$ the solution $\rho_{t}={T_{\theta_{t}}}_{\#}p$ is a Gaussian distribution. This is a well-known result about the Fokker-Planck equation; we reprove it within our framework.

4 Numerical examples for 1D Fokker-Planck equation

Since the Wasserstein metric tensor $G$ has an explicit form when the dimension $d=1$, it is convenient to solve ODE (13) numerically.

For example, we can choose a family of basis functions $\{\varphi_{k}\}_{k=1}^{m}$. Each $\varphi_{k}$ can be chosen as a sinusoidal function or a piecewise linear function defined on a certain interval $[-l,l]$. It is also beneficial to choose orthogonal or near-orthogonal basis functions, because they keep the metric tensor $G$ well conditioned. We set $T_{\theta}(x)=\sum_{k=1}^{m}\theta_{k}\varphi_{k}(x)$. (Footnote 2: In applications, a carefully chosen $T_{\theta}$ that is not necessarily invertible or smooth can still provide valid results.) Then according to (11), we can compute $G$ as

$$G_{ij}(\theta)=\mathbb{E}_{\boldsymbol{X}\sim p}\big[\varphi_{i}(\boldsymbol{X})\varphi_{j}(\boldsymbol{X})\big],\quad 1\leq i,j\leq m.$$

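Recall that $F(\theta)=\int V(x)\rho_{\theta}(x)\,dx+\beta\int\rho_{\theta}(x)\log\rho_{\theta}(x)\,dx$. The second part of $F$ is the negative entropy of $\rho_{\theta}$ scaled by $\beta$, which can be computed by solving the following optimization problem [4]:

$$\int\rho_{\theta}(x)\log\rho_{\theta}(x)\,dx=\sup_{h}\Big\{\int h(x)\rho_{\theta}(x)\,dx-\int e^{h(x)}\,dx\Big\}+1.\tag{17}$$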
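The variational formula $\int\rho\log\rho\,dx=\sup_{h}\{\int h\rho\,dx-\int e^{h}\,dx\}+1$ is attained at $h^{*}=\log\rho$, since the pointwise optimality condition is $\rho=e^{h}$. The following sketch checks this on a grid for a standard Gaussian density; the grid, the Riemann-sum quadrature and the perturbation $0.3\cos(x)$ are assumptions made only for this illustration.

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]
rho = np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)   # assumed test density N(0, 1)

def integrate(f):
    """Riemann sum on the uniform grid."""
    return np.sum(f) * dx

def dual_objective(h):
    """Objective inside the supremum: int h * rho dx - int exp(h) dx."""
    return integrate(h * rho) - integrate(np.exp(h))

neg_entropy = integrate(rho * np.log(rho))           # left-hand side of the formula
h_star = np.log(rho)                                 # candidate maximizer
at_optimum = dual_objective(h_star) + 1.0
perturbed = dual_objective(h_star + 0.3 * np.cos(x)) + 1.0
```

The value `at_optimum` should coincide with `neg_entropy` (which for $\mathcal{N}(0,1)$ equals $-\frac{1}{2}\log(2\pi e)$), while any perturbation of $h^{*}$, such as the one in `perturbed`, gives a strictly smaller value.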

We can solve (17) by parametrizing $h$. Suppose the optimal solution is $h^{*}$. Then, by the envelope theorem, $\nabla_{\theta}F(\theta)$ can be computed as

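$$\nabla_{\theta}F(\theta)=\mathbb{E}_{\boldsymbol{X}\sim p}\big[\partial_{\theta}T_{\theta}(\boldsymbol{X})^{T}\,(\nabla V+\beta\nabla h^{*})(T_{\theta}(\boldsymbol{X}))\big].$$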

Notice that both the metric tensor $G$ and $\nabla_{\theta}F(\theta)$ are written as expectations, so we can compute them by Monte Carlo simulation. Finally, (13) can be integrated by the forward Euler method.
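The overall procedure (Monte Carlo estimates of $G$ and $\nabla_{\theta}F$, followed by forward Euler steps on (13)) can be sketched as follows. This is a simplified illustration and not the authors' implementation: it assumes the two-parameter affine map $T_{\theta}(x)=\theta_{1}x+\theta_{2}$ with $\theta_{1}>0$, the double-well potential $V(x)=(x+1)^{2}(x-1)^{2}$ and $\beta=1/4$, and it replaces the dual problem (17) by the closed-form entropy of an affine pushforward, $\int\rho_{\theta}\log\rho_{\theta}\,dx=\mathbb{E}_{p}[\log p]-\log\theta_{1}$. The function names are hypothetical.

```python
import numpy as np

def V(y):
    return (y ** 2 - 1.0) ** 2

def dV(y):
    return 4.0 * y * (y ** 2 - 1.0)

def free_energy(theta, x, beta):
    """Sample estimate of F(theta), up to the theta-independent constant beta * E[log p(X)]."""
    scale, shift = theta
    return np.mean(V(scale * x + shift)) - beta * np.log(scale)

def natural_gradient_step(theta, x, beta, dt):
    """One forward Euler step of theta' = -G(theta)^{-1} grad F(theta)."""
    scale, shift = theta
    J = np.stack([x, np.ones_like(x)], axis=1)        # rows: dT/dtheta at each sample
    G = J.T @ J / x.shape[0]                          # Monte Carlo metric tensor
    grad = J.T @ dV(scale * x + shift) / x.shape[0]   # gradient of the potential term
    grad[0] -= beta / scale                           # gradient of -beta * log(scale)
    return theta - dt * np.linalg.solve(G, grad)

rng = np.random.default_rng(3)
x = rng.standard_normal(20000)                        # fixed samples from p = N(0, 1)
beta, dt = 0.25, 0.01
theta = np.array([1.0, 0.0])                          # rho_0 = N(0, 1)
energies = [free_energy(theta, x, beta)]
for _ in range(300):
    theta = natural_gradient_step(theta, x, beta, dt)
    energies.append(free_energy(theta, x, beta))
```

With a small enough step the estimated free energy in `energies` should decrease along the iteration. An affine family cannot represent the bimodal Gibbs distribution of this potential, so the sketch only illustrates the mechanics; a richer basis $\{\varphi_{k}\}$ would be needed to approximate the actual solution.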

Our numerical results are always demonstrated by sample points: for each time node $t$, we sample points $\{\mathbf{X}_{1},\dots,\mathbf{X}_{N}\}$ from $p$; then $\{T_{\theta_{t}}(\mathbf{X}_{1}),\dots,T_{\theta_{t}}(\mathbf{X}_{N})\}$ are our numerical samples from the distribution $\rho_{t}$ which solves the Fokker-Planck equation.

Here are several numerical results based on our method, exhibited as histograms. Consider the potential $V(x)=(x+1)^{2}(x-1)^{2}$ and suppose the initial distribution is $\rho_{0}=\mathcal{N}(0,I)$. Figure 1 contains histograms of $\rho_{t}$, which solves $\frac{\partial\rho}{\partial t}=\nabla\cdot(\rho\nabla V)$, at different time nodes; $\rho_{t}$ converges to $\frac{\delta_{-1}+\delta_{+1}}{2}$ as $t\rightarrow\infty$, where $\delta_{a}$ is the Dirac distribution concentrated at the point $a$. Figure 2 contains histograms of $\rho_{t}$, which solves $\frac{\partial\rho}{\partial t}=\nabla\cdot(\rho\nabla V)+\frac{1}{4}\Delta\rho$, at different time nodes; $\rho_{t}$ converges to the Gibbs distribution $\rho_{*}=\frac{1}{Z}\exp(-4(x+1)^{2}(x-1)^{2})$, with $Z$ a normalizing constant, as $t\rightarrow\infty$. The density function of $\rho_{*}$ is exhibited in Figure 2.

5 Discussion

We presented a new approach for approximating Fokker-Planck equations by parameterized pushforward mapping functions. Compared to the classical moment method and MCMC methods, we propose a systematic way of obtaining a finite-dimensional ODE on parameter space. The ODE represents the evolution of the statistical information conveyed by the original Fokker-Planck equation. In the future, we will study its geometric and statistical properties, and derive practical numerical methods for applications in scientific computing and machine learning.

Acknowledgement This project has received funding from AFOSR MURI FA9550-18-1-0502 and NSF Awards DMS–1419027, DMS-1620345, and ONR Award N000141310408.

Bibliography

[1] S. Amari. Natural Gradient Works Efficiently in Learning. Neural Computation, 10(2):251–276, 1998.
[2] S. Amari. Information Geometry and Its Applications. Number 194 in Applied Mathematical Sciences. Springer, Japan, 2016.
[3] N. Ay, J. Jost, H. V. Lê, and L. J. Schwachhöfer. Information Geometry. Ergebnisse der Mathematik und ihrer Grenzgebiete, 3. Folge / A Series of Modern Surveys in Mathematics, Volume 64. Springer, Cham, 2017.
[4] M. Essid, D. Laefer, and E. G. Tabak. Adaptive Optimal Transport. arXiv:1807.00393 [math], 2018.
[5] R. Jordan, D. Kinderlehrer, and F. Otto. The Variational Formulation of the Fokker–Planck Equation. SIAM Journal on Mathematical Analysis, 29(1):1–17, 1998.
[6] J. D. Lafferty. The Density Manifold and Configuration Space Quantization. Transactions of the American Mathematical Society, 305(2):699–741, 1988.
[7] W. Li. Geometry of probability simplex via optimal transport. arXiv:1803.06360 [math], 2018.
[8] W. Li, S. Liu, H. Zha, and H. Zhou. Scientific computing via parametric Fokker-Planck equations. In preparation, 2019.