Fisher information regularization schemes for Wasserstein gradient flows

Wuchen Li; Jianfeng Lu; Li Wang

arXiv:1907.02152·math.NA·July 15, 2020


TL;DR

This paper introduces a Fisher information regularized variational scheme for Wasserstein gradient flows, improving convexity, stability, and computational efficiency, with applications to various PDEs.

Contribution

It develops a novel regularization approach based on Fisher information within the Jordan–Kinderlehrer–Otto framework, enhancing numerical stability and efficiency.

Findings

01

Improves convexity and stability of Wasserstein gradient flow computations.

02

Reduces computational cost by eliminating the need for additional time interpolation.

03

Demonstrates effectiveness on multiple PDE examples, including porous media and nonlinear Fokker-Planck equations.

Abstract

We propose a variational scheme for computing Wasserstein gradient flows. The scheme builds upon the Jordan–Kinderlehrer–Otto framework with the Benamou-Brenier's dynamic formulation of the quadratic Wasserstein metric and adds a regularization by the Fisher information. This regularization can be derived in terms of energy splitting and is closely related to the Schrödinger bridge problem. It improves the convexity of the variational problem and automatically preserves the non-negativity of the solution. As a result, it allows us to apply sequential quadratic programming to solve the sub-optimization problem. We further save the computational cost by showing that no additional time interpolation is needed in the underlying dynamic formulation of the Wasserstein-2 metric, and therefore, the dimension of the problem is vastly reduced. Several numerical examples, including porous…

Equations

\partial_{t}\rho=-\nabla\cdot(\rho v):=\nabla\cdot\big[\rho\nabla\big(U'(\rho)+V+W*\rho\big)\big],\qquad\rho(0,\cdot)=\rho_{0},

\nabla_{\mathcal{W}_{2}}\mathcal{E}(\rho)=-\nabla\cdot(\rho\nabla\delta\mathcal{E}),

\mathcal{E}(\rho)=\int_{\Omega}\big[U(\rho(x))+V(x)\rho(x)\big]\mathrm{d}x+\frac{1}{2}\int_{\Omega\times\Omega}W(x-y)\rho(x)\rho(y)\,\mathrm{d}x\,\mathrm{d}y.

\rho^{0}=\rho_{0},\qquad\rho^{k+1}=\arg\min_{\rho\in\mathbb{K}}\big\{\mathcal{W}_{2}^{2}(\rho,\rho^{k})+2\tau\mathcal{E}(\rho)\big\},

\begin{cases}\mathcal{W}_{2}(\rho_{0},\rho_{1})=\inf_{\rho,v}\Big\{\int_{0}^{1}\int_{\Omega}|v(t,x)|^{2}\rho(t,x)\,\mathrm{d}x\,\mathrm{d}t\Big\}^{1/2},\\ \textrm{s.t.}\quad\partial_{t}\rho+\nabla\cdot(\rho v)=0,\quad(\rho v)\cdot\nu=0~\textrm{on}~\partial\Omega\times[0,1],\\ \rho(0,x)=\rho_{0}(x),\quad\rho(1,x)=\rho_{1}(x),\end{cases}

\begin{cases}(\rho,m)=\arg\inf_{\rho,m}\int_{0}^{1}\int_{\Omega}F(\rho,m)\,\mathrm{d}x\,\mathrm{d}t+2\tau\mathcal{E}(\rho(1,\cdot))\\ \textrm{s.t.}\quad\partial_{t}\rho+\nabla\cdot m=0,\quad\rho(0,x)=\rho^{k}(x),\quad m\cdot\nu=0,\end{cases}

F(\rho,m)=\begin{cases}\frac{\|m\|^{2}}{\rho}&\textrm{if }\rho>0,\\ 0&\textrm{if }(\rho,m)=(0,0),\\ +\infty&\textrm{otherwise.}\end{cases}

\begin{dcases}\rho^{k+1}(x)=\arg\inf_{\rho,m}\int_{\Omega}\Big(\frac{\|m(x)\|^{2}}{\rho(x)}+\beta^{-2}\tau^{2}\|\nabla\log\rho(x)\|^{2}\rho(x)\Big)\mathrm{d}x+2\tau\mathcal{E}(\rho)\\ \textrm{s.t.}\quad\rho(x)-\rho^{k}(x)+\nabla\cdot m(x)=0,\quad m\cdot\nu=0.\end{dcases}

\mathcal{P}(\Omega)=\Big\{\rho\in L^{1}(\Omega)\colon\int_{\Omega}\rho(x)\,\mathrm{d}x=1,~\rho(x)\geq 0\Big\}.

\textrm{SBP}(\rho^{0},\rho^{1})=\inf_{\rho,b}\int_{0}^{1}\int_{\Omega}\|b(t,x)\|^{2}\rho(t,x)\,\mathrm{d}x\,\mathrm{d}t,

\partial_{t}\rho(t,x)+\nabla\cdot(\rho(t,x)b(t,x))=\beta^{-1}\tau\Delta\rho(t,x),

\rho(0,x)=\rho^{0}(x),\quad\rho(1,x)=\rho^{1}(x),\quad x\in\Omega,

\textrm{SBP}(\rho^{0},\rho^{1})=\inf_{\rho,m}\int_{0}^{1}\int_{\Omega}\Big(\frac{\|m\|^{2}}{\rho}+\beta^{-2}\tau^{2}\rho\|\nabla\delta\mathcal{H}\|^{2}\Big)\mathrm{d}x\,\mathrm{d}t+2\beta^{-1}\tau\big(\mathcal{H}(\rho^{1})-\mathcal{H}(\rho^{0})\big)

\partial_{t}\rho(t,x)+\nabla\cdot m(t,x)=0,

\rho(0,x)=\rho^{0}(x),\quad\rho(1,x)=\rho^{1}(x),\quad x\in\Omega,\qquad m\cdot\nu=0,\quad(t,x)\in[0,1]\times\partial\Omega.

0=\partial_{t}\rho+\nabla\cdot(\rho b)-\beta^{-1}\tau\Delta\rho=\partial_{t}\rho+\nabla\cdot\big(\rho(b-\beta^{-1}\tau\nabla\delta\mathcal{H})\big).

\begin{aligned}\int_{0}^{1}\int_{\Omega}\|b(t,x)\|^{2}\rho(t,x)\,\mathrm{d}x\,\mathrm{d}t&=\int_{0}^{1}\int_{\Omega}\|v(t,x)+\beta^{-1}\tau\nabla\delta\mathcal{H}(\rho)(t,x)\|^{2}\rho(t,x)\,\mathrm{d}x\,\mathrm{d}t\\ &=\int_{0}^{1}\int_{\Omega}\Big\{\|v\|^{2}\rho+\beta^{-2}\tau^{2}\|\nabla\delta\mathcal{H}\|^{2}\rho+2\beta^{-1}\tau\rho v\cdot\nabla\delta\mathcal{H}\Big\}\mathrm{d}x\,\mathrm{d}t\\ &=\int_{0}^{1}\int_{\Omega}\Big\{\frac{\|m\|^{2}}{\rho}+\beta^{-2}\tau^{2}\|\nabla\delta\mathcal{H}\|^{2}\rho+2\beta^{-1}\tau m\cdot\nabla\delta\mathcal{H}\Big\}\mathrm{d}x\,\mathrm{d}t.\end{aligned}

\begin{aligned}\int_{0}^{1}\int_{\Omega}m\cdot\nabla\delta\mathcal{H}\,\mathrm{d}x\,\mathrm{d}t&=-\int_{0}^{1}\int_{\Omega}\delta\mathcal{H}\,\nabla\cdot m\,\mathrm{d}x\,\mathrm{d}t\qquad\textrm{(integration by parts w.r.t. }x\textrm{)}\\ &=\int_{0}^{1}\int_{\Omega}\delta\mathcal{H}\,\partial_{t}\rho\,\mathrm{d}x\,\mathrm{d}t\\ &=\int_{0}^{1}\frac{\mathrm{d}}{\mathrm{d}t}\mathcal{H}(\rho)\,\mathrm{d}t=\mathcal{H}(\rho^{1})-\mathcal{H}(\rho^{0}),\end{aligned}

\mathcal{I}(\rho)=\int_{\Omega}\|\nabla\delta\mathcal{H}\|^{2}\rho(x)\,\mathrm{d}x=\int_{\Omega}\|\nabla\log\rho(x)\|^{2}\rho(x)\,\mathrm{d}x

\mathcal{E}(\rho)=\big(\mathcal{E}(\rho)-\beta^{-1}\mathcal{H}(\rho)\big)+\beta^{-1}\mathcal{H}(\rho):=\mathcal{E}_{1}(\rho)+\mathcal{E}_{2}(\rho),

\begin{cases}(\rho,m)=\arg\inf_{\rho,m}\int_{0}^{1}\int_{\Omega}\frac{\|m(t,x)\|^{2}}{\rho(t,x)}\,\mathrm{d}x\,\mathrm{d}t+2\tau\mathcal{E}_{1}(\rho(1,\cdot))\\ \textrm{s.t.}\quad\partial_{t}\rho+\nabla\cdot m=\tau\nabla\cdot(\rho\nabla\delta\mathcal{E}_{2}(\rho))=\tau\beta^{-1}\Delta\rho,\quad\rho(0,x)=\rho^{k}(x),\quad(m-\tau\beta^{-1}\nabla\rho)\cdot\nu=0.\end{cases}

\mathcal{L}_{1}(\rho,m,\phi)=\int_{0}^{1}\int_{\Omega}\Big(\frac{\|m\|^{2}}{\rho}-\rho\partial_{t}\phi-m\cdot\nabla\phi\Big)\mathrm{d}x\,\mathrm{d}t+\int_{\Omega}\rho\phi\big|_{t=0}^{t=1}\mathrm{d}x+2\tau\mathcal{E}(\rho(1,\cdot)),

-\frac{\|m\|^{2}}{\rho^{2}}-\partial_{t}\phi=0,\quad\frac{2m}{\rho}-\nabla\phi=0,\quad\phi(1,x)+2\tau\delta\mathcal{E}(\rho(1,\cdot))=0.

m(1,x)=-\tau\rho(1,x)\nabla\delta\mathcal{E}(\rho(1,\cdot)).

\mathcal{L}_{2}(\rho,m,\phi)=\int_{0}^{1}\int_{\Omega}\Big(\frac{\|m\|^{2}}{\rho}-\rho\partial_{t}\phi-m\cdot\nabla\phi+\tau\rho\nabla\phi\cdot\nabla\delta\mathcal{E}_{2}\Big)\mathrm{d}x\,\mathrm{d}t+\int_{\Omega}\rho\phi\big|_{t=0}^{t=1}\mathrm{d}x+2\tau\mathcal{E}_{1}(\rho(1,\cdot)),

\frac{2m}{\rho}-\nabla\phi=0,\quad\phi(1,x)+2\tau\delta\mathcal{E}_{2}(\rho(1,\cdot))=0.

\int_{0}^{1}\int_{\Omega}\Big(\frac{\|\tilde{m}\|^{2}}{\rho}+\tau^{2}\beta^{-2}\frac{\|\nabla\rho\|^{2}}{\rho}+2\tau\beta^{-1}\frac{\tilde{m}\cdot\nabla\rho}{\rho}\Big)\mathrm{d}t\,\mathrm{d}x+2\tau\mathcal{E}_{2}(\rho(1,\cdot))

\begin{cases}\rho^{k+1}(x)=\arg\inf_{m,\rho}\int_{0}^{1}\int_{\Omega}\Big(\frac{\|m(t,x)\|^{2}}{\rho(t,x)}+\beta^{-2}\tau^{2}\rho(t,x)\|\nabla\delta\mathcal{H}(t,x)\|^{2}\Big)\mathrm{d}t\,\mathrm{d}x+2\tau\mathcal{E}(\rho(1,\cdot))\\ \textrm{s.t.}\quad\partial_{t}\rho+\nabla\cdot m=0,\quad\rho(0,x)=\rho^{k}(x),\quad m\cdot\nu=0.\end{cases}


Full text

Fisher information regularization schemes for Wasserstein gradient flows

Wuchen Li

Mathematics Department, University of California, Los Angeles 90095

Jianfeng Lu

Departments of Mathematics, Physics, and Chemistry, Duke University, Box 90320, Durham, NC 27708

Li Wang

School of Mathematics, University of Minnesota, Twin Cities, MN 55455

Abstract.

We propose a variational scheme for computing Wasserstein gradient flows. The scheme builds upon the Jordan–Kinderlehrer–Otto framework with Benamou–Brenier's dynamic formulation of the quadratic Wasserstein metric and adds a regularization by the Fisher information. This regularization can be derived in terms of energy splitting and is closely related to the Schrödinger bridge problem. It improves the convexity of the variational problem and automatically preserves the non-negativity of the solution. As a result, it allows us to apply sequential quadratic programming to solve the sub-optimization problem. We further reduce the computational cost by showing that no additional time interpolation is needed in the underlying dynamic formulation of the Wasserstein-2 metric, and therefore the dimension of the problem is vastly reduced. Several numerical examples, including the porous medium equation, the nonlinear Fokker-Planck equation, the aggregation diffusion equation, and the Derrida-Lebowitz-Speer-Spohn equation, are provided. These examples demonstrate the simplicity and stability of the proposed scheme.

Key words and phrases:

Time discretization; Gradient flow; Fisher information; Optimal transport; Schrödinger bridge problem.

1. Introduction

Consider the general continuity equation of the form:

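$$\partial_{t}\rho=-\nabla\cdot(\rho v):=\nabla\cdot\big[\rho\nabla\big(U'(\rho)+V+W*\rho\big)\big],\qquad\rho(0,\cdot)=\rho_{0},\tag{1}$$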

where $\rho(t,x)$ , $x\in\Omega\subset\mathbb{R}^{n}$ is the particle density function, $U(\rho)$ is an internal energy, $V(x)$ is a drift potential, and $W(x,y)=W(y,x)$ is an interaction potential. $\nabla$ , $\nabla\cdot$ are gradient and divergence operators with respect to $x$ in $\Omega$ . This equation can be derived as a mean-field limit of particle systems with a number of physical and biological applications, such as granular materials [22], chemotaxis, animal swarming [19, 3], and many others. In particular, the Fokker-Planck equation [44], porous medium equation [54], aggregation equation [59, 38], Keller-Segel equation [45], and quantum drift-diffusion equation [41] all fall within this framework.

As written, equation (1) possesses two immediate properties: it preserves the non-negativity of the solution and conserves total mass. Therefore, in what follows, we will always consider nonnegative initial data with mass one, so that the solution is in the set of probability measures on $\Omega$ , $\mathcal{P}(\Omega)$ . The third property of (1) is the dissipation of the energy, which can be seen as follows. Given an energy $\mathcal{E}:\mathcal{P}(\Omega)\rightarrow\mathbb{R}\cup\{+\infty\}$ , we may formally define its gradient with respect to the quadratic Wasserstein metric $\mathcal{W}_{2}$ as

$$\nabla_{\mathcal{W}_{2}}\mathcal{E}(\rho)=-\nabla\cdot(\rho\nabla\delta\mathcal{E}),$$

where $\delta$ always denotes the first variation in $\rho$ throughout the paper. Comparing it with (1), one can write the velocity field as $v=-\nabla\delta\mathcal{E}$ , and view equation (1) as the gradient flow of the energy

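$$\mathcal{E}(\rho)=\int_{\Omega}\big[U(\rho(x))+V(x)\rho(x)\big]\mathrm{d}x+\frac{1}{2}\int_{\Omega\times\Omega}W(x-y)\rho(x)\rho(y)\,\mathrm{d}x\,\mathrm{d}y.\tag{2}$$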

Differentiating the energy (2) along solutions of (1), one formally obtains the energy decay along the gradient flow, $\frac{d}{dt}\mathcal{E}(\rho)(t)=-\int_{\Omega}|v(t,x)|^{2}\rho(t,x)\mathrm{d}x$ , which indicates that the solution evolves in the direction of steepest descent of the energy. This property entails a full characterization of the set of stationary states and provides a necessary tool for studying their stability.

A desirable numerical method for (1) should retain all three properties above at the discrete level, which is rather challenging. Existing methods exploit different aspects of the equation. One class views it as an advection-diffusion equation and employs finite difference, finite volume, or discontinuous Galerkin discretizations [39, 31, 26, 52, 58, 1]. Such methods are explicit or semi-implicit in time, so the cost per time step is low. But they often suffer from stability constraints, due either to the degeneracy of the diffusion or to the non-locality of the interaction potential, such as the mesa problem [51]. Another approach leverages structural similarities between (1) and equations from fluid dynamics to develop particle methods [36, 12, 9, 53, 15]. On the one hand, particle methods naturally conserve mass and positivity, and they can also be designed to respect the underlying gradient flow structure of the equation so as to dissipate the energy in time. On the other hand, a large number of particles is often required to resolve finer properties of solutions.

A third class of methods builds on a variational formulation, following the seminal work by Jordan, Kinderlehrer, and Otto [44]. Given a time step $\tau>0$ , the scheme (known as the JKO scheme) recursively defines a sequence $\rho^{k}$ as

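$$\rho^{0}=\rho_{0},\qquad\rho^{k+1}=\arg\min_{\rho\in\mathbb{K}}\big\{\mathcal{W}_{2}^{2}(\rho,\rho^{k})+2\tau\mathcal{E}(\rho)\big\},\tag{3}$$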

where $\mathbb{K}=\left\{\rho:\rho\in\mathcal{P}(\Omega),~\int_{\Omega}|x|^{2}\rho\,\mathrm{d}x<+\infty\right\}$ and $\mathcal{W}_{2}$ denotes the quadratic Wasserstein distance between two probability measures. The scheme (3) thus offers a positivity preserving, energy dissipating, and unconditionally stable time discretization. A major bottleneck of this approach is the computation of $\mathcal{W}_{2}^{2}(\rho,\rho^{k})$ , which is itself an infinite dimensional minimization problem. Hence, early works that use (3) avoid directly computing $\mathcal{W}_{2}^{2}(\rho,\rho^{k})$ , either by linearization [42, 11] or by diffeomorphisms [10, 23, 24], which leads to methods that lose some of the properties inherited from (3) or are limited by complicated geometry and structure. Only recent progress in computing $\mathcal{W}_{2}$ has enabled the direct application of (3) [56, 6, 32, 27].

In the present work, we will adopt Benamou-Brenier’s dynamic formulation for the Wasserstein distance [5]. In particular, given two measures $\rho_{0}$ and $\rho_{1}$ , their Wasserstein distance can be obtained by solving

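$$\begin{cases}\mathcal{W}_{2}(\rho_{0},\rho_{1})=\inf_{\rho,v}\Big\{\int_{0}^{1}\int_{\Omega}|v(t,x)|^{2}\rho(t,x)\,\mathrm{d}x\,\mathrm{d}t\Big\}^{1/2},\\ \textrm{s.t.}\quad\partial_{t}\rho+\nabla\cdot(\rho v)=0,\quad(\rho v)\cdot\nu=0~\textrm{on}~\partial\Omega\times[0,1],\\ \rho(0,x)=\rho_{0}(x),\quad\rho(1,x)=\rho_{1}(x),\end{cases}\tag{4}$$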

where $\nu$ is the outer unit normal on the boundary of the domain $\Omega$ . Inserting (4) into (3) and letting $m=\rho v$ , we obtain the following computable reformulation of the JKO scheme: given $\rho^{k}(x)$ , set $\rho^{k+1}(x)=\rho(1,x)$ with $\rho(t,x)$ solving

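$$\begin{cases}(\rho,m)=\arg\inf_{\rho,m}\int_{0}^{1}\int_{\Omega}F(\rho,m)\,\mathrm{d}x\,\mathrm{d}t+2\tau\mathcal{E}(\rho(1,\cdot))\\ \textrm{s.t.}\quad\partial_{t}\rho+\nabla\cdot m=0,\quad\rho(0,x)=\rho^{k}(x),\quad m\cdot\nu=0,\end{cases}\tag{5}$$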

where

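$$F(\rho,m)=\begin{cases}\frac{\|m\|^{2}}{\rho}&\textrm{if }\rho>0,\\ 0&\textrm{if }(\rho,m)=(0,0),\\ +\infty&\textrm{otherwise.}\end{cases}$$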

There are two sources of difficulty in solving (5)–(9). One lies in the non-smoothness of $F$ , which prevents the use of the second order information that is often used to accelerate the optimization. Occasionally, erroneous solutions near $\rho=0$ may be produced. The other comes from the artificial time introduced in the dynamic formulation (4), which increases the dimension of the problem.

To overcome these two issues, we propose the following scheme: $\rho^{k+1}=\rho(1,x)$ where

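$$\begin{dcases}\rho^{k+1}(x)=\arg\inf_{\rho,m}\int_{\Omega}\Big(\frac{\|m(x)\|^{2}}{\rho(x)}+\beta^{-2}\tau^{2}\|\nabla\log\rho(x)\|^{2}\rho(x)\Big)\mathrm{d}x+2\tau\mathcal{E}(\rho)\\ \textrm{s.t.}\quad\rho(x)-\rho^{k}(x)+\nabla\cdot m(x)=0,\quad m\cdot\nu=0.\end{dcases}\tag{10}$$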

The additional term, $\int_{\Omega}\|\nabla\log\rho(x)\|^{2}\rho(x)\mathrm{d}x$ , is the Fisher information functional. It keeps $\rho$ away from zero (see Theorem 5) and thus simplifies $\int_{\Omega}F(\rho,m)\mathrm{d}x$ to just $\int_{\Omega}\frac{\|m\|^{2}}{\rho}\mathrm{d}x$ . More importantly, it improves the convexity of the cost functional and gives access to second order sequential quadratic programming, which enjoys much faster convergence. In addition, we replace the time derivative in the dynamics by a one step finite difference. We shall show in Theorem 3 that such a simplification does not violate the first order accuracy of the original JKO formulation (5). Furthermore, as we shall see in Section 3, compared to the classical backward Euler method, which may suffer from an ill-conditioned Jacobian [1], (10) provides a symmetric, structure preserving version of the implicit method that is insensitive to the condition number of the Hessian and has guaranteed convergence. We note that the relation between the Fisher information and the Schrödinger bridge problem (SBP) can be seen from [13, 29, 35, 46, 48], and will be further discussed in Section 2.

It is also important to mention that the Fisher information regularization is closely related to the entropic regularization that has been successfully applied in many optimal transport problems [37, 56, 14, 40, 32]. There, the Kantorovich formulation based on the joint distribution $\pi(x,y)$ between two measures is adopted, and the entropic regularization term $\int\int\pi(x,y)\log\pi(x,y)\mathrm{d}x\mathrm{d}y$ is added to the cost function, so that an iterative projection method, such as the Sinkhorn method or the more general Dykstra’s algorithm, can be applied with linear convergence. When applied to gradient flow problems, this method has a major difficulty in computing the proximal operator of the energy (2) with respect to the Kullback-Leibler divergence, which does not have a closed form in general.
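For readers unfamiliar with the iterative projection idea mentioned above, the following is a minimal sketch of the Sinkhorn iteration for entropic optimal transport between two histograms on a 1D grid. It is not taken from the paper: the function name `sinkhorn`, the grid, the two Gaussian-shaped marginals, and the value of `eps` are all illustrative assumptions.

```python
import numpy as np


def sinkhorn(a, b, cost, eps=0.1, n_iter=2000):
    """Illustrative Sinkhorn iteration (hypothetical helper, not the paper's code)."""
    K = np.exp(-cost / eps)            # Gibbs kernel of the regularized problem
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)                # rescale to match the first marginal
        v = b / (K.T @ u)              # rescale to match the second marginal
    return u[:, None] * K * v[None, :] # joint plan pi(x, y)


# Assumed test data: two normalized bumps on [0, 1] with squared-distance cost.
x = np.linspace(0.0, 1.0, 50)
a = np.exp(-((x - 0.3) ** 2) / 0.01)
a /= a.sum()
b = np.exp(-((x - 0.7) ** 2) / 0.02)
b /= b.sum()
cost = (x[:, None] - x[None, :]) ** 2
pi = sinkhorn(a, b, cost)
```

Each half-step is a closed-form rescaling, which is what makes the method attractive for plain transport problems; the difficulty described above arises once an energy term has to be handled inside the same iteration.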

The paper is organized as follows. In the next section, we provide the necessary background on the dynamical formulation of the Schrödinger bridge problem (SBP) and its relation to Wasserstein gradient flows and the Fisher information functional. We then derive the Fisher information regularized semi-discrete scheme at the end of that section. In Section 3, we introduce a fully discrete scheme and study the properties of this new scheme. Numerical results are provided in Section 4, and the paper is concluded in Section 5.

2. Semi-discretization with Fisher information regularization

In this section, we briefly review the Schrödinger bridge problem, Fisher information regularization and Wasserstein gradient flow. We then weave together these ideas to derive our new regularized time discretization.

2.1. Schrödinger Bridge problem and Fisher regularization

Consider a bounded convex domain $\Omega\subset\mathbb{R}^{n}$ , and the probability density space

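$$\mathcal{P}(\Omega)=\Big\{\rho\in L^{1}(\Omega)\colon\int_{\Omega}\rho(x)\,\mathrm{d}x=1,~\rho(x)\geq 0\Big\}.$$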

Definition 1 (Schrödinger bridge problem).

Denote $\textrm{SBP}\colon\mathcal{P}(\Omega)\times\mathcal{P}(\Omega)\rightarrow\mathbb{R}$ . Given $\rho^{0}$ , $\rho^{1}\in\mathcal{P}(\Omega)$ , let

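$$\textrm{SBP}(\rho^{0},\rho^{1})=\inf_{\rho,b}\int_{0}^{1}\int_{\Omega}\|b(t,x)\|^{2}\rho(t,x)\,\mathrm{d}x\,\mathrm{d}t,\tag{11}$$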

where the infimum is taken among all drift functions $b\colon[0,1]\times\Omega\rightarrow\mathbb{R}^{n}$ and density functions $\rho\colon[0,1]\times\Omega\rightarrow\mathbb{R}$ satisfying the Fokker-Planck equation

$$\partial_{t}\rho(t,x)+\nabla\cdot(\rho(t,x)b(t,x))=\beta^{-1}\tau\Delta\rho(t,x),\tag{12}$$

with the fixed initial and ending density functions

$$\rho(0,x)=\rho^{0}(x),\quad\rho(1,x)=\rho^{1}(x),\quad x\in\Omega,$$

and Neumann boundary condition for $b$ : $b\cdot\nu=0$ on $[0,1]\times\partial\Omega$ .

Here, $\beta$ , $\tau$ are two given constant parameters in $\mathbb{R}$ . Note specifically that when $\beta^{-1}\tau=0$ , $\textrm{SBP}(\rho^{0},\rho^{1})$ equals the squared Wasserstein-2 distance between $\rho^{0}$ and $\rho^{1}$ , and problem (11) is equivalent to the Benamou–Brenier formula [7]. We use the product $\beta^{-1}\tau$ as the regularization parameter only to facilitate the later derivation for the gradient flow.

Interestingly, the variational problem (11) has the following symmetric reformulation.

Proposition 2 (Fisher information regularization).

Denote $\mathcal{H}(\rho)=\int_{\Omega}\rho(x)\log\rho(x)\mathrm{d}x$ , then

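$$\textrm{SBP}(\rho^{0},\rho^{1})=\inf_{\rho,m}\int_{0}^{1}\int_{\Omega}\Big(\frac{\|m\|^{2}}{\rho}+\beta^{-2}\tau^{2}\rho\|\nabla\delta\mathcal{H}\|^{2}\Big)\mathrm{d}x\,\mathrm{d}t+2\beta^{-1}\tau\big(\mathcal{H}(\rho^{1})-\mathcal{H}(\rho^{0})\big)\tag{13}$$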

subject to the dynamical constraint

$$\partial_{t}\rho(t,x)+\nabla\cdot m(t,x)=0,\tag{14}$$

with initial and boundary conditions:

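$$\rho(0,x)=\rho^{0}(x),\quad\rho(1,x)=\rho^{1}(x),\quad x\in\Omega,\qquad m\cdot\nu=0,\quad(t,x)\in[0,1]\times\partial\Omega.$$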

Proof.

First, rewrite the Fokker-Planck equation (12) as follows:

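$$0=\partial_{t}\rho+\nabla\cdot(\rho b)-\beta^{-1}\tau\Delta\rho=\partial_{t}\rho+\nabla\cdot\big(\rho(b-\beta^{-1}\tau\nabla\delta\mathcal{H})\big).$$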

where we notice the fact that $\delta\mathcal{H}(\rho)=\log\rho+1$ , and $\nabla\cdot(\rho\nabla\delta\mathcal{H}(\rho))=\nabla\cdot(\rho\nabla\log\rho)=\nabla\cdot(\nabla\rho)=\Delta\rho$ .

Denote $v=b-\beta^{-1}\tau\nabla\delta\mathcal{H}$ , and let $m=\rho v$ , then (12) reduces to (14). Next, we shall show that with the above definition of $m$ , the cost functional in (13) is the same as that in (11). Indeed,

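$$\begin{aligned}\int_{0}^{1}\int_{\Omega}\|b(t,x)\|^{2}\rho(t,x)\,\mathrm{d}x\,\mathrm{d}t&=\int_{0}^{1}\int_{\Omega}\|v(t,x)+\beta^{-1}\tau\nabla\delta\mathcal{H}(\rho)(t,x)\|^{2}\rho(t,x)\,\mathrm{d}x\,\mathrm{d}t\\ &=\int_{0}^{1}\int_{\Omega}\Big\{\|v\|^{2}\rho+\beta^{-2}\tau^{2}\|\nabla\delta\mathcal{H}\|^{2}\rho+2\beta^{-1}\tau\rho v\cdot\nabla\delta\mathcal{H}\Big\}\mathrm{d}x\,\mathrm{d}t\\ &=\int_{0}^{1}\int_{\Omega}\Big\{\frac{\|m\|^{2}}{\rho}+\beta^{-2}\tau^{2}\|\nabla\delta\mathcal{H}\|^{2}\rho+2\beta^{-1}\tau m\cdot\nabla\delta\mathcal{H}\Big\}\mathrm{d}x\,\mathrm{d}t.\end{aligned}$$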

We then show that the last term in the above equation depends only on the initial and final conditions. This is seen from the fact that

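$$\begin{aligned}\int_{0}^{1}\int_{\Omega}m\cdot\nabla\delta\mathcal{H}\,\mathrm{d}x\,\mathrm{d}t&=-\int_{0}^{1}\int_{\Omega}\delta\mathcal{H}\,\nabla\cdot m\,\mathrm{d}x\,\mathrm{d}t\qquad\textrm{(integration by parts w.r.t. }x\textrm{)}\\ &=\int_{0}^{1}\int_{\Omega}\delta\mathcal{H}\,\partial_{t}\rho\,\mathrm{d}x\,\mathrm{d}t\\ &=\int_{0}^{1}\frac{\mathrm{d}}{\mathrm{d}t}\mathcal{H}(\rho)\,\mathrm{d}t=\mathcal{H}(\rho^{1})-\mathcal{H}(\rho^{0}),\end{aligned}\tag{15}$$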

where the second-to-last equality comes from the definition of the $L^{2}$ first variation. The other direction of the equivalence follows similarly. ∎

Here, the symmetric version of Schrödinger bridge problem relates to the optimal control problem of gradient flows. See related geometric studies in [47, 49]. The additional term

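$$\mathcal{I}(\rho)=\int_{\Omega}\|\nabla\delta\mathcal{H}\|^{2}\rho(x)\,\mathrm{d}x=\int_{\Omega}\|\nabla\log\rho(x)\|^{2}\rho(x)\,\mathrm{d}x$$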

in the cost functional is named the Fisher information. In the sequel, we will apply the symmetric SBP to compute the Wasserstein gradient flow with $\mathcal{I}(\rho)$ serving as a regularization. The numerical benefits of this regularization will be discussed in Section 3.
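Since $\rho\nabla\log\rho=\nabla\rho$ , the Fisher information can equivalently be written as $\int_{\Omega}\|\nabla\rho\|^{2}/\rho\,\mathrm{d}x$ . The short numerical check below compares the two forms in one dimension. It is only a sketch: the density `rho`, the grid, and the helper `trapezoid` are assumptions chosen for illustration and do not come from the paper.

```python
import numpy as np

# Assumed smooth, strictly positive density on [0, 1] with unit mass.
x = np.linspace(0.0, 1.0, 2001)
rho = 1.0 + 0.5 * np.cos(2.0 * np.pi * x)

d_rho = np.gradient(rho, x)              # finite-difference approximation of rho'
d_log_rho = np.gradient(np.log(rho), x)  # finite-difference approximation of (log rho)'


def trapezoid(f, grid):
    """Composite trapezoidal rule (hypothetical helper)."""
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(grid)))


fisher_log_form = trapezoid(d_log_rho ** 2 * rho, x)  # int |grad log rho|^2 rho dx
fisher_ratio_form = trapezoid(d_rho ** 2 / rho, x)    # int |grad rho|^2 / rho dx
```

The two quantities agree up to discretization error as long as the density stays bounded away from zero; the ratio form also makes visible why the functional grows without bound when $\rho$ approaches zero while its gradient does not.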

2.2. Energy splitting and time discretization

We are now ready to derive the main scheme (10) of this paper. Starting from the classical JKO formulation (5) of the Wasserstein gradient flow (1), we split the energy into two parts

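$$\mathcal{E}(\rho)=\big(\mathcal{E}(\rho)-\beta^{-1}\mathcal{H}(\rho)\big)+\beta^{-1}\mathcal{H}(\rho):=\mathcal{E}_{1}(\rho)+\mathcal{E}_{2}(\rho),$$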

and move $\mathcal{E}_{2}$ to the flow constraint in (5). Here $\mathcal{H}$ is the entropy $\int_{\Omega}\rho\log\rho\mathrm{d}x$ defined above. More specifically, given $\rho^{k}(x)$ , we update $\rho^{k+1}(x):=\rho(1,x)$ by solving the following new form for $\rho(1,x)$ :

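$$\begin{cases}(\rho,m)=\arg\inf_{\rho,m}\int_{0}^{1}\int_{\Omega}\frac{\|m(t,x)\|^{2}}{\rho(t,x)}\,\mathrm{d}x\,\mathrm{d}t+2\tau\mathcal{E}_{1}(\rho(1,\cdot))\\ \textrm{s.t.}\quad\partial_{t}\rho+\nabla\cdot m=\tau\nabla\cdot(\rho\nabla\delta\mathcal{E}_{2}(\rho))=\tau\beta^{-1}\Delta\rho,\quad\rho(0,x)=\rho^{k}(x),\quad(m-\tau\beta^{-1}\nabla\rho)\cdot\nu=0.\end{cases}\tag{16}$$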

Intuitively, the difference between (5) and (16) lies in the flow of $\rho(t,x)$ , $0<t<1$ , between $\rho^{k}$ and $\rho^{k+1}$ . In (5), the flow $\rho(t,x)$ is purely convective, and one controls it at the final time $t=1$ using the full energy $\mathcal{E}$ ; whereas in (16), the diffusion effect in the full energy (i.e., $\mathcal{E}_{2}$ ) is moved into the flow, so that the flow is both convective and diffusive, and therefore one only needs to control the partial energy $\mathcal{E}_{1}$ at the final time. Moreover, we observe that the flux for $\rho(1,x)$ in this new form is the same as that in the original form (5). Since (5) provides a first order approximation of $\rho(t,x)$ that resembles the backward Euler scheme, the equivalence of the fluxes implies that (16) is also a first order approximation in terms of $\tau$ . Indeed, we rewrite (5) using Lagrange multipliers

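$$\mathcal{L}_{1}(\rho,m,\phi)=\int_{0}^{1}\int_{\Omega}\Big(\frac{\|m\|^{2}}{\rho}-\rho\partial_{t}\phi-m\cdot\nabla\phi\Big)\mathrm{d}x\,\mathrm{d}t+\int_{\Omega}\rho\phi\big|_{t=0}^{t=1}\mathrm{d}x+2\tau\mathcal{E}(\rho(1,\cdot)),$$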

then the optimality condition $\delta_{\rho,m,\rho(1,\cdot)}\mathcal{L}_{1}=0$ leads to

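$$-\frac{\|m\|^{2}}{\rho^{2}}-\partial_{t}\phi=0,\quad\frac{2m}{\rho}-\nabla\phi=0,\quad\phi(1,x)+2\tau\delta\mathcal{E}(\rho(1,\cdot))=0.$$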

Therefore $\phi$ satisfies the Hamilton-Jacobi equation $\partial_{t}\phi+\frac{1}{4}|\nabla\phi|^{2}=0$ , and

$$m(1,x)=-\tau\rho(1,x)\nabla\delta\mathcal{E}(\rho(1,\cdot)).\tag{18}$$

Plugging it into the constraint PDE in (5), one gets the flux for the original gradient flow equation (1) after one time step $\tau$ . Similarly, we rewrite (16) as

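$$\mathcal{L}_{2}(\rho,m,\phi)=\int_{0}^{1}\int_{\Omega}\Big(\frac{\|m\|^{2}}{\rho}-\rho\partial_{t}\phi-m\cdot\nabla\phi+\tau\rho\nabla\phi\cdot\nabla\delta\mathcal{E}_{2}\Big)\mathrm{d}x\,\mathrm{d}t+\int_{\Omega}\rho\phi\big|_{t=0}^{t=1}\mathrm{d}x+2\tau\mathcal{E}_{1}(\rho(1,\cdot)),$$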

then the optimality condition $\delta_{m,\rho(1,\cdot)}\mathcal{L}_{2}=0$ leads to

$$\frac{2m}{\rho}-\nabla\phi=0,\quad\phi(1,x)+2\tau\delta\mathcal{E}_{2}(\rho(1,\cdot))=0.$$

Consequently, $m(1,x)=-\tau\rho(1,x)\nabla\delta\mathcal{E}_{2}(\rho(1,\cdot))$ , which, substituted back into the constraint of (16), leads to the same flux for $\rho(1,x)$ as in (18).
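For the Lagrangian $\mathcal{L}_{1}$ , the passage from the optimality conditions to the Hamilton-Jacobi equation and to the terminal flux (18) is a direct substitution. The worked steps below only restate the three conditions $-\|m\|^{2}/\rho^{2}-\partial_{t}\phi=0$ , $2m/\rho-\nabla\phi=0$ , and $\phi(1,x)+2\tau\delta\mathcal{E}(\rho(1,\cdot))=0$ and introduce no additional assumptions:

```latex
\begin{aligned}
\frac{2m}{\rho}-\nabla\phi=0
  \;&\Longrightarrow\; m=\tfrac{1}{2}\rho\nabla\phi,
  \qquad \frac{\|m\|^{2}}{\rho^{2}}=\tfrac{1}{4}|\nabla\phi|^{2},\\
-\frac{\|m\|^{2}}{\rho^{2}}-\partial_{t}\phi=0
  \;&\Longrightarrow\; \partial_{t}\phi+\tfrac{1}{4}|\nabla\phi|^{2}=0,\\
\phi(1,x)=-2\tau\,\delta\mathcal{E}(\rho(1,\cdot))
  \;&\Longrightarrow\; m(1,x)=\tfrac{1}{2}\rho(1,x)\nabla\phi(1,x)
  =-\tau\rho(1,x)\nabla\delta\mathcal{E}(\rho(1,\cdot)).
\end{aligned}
```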

Next, we rewrite (16) in line with Proposition 2. Letting $\tilde{m}=m-\tau\beta^{-1}\nabla\rho$ (so that $\partial_{t}\rho+\nabla\cdot\tilde{m}=0$ ) and plugging it into the objective function in (16), we have

[TABLE]

where the reformulation of the third term in the second equation follows (15). Omitting the tilde in (19), (16) can be reformulated as

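$$\begin{cases}\rho^{k+1}(x)=\arg\inf_{m,\rho}\int_{0}^{1}\int_{\Omega}\Big(\frac{\|m(t,x)\|^{2}}{\rho(t,x)}+\beta^{-2}\tau^{2}\rho(t,x)\|\nabla\delta\mathcal{H}(t,x)\|^{2}\Big)\mathrm{d}t\,\mathrm{d}x+2\tau\mathcal{E}(\rho(1,\cdot))\\ \textrm{s.t.}\quad\partial_{t}\rho+\nabla\cdot m=0,\quad\rho(0,x)=\rho^{k}(x),\quad m\cdot\nu=0.\end{cases}\tag{20}$$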

In practice, we want to remove the additional dimension $t$ induced by the flow. We therefore approximate the derivative in $t$ in the constraint PDE of (20) by a one step difference and the time integral in the objective function by a one term quadrature. This leads to our main scheme (10). In the following theorem, we show that such an approximation does not violate the first order accuracy of the original JKO scheme.
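The two approximations can be written out explicitly. The display below is our reading of the reduction, under the assumption that the single quadrature node is the end time $t=1$ , where $\rho(1,x)=\rho(x)$ is the unknown and $\rho(0,x)=\rho^{k}(x)$ ; $g$ stands for a generic integrand and is introduced only for illustration:

```latex
\partial_{t}\rho(t,x)\;\approx\;\rho(1,x)-\rho(0,x)=\rho(x)-\rho^{k}(x),
\qquad
\int_{0}^{1}g(t)\,\mathrm{d}t\;\approx\;g(1).
```

With these replacements the constraint $\partial_{t}\rho+\nabla\cdot m=0$ becomes $\rho(x)-\rho^{k}(x)+\nabla\cdot m(x)=0$ and the time integral disappears from the cost, so the unknowns are functions of $x$ only.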

Theorem 3 (Fisher information regularization scheme).

The minimizer of the variational problem (10) is a first-order time consistent scheme for Wasserstein gradient flow (1).

Proof.

First it is straightforward to check that variational problem (10) is strictly convex. We then solve it by the Lagrange multiplier method. Define the Lagrangian as:

[TABLE]

The critical point of the above variational problem satisfies

[TABLE]

The solution $(m,\rho)$ of the above system satisfies

[TABLE]

Denote the solution of the above system by $\rho=\rho^{k+1}$ . We then derive the following update

[TABLE]

Therefore scheme (10) is a first order time discretization. ∎

It is worth mentioning that there are several cases in which the Fisher information regularization scheme is exact for the computation of gradient flows.

Proposition 4 (Exact cases).

If $\beta=\sqrt{\tau/2}$ , then the iterative scheme (10) is a first order scheme for the equation

[TABLE]

Proof.

If $\beta=\sqrt{\tau/2}$ , then $\beta^{-2}\tau^{2}=2\tau$ , the scheme (10) becomes

[TABLE]

Following Theorem 3, the algorithm is a consistent first order time discretization of the Wasserstein gradient flow for the functional $\mathcal{I}(\rho)+\mathcal{E}(\rho)$ . ∎

Several remarks are in order.

Remark 1 (Comparison with the classical JKO scheme).

Compared with the classical approach of JKO (5), our method does not require any inner time interpolation in the underlying dynamical formulation. It still preserves the first order time accuracy of the time discretization.

Remark 2 (Schrödinger Bridge problem proximal).

The variational problem (20) can be viewed as a Schrödinger bridge proximal method of Wasserstein gradient flows.

Remark 3 (Comparison with entropic regularization of gradient flow [57]).

We also compare our method with the entropic gradient flow studied in [14, 57]. A known fact is that when $\mathcal{H}(\rho)=\int\rho(x)\log\rho(x)\mathrm{d}x$ , the SBP problem has the static formulation [55]

[TABLE]

where $\alpha\geq 0$ is a constant and the infimum is over all joint histograms $\pi(x,y)\geq 0$ with marginals $\rho^{0}(x)$ , $\rho^{1}(y)$ . In [14, 57], the algorithm applies the above static formulation and considers the iterative regularization algorithm for the computation of gradient flow. Our formulation mainly uses the dynamical formulation of SBP, especially its time symmetric version in Proposition 2.

Remark 4 (Generalized regularization functional).

Besides using $\mathcal{H}(\rho)=\int\rho\log\rho\mathrm{d}x$ , we can also study other types of regularizations, e.g., $\mathcal{H}(\rho)=\frac{1}{(1-\gamma)(2-\gamma)}\int(\rho^{2-\gamma}-1)\mathrm{d}x$ . We leave these studies for future work.

3. Full discretization and optimization algorithm

In this section, we detail the spatial discretization and provide a complete algorithm for the fully discrete problem. The underlying principle for spatial discretization is to preserve the structure of the Wasserstein metric tensor in the discrete sense, so that it can be easily adapted to unstructured grids and to more complicated equations with energies involving high order derivatives. Thanks to the Fisher information regularization, the resulting optimization problem is strictly convex and therefore gives access to second order Newton type optimization algorithms.

3.1. Spatial Discretization

To better explain the idea, we first consider the discretization in one spatial dimension on a uniform grid. Let $[0,L]$ be the computational domain and let $\Delta x$ and $\tau$ be the spatial mesh size and the time step, respectively. Choose $0=x_{\frac{1}{2}}<x_{\frac{3}{2}}<\cdots<x_{N_{x}+\frac{1}{2}}=L$ , and define

[TABLE]

where $x_{j}=j\Delta x,~x_{j+\frac{1}{2}}=(j+\frac{1}{2})\Delta x$ , and $t_{k}=k\tau$ . Since $m_{\frac{1}{2}}^{k}=m_{N_{x}+\frac{1}{2}}^{k}=0$ by the boundary condition, the cost function in scheme (10) can be discretized as

[TABLE]

where $\mathcal{E}(\boldsymbol{\rho})$ in its general form reads

[TABLE]

Here $\boldsymbol{\rho}$ and $\mathbb{m}$ collect the grid values $\rho_{j}$ and $m_{j+\frac{1}{2}}$ into vectors, i.e., $\boldsymbol{\rho}=(\rho_{1},\rho_{2},\cdots,\rho_{N_{x}})$ and $\mathbb{m}=(m_{1/2},m_{3/2},\cdots,m_{N_{x}+1/2})$ . The constraint is discretized with a central difference in space as follows

[TABLE]

and the zero boundary conditions $m_{\frac{1}{2}}=m_{N_{x}+\frac{1}{2}}=0$ are applied.
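The one-dimensional construction can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the paper's discretization: densities live at cell centres, fluxes at cell faces with the two boundary fluxes set to zero, interface densities are taken as arithmetic averages of neighbouring cells, and the names `step_density`, `regularized_cost`, and the entropy used as example energy are all hypothetical.

```python
import numpy as np


def step_density(rho_k, m, dx):
    """Discrete constraint: rho_j - rho^k_j + (m_{j+1/2} - m_{j-1/2}) / dx = 0."""
    return rho_k - np.diff(m) / dx


def regularized_cost(rho, m, dx, tau, beta, energy):
    """Assumed 1D version of the cost in scheme (10), with arithmetic-mean face densities."""
    rho_face = 0.5 * (rho[1:] + rho[:-1])      # density at interior faces
    m_int = m[1:-1]                            # interior fluxes (boundary fluxes are zero)
    kinetic = np.sum(m_int ** 2 / rho_face) * dx
    dlog = np.diff(np.log(rho)) / dx           # difference quotient of log(rho) across faces
    fisher = np.sum(dlog ** 2 * rho_face) * dx
    return kinetic + (tau / beta) ** 2 * fisher + 2.0 * tau * energy(rho, dx)


nx, length = 64, 1.0
dx = length / nx
x = (np.arange(nx) + 0.5) * dx                 # assumed cell centres
rho_k = 1.0 + 0.3 * np.cos(np.pi * x)
rho_k /= np.sum(rho_k) * dx                    # normalize to unit mass

m = np.zeros(nx + 1)                           # fluxes at the nx + 1 faces
m[1:-1] = 1e-3 * np.sin(np.pi * np.arange(1, nx) * dx)  # first and last entries stay zero

rho_next = step_density(rho_k, m, dx)
entropy = lambda r, h: np.sum(r * np.log(r)) * h        # example energy with U(rho) = rho log(rho)
cost = regularized_cost(rho_next, m, dx, tau=1e-2, beta=1.0, energy=entropy)
```

Because the differences of the face fluxes telescope and the two boundary fluxes vanish, any density produced by `step_density` has the same total mass as `rho_k`, whatever interior flux is chosen.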

The extension to two dimensions is straightforward. Denote

[TABLE]

then the no-flux boundary conditions on $m$ are imposed dimension by dimension, i.e.,

[TABLE]

The cost function then reads

[TABLE]

and the constraint becomes

[TABLE]

Generalizations to graphs can be found in related studies [33, 34].

Upon spatial discretization, we therefore have the following finite dimensional variational problem:

[TABLE]

Here ${\bf i}$ is a multi-index (e.g., ${\bf i}=(j,l)$ in two dimensions), ${\bf\Delta x}_{\bf i}=\Delta x$ or $\Delta y$ , ${\bf\Delta x}=\Pi_{{\bf i}}{\bf\Delta x}_{\bf i}$ . Written in this way, the discretization can be directly generalized to unstructured grids.

Remark 5.

In practice, we impose the non-negativity constraint $\rho_{\bf i}\geq 0$ to avoid unexpected negative values when the optimization is not fully converged, i.e., when the iteration terminates as soon as the stopping criterion is met. However, as we will show in Theorem 5, non-negativity is preserved when the underlying optimization problem is solved exactly.

Denote the minimizer of (26) by $(\boldsymbol{\rho}^{*},\mathbb{m}^{*})$ , so that $\boldsymbol{\rho}^{k+1}=\boldsymbol{\rho}^{*}$ . We now study the properties of problem (26). Since the constraints contain both equalities and inequalities, we will demonstrate that the Fisher information regularization plays the crucial role of a penalty function, which forces the density to stay in the interior of the probability simplex. We next prove several properties of the proposed algorithm.

Theorem 5.

For each $k\in\mathbb{N}_{+}$ , the following properties hold for scheme (26):

(i)

There exists a unique minimizer $\rho^{k+1}$ for the problem;

(ii)

The modified energy decays

[TABLE]

(iii)

There exists a constant $c>0$ , such that

[TABLE]

(iv)

The total mass is conserved

[TABLE]

Proof.

(i) The proof is based on the result of [50]. For completeness, we present it here. We shall show that

(1)

The discrete Fisher information functional is positive infinity on the boundary of the probability set. Thus the minimizer of (26) is attained in the interior of the simplex.

(2)

The optimization problem (26) is strictly convex in the interior of the constraint set.

For notational convenience, we denote

[TABLE]

We first show that the minimizer of (26) in terms of $\rho$ is strictly positive. This is true since $\mathcal{I}(\rho)$ is positive infinity on the boundary of the simplex, i.e.

[TABLE]

where $V$ is the vertex set of the discretization. Suppose the above is not true. Then there exists a constant $M>0$ such that, for some $i^{*}\in V$ with $\rho_{i^{*}}=0$ ,

[TABLE]

where $E$ is the edge set of the discretization. Notice that each term in (27) is non-negative, thus

[TABLE]

for any edge $(i,i+e_{v})\in E$ . Since $\rho_{i^{*}}=0$ , the above formula further implies that $\rho_{\tilde{\imath}}=0$ for any $\tilde{\imath}\in N(i^{*})$ . Indeed, if $\rho_{\tilde{\imath}}\neq 0$ , we would have

[TABLE]

Similarly, we show that $\rho_{\tilde{\tilde{\imath}}}=0$ for any node $\tilde{\tilde{\imath}}\in N(\tilde{\imath})$ . Here $N(\tilde{\imath})$ is the neighborhood of node $\tilde{\imath}$ in the discretization grid. We iterate the above step a finite number of times. Since the lattice graph is connected and the set $V$ is finite, we obtain $\rho_{i}=0$ for any $i\in V$ . This contradicts the assumption that $\sum_{i\in V}\rho_{i}=\textrm{Constant}$ , which finishes the proof.
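The boundary blow-up used in this step can be checked numerically. The sketch below assumes that the discrete Fisher information takes the edge-sum form $\sum_{(i,j)\in E}(\log\rho_{i}-\log\rho_{j})(\rho_{i}-\rho_{j})/\Delta x^{2}$ on a 1D lattice; this form, the function name `fisher_information` and the parameter `dx` are illustrative assumptions and not a transcription of (27).

```python
import numpy as np

def fisher_information(rho, dx=1.0):
    """Edge-sum discrete Fisher information on a 1D lattice (assumed form)."""
    rho = np.asarray(rho, dtype=float)
    d_rho = rho[1:] - rho[:-1]
    d_log = np.log(rho[1:]) - np.log(rho[:-1])
    # Each edge term is non-negative because log is increasing.
    return float(np.sum(d_log * d_rho) / dx**2)

# Push one node toward zero while keeping the total mass equal to 1.
values = []
for eps in [1e-2, 1e-4, 1e-8, 1e-16]:
    rho = np.array([eps, 0.5 - eps / 2, 0.5 - eps / 2])
    values.append(fisher_information(rho))
    print(f"rho_0 = {eps:.0e}  ->  I(rho) = {values[-1]:.3f}")
```

Under this assumed form the value grows without bound (logarithmically) as a single node approaches zero while its neighbor stays positive, which is the mechanism the contradiction argument relies on.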

We now prove that $\mathcal{I}(\rho)$ is strictly convex in the variable $\rho$ with a constraint $\sum_{i\in V}\rho_{i}=\textrm{Constant}$ , $\rho_{i}>0$ , for any $i\in V$ . We shall show

[TABLE]

Here $\mathcal{I}_{\rho\rho}=(\frac{\partial^{2}\mathcal{I}(\rho)}{\partial\rho_{i}\partial\rho_{j}})_{i\in V,j\in V}\in\mathbb{R}^{|V|\times|V|}$ , and $\sum_{i\in V}\sigma_{i}=0$ is the constraint for $\rho$ lying on the simplex set. Notice the fact that

[TABLE]

where

[TABLE]

Hence

[TABLE]

where $\frac{1}{2}$ is due to the convention that each edge $(i,j)\in E$ is summed twice.

We next show that the strict inequality in (28) holds. Suppose it does not; then there exists a unit vector $\sigma^{*}$ such that

[TABLE]

Then $\frac{\sigma_{1}^{*}}{\rho_{1}}=\frac{\sigma_{2}^{*}}{\rho_{2}}=\cdots=\frac{\sigma_{|V|}^{*}}{\rho_{|V|}}$ . Denote this common ratio by $c$ , so that $\sigma_{i}^{*}=c\rho_{i}$ . Combining this with the constraint $\sum_{i\in V}\sigma_{i}^{*}=0$ and $\rho_{i}>0$ gives $c\sum_{i\in V}\rho_{i}=0$ , hence $c=0$ and $\sigma_{1}^{*}=\sigma_{2}^{*}=\cdots=\sigma_{|V|}^{*}=0$ , which contradicts that $\sigma^{*}$ is a unit vector.

Second, we show that $\mathcal{K}(m,\rho)+\beta^{-2}\tau^{2}\mathcal{I}(\rho)$ is strictly convex in $(m,\rho)$ . Since $(m,\rho)$ is in the interior of the optimization domain, we have $\rho_{i}>0$ , and thus the objective function is smooth. We shall show that $\lambda(m,\rho)>0$ , where

[TABLE]

subject to

[TABLE]

Here, $\lambda(m,\rho)$ is the smallest eigenvalue of the Hessian matrix of the objective function along tangent vectors $(h,\sigma)$ . We first show that $\mathcal{K}(m,\rho)$ is a smooth, convex function in the interior of the simplex. We have

[TABLE]

Since $\frac{x^{2}}{y}$ is jointly convex in $(x,y)$ and non-increasing in $y$ for $y>0$ , and $\rho_{i}+\rho_{i+e_{v}}$ is concave in the variables $\rho_{i}$ , $\rho_{i+e_{v}}>0$ , the composition is convex, and hence $\mathcal{K}$ is convex. From (28), we have

[TABLE]

We claim that the inequality in (31) is strict. Suppose there exists $(h^{*},\sigma^{*})$ , such that (31) is zero, i.e.

[TABLE]

In this case, from (28), $\sigma^{*}=0$ . Thus (31) reduces to

[TABLE]

Since $\mathcal{K}_{mm}=\textrm{diag}(\frac{4}{\rho_{i}+\rho_{i+e_{v}}})_{i+\frac{e_{v}}{2}\in E}$ is strictly positive, we have $h^{*}=0$ , which contradicts the fact that $h^{T}h+\sigma^{T}\sigma=1$ . From the above statements, we conclude that there exists a unique solution $\rho^{k+1}$ .
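The joint convexity of $\frac{x^{2}}{y}$ for $y>0$ , used for $\mathcal{K}$ in part (i), can be read off from its Hessian. The following short computation is a standard fact, included here for reference:

```latex
\nabla^{2}\!\left(\frac{x^{2}}{y}\right)
=\begin{pmatrix}\dfrac{2}{y} & -\dfrac{2x}{y^{2}}\\[2mm] -\dfrac{2x}{y^{2}} & \dfrac{2x^{2}}{y^{3}}\end{pmatrix}
=\frac{2}{y^{3}}\begin{pmatrix}y\\ -x\end{pmatrix}\begin{pmatrix}y & -x\end{pmatrix}\succeq 0,
\qquad y>0.
```

Since this Hessian has rank one and is only positive semi-definite, the kinetic term alone is convex but not strictly convex, which is consistent with the role played by the Fisher information term in the argument above.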

(ii) Denote by $(m^{*},\rho^{k+1})$ the minimizer of the variational problem (26). Then

[TABLE]

This further implies

[TABLE]

which finishes the proof.

(iii) holds since $\mathcal{I}(\rho)$ goes to infinity on the boundary of simplex set.

(iv) is true, because the continuity equation in (26) satisfies

[TABLE]

This finishes the proof. ∎
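Property (iv) can be illustrated in one dimension. The sketch below assumes that the discrete continuity constraint has the form $\rho^{k+1}_{i}-\rho^{k}_{i}+(m_{i+1/2}-m_{i-1/2})/\Delta x=0$ with zero flux through the two boundary faces; the function name `apply_continuity` and the random test data are illustrative and not taken from (26).

```python
import numpy as np

def apply_continuity(rho_old, m_interior, dx):
    """Return rho_new with rho_new - rho_old + div(m) = 0 and zero boundary flux."""
    # Fluxes live on cell faces; the two boundary faces carry no flux.
    m_faces = np.concatenate(([0.0], m_interior, [0.0]))
    div_m = (m_faces[1:] - m_faces[:-1]) / dx
    return rho_old - div_m

rng = np.random.default_rng(0)
n, dx = 50, 0.1
rho_old = rng.random(n)
m_interior = rng.standard_normal(n - 1)
rho_new = apply_continuity(rho_old, m_interior, dx)

mass_old = rho_old.sum() * dx
mass_new = rho_new.sum() * dx
print(abs(mass_new - mass_old))  # zero up to round-off: the divergence telescopes
```

Any flux field leaves the total mass unchanged here, because summing the discrete divergence over all cells leaves only the two boundary fluxes, which are zero.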

3.2. Optimization method

To solve (26), we first rewrite our problem into a vector form. Let $u=(\boldsymbol{\rho},\mathbb{m})$ , then (26) can be written as

[TABLE]

where $F(u)$ is defined in (22), $\mathsf{A}$ is the matrix representation of the constraint (23) or (25), and $\mathsf{S}$ is a selection matrix that only selects the $\rho$ components in $u$ . Let $\chi$ be the indicator function, then (32) can be further reformulated as

[TABLE]

Here $F(u)$ , defined in either (22) or (24), is a smooth, convex function (provided $\mathcal{E}$ is convex), and the indicator function $\chi$ is also convex. Therefore we adopt (approximate) sequential quadratic programming to solve it:

[TABLE]

where $\mathsf{H}^{(l)}$ is either the Hessian $\nabla^{2}F(u^{(l)})$ or an approximation of it.

[TABLE]

There are several approaches to solving the subproblem in line 6. Among them, we have tried the interior point method, the projected preconditioned conjugate gradient method [43], and the first-order fast iterative shrinkage-thresholding algorithm (FISTA) [4]. In our case, where $\mathsf{H}^{(l)}$ is sparse but ill-conditioned, we found that the MATLAB built-in function ‘quadprog’ with the interior point solver performs the best.
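To make the structure of one outer iteration concrete, the following is a minimal sketch of a Newton-type step with a linear equality constraint, solved through its KKT system. It is not the `quadprog` interior point solver mentioned above: the inequality constraint is handled only by a crude step halving, and the toy objective $F(u)=\sum_{i}u_{i}\log u_{i}+c\cdot u$ , the function name `sqp_step` and all numerical values are assumptions chosen for illustration.

```python
import numpy as np

def sqp_step(u, grad, H, A, b, t=1.0):
    """One equality-constrained Newton-type step via the KKT system of the quadratic model."""
    n, p = u.size, b.size
    # Model: min grad.d + (1/(2t)) d^T H d  subject to  A (u + d) = b.
    kkt = np.block([[H / t, A.T], [A, np.zeros((p, p))]])
    rhs = np.concatenate([-grad, b - A @ u])
    d = np.linalg.solve(kkt, rhs)[:n]
    # Halve the step until the iterate stays strictly positive.
    s = 1.0
    while np.any(u + s * d <= 0):
        s *= 0.5
    return u + s * d

# Toy objective (an assumption for illustration): F(u) = sum(u log u) + c.u
c = np.array([0.0, 0.5, 1.0])
A = np.ones((1, 3))           # mass constraint sum(u) = 1
b = np.array([1.0])
u = np.full(3, 1.0 / 3.0)
for _ in range(20):
    grad = np.log(u) + 1.0 + c
    H = np.diag(1.0 / u)      # exact Hessian of the toy objective
    u = sqp_step(u, grad, H, A, b)

u_exact = np.exp(-c) / np.exp(-c).sum()
print(np.max(np.abs(u - u_exact)))
```

For this toy problem the constrained minimizer is known in closed form ( $u\propto e^{-c}$ ), which makes it easy to confirm that the iteration converges; a production solver would replace the step halving with a proper treatment of the inequality constraints.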

3.3. Convergence

In this section, we analyze the convergence of (34), especially the role that $\mathsf{H}$ plays. We have the following assumptions:

(A1)

$m\mathsf{I}\preceq\nabla^{2}F\preceq M\mathsf{I}$ , $M\geq m>0$ ;

(A2)

the subproblem in (34) is solved exactly.

First we note that $u^{(l+1)}$ can be rewritten as

[TABLE]

where $\|u\|_{\mathsf{H}^{(l)}}^{2}=u^{T}{\mathsf{H}^{(l)}}u$ . Further, let $u^{*}$ be the unique minimizer of (33); then $u^{*}$ solves

[TABLE]

where $t>0$ and $\mathsf{H}$ is any positive definite matrix. Our first result is that, when ${\mathsf{H}^{(l)}}$ is only an approximation of $\nabla^{2}F(u^{(l)})$ , we obtain first order convergence with a rate that depends on the condition number of ${\mathsf{H}^{(l)}}^{-1}\nabla^{2}F(u^{(l)})$ . More specifically, we have

Theorem 6.

Consider a uniform time step $t$ . Let $G_{l}=\int_{0}^{1}\nabla^{2}F(u^{*}+s(u^{(l)}-u^{*}))\mathrm{d}s$ , then

[TABLE]

where $\kappa$ is the condition number of ${\mathsf{H}^{(l)}}^{-1}G_{l}$ .

To prove the above theorem, we need the following lemma on the contraction of the proximal operator.

Lemma 7.

If $u=\text{prox}_{\chi}^{\mathsf{H}}(x)$ , $v=\text{prox}_{\chi}^{\mathsf{H}}(y)$ , where $\chi$ is a convex function, and $\mathsf{H}$ is a positive definite matrix, then we have $(u-v)^{T}\mathsf{H}(x-y)\geq\|u-v\|_{\mathsf{H}}^{2}$ . Consequently, $\|u-v\|_{\mathsf{H}}\leq\|x-y\|_{\mathsf{H}}$ .

The proof of this lemma is standard, so we omit the details and directly jump to the proof of Theorem 6.
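For reference, the standard argument can be sketched as follows, assuming the scaled proximal operator is defined by $\text{prox}_{\chi}^{\mathsf{H}}(x)=\arg\min_{u}\{\chi(u)+\frac{1}{2}\|u-x\|_{\mathsf{H}}^{2}\}$ (a positive step-size factor in front of the quadratic term, if present in the paper's definition, does not change the argument).

```latex
\begin{aligned}
&\mathsf{H}(x-u)\in\partial\chi(u),\qquad \mathsf{H}(y-v)\in\partial\chi(v)
  &&\text{(optimality conditions)}\\
&(u-v)^{T}\bigl[\mathsf{H}(x-u)-\mathsf{H}(y-v)\bigr]\geq 0
  &&\text{(monotonicity of }\partial\chi\text{)}\\
&\Longrightarrow\ (u-v)^{T}\mathsf{H}(x-y)\geq\|u-v\|_{\mathsf{H}}^{2},\\
&\|u-v\|_{\mathsf{H}}^{2}\leq(u-v)^{T}\mathsf{H}(x-y)\leq\|u-v\|_{\mathsf{H}}\,\|x-y\|_{\mathsf{H}}
  &&\text{(Cauchy--Schwarz in the }\mathsf{H}\text{ inner product)}.
\end{aligned}
```

Dividing the last line by $\|u-v\|_{\mathsf{H}}$ (when it is nonzero) gives the non-expansiveness stated in the lemma.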

Proof of Theorem 6.

By virtue of (35), and (36) with $\mathsf{H}={\mathsf{H}^{(l)}}$ , we have

[TABLE]

where the first inequality uses Lemma 7. Since both ${\mathsf{H}^{(l)}}$ and $G_{l}$ are positive definite from assumption (A1), we denote $\lambda_{1}\geq\lambda_{2}\geq\cdots\geq\lambda_{N}>0$ as the eigenvalues of ${\mathsf{H}^{(l)}}^{-1}G_{l}$ , then

[TABLE]

Here we have used the fact that for two symmetric positive semi-definite matrices $\mathsf{A}$ and $\mathsf{B}$ , if $\mathsf{A}\preceq\mathsf{B}$ , then $\|\mathsf{A}\|_{{\mathsf{H}^{(l)}}}\leq\|\mathsf{B}\|_{{\mathsf{H}^{(l)}}}$ . Indeed, since $\|\mathsf{A}\|_{{\mathsf{H}^{(l)}}}^{2}=\sup_{x}{x^{T}\mathsf{A}^{T}{\mathsf{H}^{(l)}}\mathsf{A}x}/{x^{T}{\mathsf{H}^{(l)}}x}$ , letting $y={\mathsf{H}^{(l)}}^{\frac{1}{2}}x$ , we have $\|\mathsf{A}\|_{{\mathsf{H}^{(l)}}}^{2}=\sup_{y}{y^{T}{\mathsf{H}^{(l)}}^{-\frac{1}{2}}\mathsf{A}^{T}{\mathsf{H}^{(l)}}^{\frac{1}{2}}{\mathsf{H}^{(l)}}^{\frac{1}{2}}\mathsf{A}{\mathsf{H}^{(l)}}^{-\frac{1}{2}}y}/{y^{T}y}$ ; therefore, $\|\mathsf{A}\|_{{\mathsf{H}^{(l)}}}=\|{\mathsf{H}^{(l)}}^{-\frac{1}{2}}\mathsf{A}{\mathsf{H}^{(l)}}^{\frac{1}{2}}\|_{2}=\|\mathsf{A}\|_{2}$ .

Choose $t=\frac{2}{\lambda_{1}+\lambda_{N}}$ in (39) so that it minimizes the right-hand side, and plug it into (38) to get the final result. ∎
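The choice of $t$ in the last step can be checked numerically. Assuming that the right-hand side of (39) is the spectral bound $\max_{i}|1-t\lambda_{i}|$ (the displayed formula is not reproduced here), the sketch below compares $t=\frac{2}{\lambda_{1}+\lambda_{N}}$ with a grid search. The eigenvalues are made-up illustrative values and `contraction_factor` is a hypothetical helper name.

```python
import numpy as np

def contraction_factor(t, eigs):
    """Spectral bound max_i |1 - t * lambda_i| for the preconditioned step."""
    return float(np.max(np.abs(1.0 - t * eigs)))

eigs = np.array([8.0, 3.0, 1.5, 0.5])   # illustrative, lambda_1 >= ... >= lambda_N > 0
t_opt = 2.0 / (eigs[0] + eigs[-1])
kappa = eigs[0] / eigs[-1]

grid = np.linspace(1e-3, 2.0 / eigs[0], 20001)
best_on_grid = min(contraction_factor(t, eigs) for t in grid)

print(contraction_factor(t_opt, eigs))   # equals (kappa - 1) / (kappa + 1)
print((kappa - 1.0) / (kappa + 1.0))
print(best_on_grid)                      # no grid point does better
```

At the optimal step the two extreme eigenvalues give the same magnitude $|1-t\lambda_{1}|=|1-t\lambda_{N}|$ , which is where the factor $\frac{\kappa-1}{\kappa+1}$ comes from.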

*Remark 6** (Comparison with proximal gradient).*

Consider the proximal gradient method for solving (33)

[TABLE]

Comparing it to (35), we see that (35) is a preconditioned version of (40). Indeed, subtracting (36) with $\mathsf{H}=\mathsf{I}$ from (40), we get

[TABLE]

where $\kappa_{G}$ is the condition number of $G_{l}$ . Therefore when $G_{l}$ is ill-conditioned, which is the case in the presence of vacuum due to the nonlinear diffusion, the convergence rate in (41) is much slower than that in (37).
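The effect of preconditioning on the contraction factor can be illustrated on a small synthetic matrix. In the sketch below, $G=D^{1/2}BD^{1/2}$ is a made-up stand-in for an ill-conditioned Hessian (the diagonal $D$ mimics widely varying density values), and the preconditioner is simply $\mathsf{H}=\mathrm{diag}(G)$ ; none of these choices comes from the paper.

```python
import numpy as np

def optimal_rate(eigs):
    """Best contraction factor (kappa - 1) / (kappa + 1) for a spectrum eigs > 0."""
    kappa = eigs.max() / eigs.min()
    return (kappa - 1.0) / (kappa + 1.0)

# Illustrative ill-conditioned matrix: G = D^{1/2} B D^{1/2}, with B well conditioned.
D = np.diag([1.0, 10.0, 100.0, 1000.0])
B = np.eye(4) + 0.1 * (np.eye(4, k=1) + np.eye(4, k=-1))
D_half = np.sqrt(D)
G = D_half @ B @ D_half

# Preconditioner H = diag(G); H^{-1} G has the same spectrum as H^{-1/2} G H^{-1/2}.
H_inv_half = np.diag(1.0 / np.sqrt(np.diag(G)))
rate_plain = optimal_rate(np.linalg.eigvalsh(G))
rate_precond = optimal_rate(np.linalg.eigvalsh(H_inv_half @ G @ H_inv_half))

print(f"unpreconditioned rate: {rate_plain:.4f}")
print(f"preconditioned rate:   {rate_precond:.4f}")
```

The unpreconditioned factor is close to 1 because $\kappa_{G}$ is at least $10^{3}$ here, while the preconditioned factor depends only on the conditioning of $B$ .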

*Remark 7** (Choice of ${\mathsf{H}^{(l)}}$ ).*

In our problems when the energy term $\mathcal{E}$ only contains internal and potential energies, both of which are local in $\rho$ , we directly compute the Hessian of $F$ as $\mathsf{H}$ , since in this case the Hessian is sparse and very cheap to compute. When $\mathcal{E}$ also contains interaction energy, the Hessian of $F$ is dense, and we instead approximate the Hessian of $F$ by replacing the interaction energy with entropy $\int\rho\log\rho\mathrm{d}x$ , and adjusting the parameter in the Fisher information term to approximate the original Hessian. More specifically, for the general case where $F(u)$ is

[TABLE]

we compute the Hessian $\mathsf{H}$ of

[TABLE]

as an approximation of $\nabla^{2}F$ . Here $\tilde{\beta}^{-2}$ is an integer multiple of $\beta^{-2}$ .

We close this subsection by stating the following result: when ${\mathsf{H}^{(l)}}$ is the exact Hessian of $F$ , we obtain local quadratic convergence. We omit the proof, which is standard.

Theorem 8.

Assume further that $\|\nabla^{2}F(x)-\nabla^{2}F(y)\|_{2}\leq L\|x-y\|_{2}$ . If ${\mathsf{H}^{(l)}}=\nabla^{2}F(u^{(l)})$ in (34), then for sufficiently large $l$ , $t_{l}\rightarrow 1$ , and $u^{(l)}$ satisfies

[TABLE]

4. Numerical examples

In this section, we present several numerical examples to demonstrate the accuracy and efficiency of the proposed scheme (26). The stopping criterion in the sub-optimization problem (see line 9 in the Algorithm) is chosen as

[TABLE]

where TOL is set to be $10^{-6}$ unless otherwise specified.

4.1. 1D problem

4.1.1. Heat equation

For the heat equation, we directly choose $\beta=1$ , and let the initial condition be

[TABLE]

In Fig. 1 on the left, we apply (26) on a coarse mesh and compare the solution with the reference solution obtained by an implicit diffusion solver on a fine mesh, and observe good agreement. When $\Delta x$ is sufficiently small, we check the first order accuracy of our scheme by computing the following relative error

[TABLE]

and error with respect to the reference solution

[TABLE]

with decreasing $\tau$ .

4.1.2. Porous medium equation

The porous medium equation

[TABLE]

can be considered as the Wasserstein gradient flow of the energy (2), with ${U}(\rho)=\frac{1}{m-1}\rho^{m}$ and $V=W=0\,$ . A well-known family of exact solutions is given by Barenblatt profiles (cf. [60]), which are densities of the form

[TABLE]

In our tests, we choose $m=2$ , $t_{0}=10^{-3}$ and $C=0.8$ . We plot the evolution of the numerical solution over time in Fig. 2, and we observe good agreement with the exact solution of the form (46), which is shown in dashed curve.
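For readers who want to reproduce a reference curve, the sketch below evaluates the Barenblatt profile in its standard one-dimensional form for $\partial_{t}\rho=\partial_{xx}\rho^{m}$ , namely $\rho(x,t)=(t+t_{0})^{-\frac{1}{m+1}}\bigl(C-\frac{m-1}{2m(m+1)}x^{2}(t+t_{0})^{-\frac{2}{m+1}}\bigr)_{+}^{\frac{1}{m-1}}$ . The normalization of the constants in (46) may differ from this textbook convention, so the formula and the function name `barenblatt` should be read as assumptions.

```python
import numpy as np

def barenblatt(x, t, m=2.0, t0=1e-3, C=0.8):
    """Standard 1D Barenblatt profile for rho_t = (rho^m)_xx (assumed normalization)."""
    s = t + t0
    alpha = 1.0 / (m + 1.0)
    k = (m - 1.0) / (2.0 * m * (m + 1.0))
    core = np.maximum(C - k * x**2 * s ** (-2.0 * alpha), 0.0)
    return s ** (-alpha) * core ** (1.0 / (m - 1.0))

x = np.linspace(-10.0, 10.0, 200001)
dx = x[1] - x[0]
masses = [float(np.sum(barenblatt(x, t)) * dx) for t in (0.0, 0.1, 1.0)]
print(masses)  # the total mass should not change with t
```

The mass check is a quick sanity test of the self-similar scaling: the profile spreads and flattens in time while its integral stays fixed.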

Next, we examine how the entropic regularization affects the solution. In the left plot of Fig. 3, we compare solutions obtained by our scheme with various $\beta^{-1}$ and we observe that near the boundary of the solutions’ support, where a non-smooth transition is expected (see the black dashed curve for the exact solution), our regularized solution is inevitably smoothed out. As $\beta^{-1}$ decreases, the solution improves moderately. On the right, we plot the error between our solution and the exact formula (46):

[TABLE]

As expected, smaller $\beta^{-1}$ leads to better accuracy. However, as the regularization parameter is closely related to the convexity of the problem and thus affects the convergence of the method, one has to strike a balance between the accuracy and efficiency by choosing $\beta^{-1}$ neither too big nor too small.

4.1.3. Nonlinear Fokker-Planck equation

Next, we consider a nonlinear variant of the Fokker-Planck equation, by replacing the linear diffusion with the porous medium type nonlinear diffusion (45):

[TABLE]

When $V$ is a confining drift potential, all solutions approach the unique steady state

[TABLE]

where $C>0$ depends on the mass of the initial data, i.e., denote $M=\int\rho_{0}\mathrm{d}x$ , then $C=\left(\frac{3M}{8}\right)^{2/3}$ , see [25, 21] for a derivation.
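The relation between $C$ and the mass can be verified directly for $m=2$ and $V(x)=\frac{x^{2}}{2}$ . Setting $U^{\prime}(\rho)+V$ constant on the support with $U(\rho)=\frac{1}{m-1}\rho^{m}$ gives a profile of the form $(C-\frac{x^{2}}{4})_{+}$ ; this closed form is derived here rather than copied from the displayed steady state, so it should be treated as an assumption. The mass $M=\frac{1}{8}$ and the function name `steady_state` are illustrative.

```python
import numpy as np

def steady_state(x, mass):
    """Compactly supported steady state (C - x^2/4)_+ for m = 2, V = x^2/2."""
    C = (3.0 * mass / 8.0) ** (2.0 / 3.0)
    return np.maximum(C - x**2 / 4.0, 0.0)

M = 1.0 / 8.0                          # illustrative mass
x = np.linspace(-2.0, 2.0, 400001)
dx = x[1] - x[0]
recovered = float(np.sum(steady_state(x, M)) * dx)
print(recovered)                       # should reproduce M up to quadrature error
```

Integrating $(C-\frac{x^{2}}{4})_{+}$ over its support $|x|\leq 2\sqrt{C}$ gives $\frac{8}{3}C^{3/2}$ , and setting this equal to $M$ recovers $C=\left(\frac{3M}{8}\right)^{2/3}$ .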

In Figure 4, we compute the solutions to the nonlinear Fokker-Planck equation with $V(x)=\frac{x^{2}}{2}$ , $m=2$ , and initial data given by $\rho(x,0)=\frac{1}{8}\left(\frac{1}{\sqrt{2\pi}\sigma}e^{-x^{2}/2\sigma^{2}}+10^{-8}\right)$ . On the left, we plot the evolution of the density $\rho(x,t)$ towards the steady state $\rho_{\infty}(x)$ . On the right, we compute the rate of decay of the corresponding energy (2) as a function of time, observing exponential decay as the solution approaches equilibrium, which is consistent with the analytic results on convergence to equilibrium [25, 18].

4.1.4. Aggregation equation

In this subsection, we consider a nonlocal aggregation equation of the form

[TABLE]

where the interaction kernel $W$ is repulsive at short length scales and attractive at longer distances. In particular, we choose the following kernel with logarithmic repulsion and quadratic attraction

[TABLE]

then it is proved that there exists a unique equilibrium profile [28], given by

[TABLE]

In practice, to avoid evaluation of $W(x)$ at $x=0$ , we set $W(0)$ to equal the average value of $W$ on the cell of width $2h$ centered at 0, i.e., $W(0)=\frac{1}{2h}\int_{-h}^{h}W(x)\mathrm{d}x$ , where we compute this value analytically. (See also [26, 27] for a similar treatment.)
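The cell average has a closed form once the kernel is fixed. The sketch below assumes the kernel $W(x)=\frac{x^{2}}{2}-\ln|x|$ (quadratic attraction with logarithmic repulsion, matching the description above; the displayed formula is not reproduced here), for which $\frac{1}{2h}\int_{-h}^{h}W(x)\mathrm{d}x=\frac{h^{2}}{6}+1-\ln h$ . The function names are illustrative.

```python
import numpy as np

def W(x):
    """Assumed kernel: quadratic attraction plus logarithmic repulsion."""
    return 0.5 * x**2 - np.log(np.abs(x))

def W_at_zero(h):
    """Exact average of W over the cell [-h, h]."""
    return h**2 / 6.0 + 1.0 - np.log(h)

h = 0.05
# Midpoint rule on [0, h]; by symmetry this equals the average over [-h, h].
n = 400000
mid = (np.arange(n) + 0.5) * (h / n)
numeric = float(np.mean(W(mid)))
print(W_at_zero(h), numeric)
```

Because the logarithmic singularity is integrable, the averaged value is finite and can be stored in place of $W(0)$ when assembling the discrete convolution.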

The numerical results are gathered in Fig. 5. On the left, we simulate the solution to the aggregation equation with Gaussian initial data $\rho(x,0)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{x^{2}}{2\sigma^{2}}}+10^{-8}$ at varying times, observing convergence to the equilibrium profile $\rho_{\infty}(x)$ . On the right, we compute the rate of decay of the energy as a function of time, observing exponential decay with the theoretical rate obtained by Carrillo et al. [28].

4.1.5. Derrida-Lebowitz-Speer-Spohn (DLSS) equation

We now consider a DLSS equation

[TABLE]

where $\mathcal{I}(\rho)=\int_{\Omega}|\nabla\log\rho(x)|^{2}\rho(x)\mathrm{d}x$ and $\delta$ is the first variation operator with

[TABLE]

As written, the DLSS equation can be considered as the Wasserstein gradient flow of the functional $\mathcal{E}(\rho)=\int_{\mathbb{R}^{n}}\frac{1}{2}\|\nabla\log\rho(x)\|^{2}\rho(x)+V(x)\rho(x)\mathrm{d}x$ . In practice, we simply replace $\beta^{-2}\tau^{2}$ by $\tau$ and choose $\mathcal{E}(\rho)=\int V(x)\rho(x)\mathrm{d}x$ in (10).

When $V(x)=\frac{x^{2}}{2}$ , the stationary solution $\rho_{\infty}$ has an explicit form

[TABLE]

With double-Gaussian initial condition

[TABLE]

we plot the results in Fig. 6. On the left, one sees the evolution of $\rho$ towards the equilibrium (51); on the right, exponential convergence of the energy $\mathcal{E}(\rho)$ is demonstrated.

Likewise, for a double-well potential $V(x)=10(1-x^{2})^{2}$ , with the same initial condition, we collect the results in Fig. 7. Unlike the previous case, the steady state here has two bumps.

4.2. 2D problem

4.2.1. Aggregation equation

We first consider aggregation equation (49) with attractive-repulsive potentials in two dimensions with interaction kernel

[TABLE]

where $\frac{|x|^{0}}{0}=\ln(|x|)$ . In this case, the repulsion near the origin determines the dimension of the support of the steady state measure, see [2, 17].

In the first example, we choose $a=4$ , $b=2$ , and take the initial data to be a Gaussian

[TABLE]

with mean $x^{0}=(1.25,1.25)$ and variance $\theta=0.2$ . Here the steady state concentrates on a Dirac ring with radius 0.5 centered at $x^{0}$ , recovering analytical results on the existence of a stable Dirac ring equilibrium [8]. We also compare the convergence in the first outer JKO time step of our regularized sequential quadratic programming with that of the un-regularized primal dual method [27] in Fig. 9, and observe much faster convergence for Newton’s method.

In the second example, we consider interaction kernel (52) with different parameters: $a=2$ and $b=0$ and the results are displayed in Fig. 10. We observe that the solution converges to a characteristic function on the disk of radius 1, centered at $x^{0}$ , recovering analytic results on solutions of the aggregation equation with Newtonian repulsion [38, 9].

4.2.2. Aggregation drift equation

We compute solutions of aggregation-drift equations

[TABLE]

where $W(x)=\frac{|x|^{2}}{2}-\ln(|x|)$ and $V(x)=-\frac{1}{4}\ln(|x|)$ . As shown in the analytical results [30, 20], the steady state is a characteristic function on a torus, with inner and outer radius given by $R_{1}=\frac{1}{2}$ , $R_{2}=\sqrt{\frac{5}{4}}$ . The initial condition consists of five Gaussians, which is non-radially symmetric. The evolution of $\rho$ towards equilibrium along with the energy decay in time are displayed in Fig. 11.

4.2.3. Aggregation diffusion equation

Consider the aggregation diffusion equations

[TABLE]

When the interaction kernel $W$ is attractive, the competition between the nonlocal aggregation $\nabla\cdot(\rho\nabla W*\rho)$ and the nonlinear diffusion $\nu\Delta\rho^{m}$ causes solutions to behave differently in various regimes: they either blow up in finite time or exist globally in time, see the survey [16]. In Fig. 12, we take $W(x)=-\frac{e^{-|x|^{2}}}{\pi}$ , $m=3$ , $\nu=0.1$ . The computational domain is chosen as $[-3,3]^{2}$ , and the initial data is $\rho(0,x,y)=\chi_{|x|\leq 2.5,|y|\leq 2.5}$ .

4.2.4. The DLSS model

We close the section by computing a two dimensional DLSS equation

[TABLE]

with $V(x)=\frac{|x|^{2}}{2}$ and an initial condition consisting of four Gaussians. In this case, we do not need a regularization and the Hessian is computed exactly. As seen in Fig. 13, the density $\rho$ converges to the equilibrium $\rho_{\infty}=\frac{1}{\sqrt{2\pi}}e^{-\frac{x^{2}}{2}}$ very rapidly. In the bottom center plot, we also compare a slice of the steady state computed via our method with the exact equilibrium, and observe a good match.

5. Discussion

In this paper, we propose a variational time discretization scheme for Wasserstein gradient flows. The scheme applies a quadratic approximation of the Wasserstein-2 metric and introduces a Fisher information regularization into each iteration. On discrete grids, this regularization term helps the gradient flow path maintain positivity during the evolution and further improves the convexity of the variational problem.

Acknowledgement: WL was partially supported by AFOSR MURI FA9550-18-1-0502. JL was partially supported by NSF under grant DMS-1454939. LW was partially supported by NSF grant DMS-1903420 and NSF CAREER grant DMS-1846854. The authors are grateful to the support from KI-Net (NSF grant RNMS-1107444) and UMN-Math Visitors Program to facilitate the collaboration.

Bibliography

[1] R. Bailo, J. A. Carrillo, and J. Hu. Fully discrete positivity-preserving and energy-dissipative schemes for nonlinear nonlocal equations with a gradient flow structure. arXiv preprint, 2018.
[2] D. Balagué, J. A. Carrillo, T. Laurent, and G. Raoul. Dimensionality of local minimizers of the interaction energy. Arch. Ration. Mech. Anal., 209(3):1055–1088, 2013.
[3] A. B. T. Barbaro, J. A. Cañizo, J. A. Carrillo, and P. Degond. Phase transitions in a kinetic flocking model of Cucker-Smale type. Multiscale Model. Simul., 14(3):1063–1088, 2016.
[4] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183–202, 2009.
[5] J.-D. Benamou and Y. Brenier. A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. Numer. Math., 84:375–393, 2000.
[6] J.-D. Benamou, G. Carlier, and M. Laborde. An augmented Lagrangian approach to Wasserstein gradient flows and applications. ESAIM: Proceedings and Surveys, 54:1–17, 2016.
[7] J.-D. Benamou and Y. Brenier. A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. Numer. Math., 84(3):375–393, 2000.
[8] A. L. Bertozzi, T. Kolokolnikov, H. Sun, D. Uminsky, and J. von Brecht. Ring patterns and their bifurcations in a nonlocal model of biological swarms. Commun. Math. Sci., 13(4):955–985, 2015.
