The Bayesian update: variational formulations and gradient flows

Nicolas Garcia Trillos; Daniel Sanz-Alonso

arXiv:1705.07382·math.ST·November 5, 2018

TL;DR

This paper explores the variational and gradient flow perspectives of Bayesian updates, introducing new tools for analyzing convergence and proposing novel MCMC proposal strategies based on these insights.

Contribution

It formalizes the connection between Bayesian posteriors, variational functionals, and gradient flows, and introduces a criterion for metric choice in Riemannian MCMC methods.

Findings

01. Convergence rates of the flows are bounded in terms of the geodesic convexity of the functionals.

02. The gradient flows lead to nonlinear diffusions with the posterior as invariant distribution.

03. A criterion is proposed for metric selection in Riemannian MCMC.

Equations (240)

J_{\mbox{\tiny{\rm KL}}}\bigl{(}q(u)\bigr{)}=D_{\mbox{\tiny{\rm KL}}}\bigl{(}q(u)\|p(u)\bigr{)}-\int\log L(y|u)q(u)du.

\pi=e^{-\Psi}\,vol_{g},\qquad\mu\propto e^{-\phi}\,\pi.

\displaystyle\begin{split}\mathcal{P}_{2}(\mathcal{M})&:=\Bigl{\{}\nu\in\mathcal{P}(\mathcal{M})\>:\>\int_{\mathcal{M}}d^{2}(x,x_{0})d\nu(x)<\infty,\,\,\text{ for some }x_{0}\in\mathcal{M}\Bigr{\}},\\ \mathcal{W}_{2}^{2}(\nu_{1},\nu_{2})&:=\inf_{\alpha}\int_{\mathcal{M}\times\mathcal{M}}d(x,y)^{2}d\alpha(x,y),\quad\nu_{1},\nu_{2}\in\mathcal{P}_{2}(\mathcal{M}).\end{split}

D_{\mbox{\tiny{\rm KL}}}(\nu_{1}\|\nu_{2}):=\begin{cases}\int_{\mathcal{M}}\frac{d\nu_{1}}{d\nu_{2}}(u)\log\Bigl(\frac{d\nu_{1}}{d\nu_{2}}(u)\Bigr)d\nu_{2}(u),&\nu_{1}\ll\nu_{2},\\ \infty,&\text{otherwise},\end{cases}

D_{\mbox{\tiny{\rm$\chi^{2}$}}}(\nu_{1}\|\nu_{2}):=\begin{cases}\int_{\mathcal{M}}\Bigl{(}\frac{d\nu_{1}}{d\nu_{2}}(u)-1\Bigr{)}^{2}d\nu_{2}(u),&\nu_{1}\ll\nu_{2},\\ \infty,&\text{otherwise};\end{cases}

J(\nu):=\int_{\mathcal{M}}h(u)\,d\nu(u),

D^{\mu}(f)=\begin{cases}\int_{\mathcal{M}}\lVert\nabla_{g}f(u)\rVert^{2}\,d\mu(u),&f\in L^{2}(\mathcal{M},\mu)\cap H^{1}(\mathcal{M}),\\ +\infty,&\text{otherwise}.\end{cases}

E\bigl{(}\gamma(t)\bigr{)}\leq(1-t)E(x_{0})+tE(x_{1})-\lambda\frac{t(1-t)}{2}d_{X}^{2}(x_{0},x_{1}),\quad\forall t\in[0,1].

\displaystyle\begin{cases}\dot{x}(t)&=\,\,-\nabla E\bigl{(}x(t)\bigr{)},\quad t\geq 0,\\ x(0)&=\,\,x_{0}.\end{cases}

E(x_{0})=E\bigl{(}x(t)\bigr{)}+\frac{1}{2}\int_{0}^{t}\left|\dot{x}(r)\right|^{2}dr+\frac{1}{2}\int_{0}^{t}\left|\nabla E\bigl{(}x(r)\bigr{)}\right|^{2}dr,\quad t>0.

|\dot{x}(t)|:=\lim_{s\rightarrow t}\frac{d_{X}\bigl{(}x(t),x(s)\bigr{)}}{|s-t|},

|\nabla E|(x):=\limsup_{y\to x}\frac{\bigl{(}E(x)-E(y)\bigr{)}^{+}}{d_{X}(x,y)}.

\mu(du)=\frac{1}{Z}\exp\bigl{(}-\phi(u)\bigr{)}\pi(du)

J_{\mbox{\tiny{\rm KL}}}(\nu):=D_{\mbox{\tiny{\rm KL}}}(\nu\|\pi)+F_{\mbox{\tiny{\rm KL}}}(\nu;\phi),\quad\nu\in\mathcal{P}(\mathcal{M}),

F_{\mbox{\tiny{\rm KL}}}(\nu;\phi):=\int_{\mathcal{M}}\phi(u)\,d\nu(u),

D_{\mbox{\tiny{\rm KL}}}(\cdot\|\mu)=J_{\mbox{\tiny{\rm KL}}}(\cdot)+\log Z.

J_{\mbox{\tiny{\rm$\chi^{2}$}}}(\nu):=D_{\mbox{\tiny{\rm$\chi^{2}$}}}(\nu\|\pi)+F_{\mbox{\tiny{\rm$\chi^{2}$}}}(\nu;\phi),\quad\nu\in\mathcal{P}(\mathcal{M}),

F_{\mbox{\tiny{\rm$\chi^{2}$}}}(\nu;\phi):=\int_{\mathcal{M}}\tilde{\phi}(u)d\nu(u),\quad\tilde{\phi}=g\bigl{(}\exp(\phi(u))\bigr{)},\quad g(t):=t-1,\quad t>0.

D^{\mu}(f)=\int_{\mathcal{M}}\lVert\nabla_{g}f(u)\rVert^{2}\,d\mu(u),

\text{Ric}_{g}(v,v)+\text{Hess}_{g}\Psi(v,v)+\text{Hess}_{g}\phi(v,v)\geq\lambda,\quad\forall x\in\mathcal{M},\ \forall v\in T_{x}\mathcal{M}\text{ with }g(v,v)=1,

\begin{split}D_{\mbox{\tiny{\rm KL}}}(\mu_{t}\|\mu)&\leq e^{-\lambda t}D_{\mbox{\tiny{\rm KL}}}(\mu_{0}\|\mu),\quad t\geq 0,\\ \mathcal{W}_{2}(\mu_{t},\mu)^{2}&\leq\lambda e^{-\lambda t}D_{\mbox{\tiny{\rm KL}}}(\mu_{0}\|\mu),\quad t\geq 0.\end{split}

\displaystyle\begin{split}J_{\mbox{\tiny{\rm$\chi^{2}$}}}(\mu_{t})-J_{\mbox{\tiny{\rm$\chi^{2}$}}}(\mu)&\leq e^{-\lambda t}\bigl{(}J_{\mbox{\tiny{\rm$\chi^{2}$}}}(\mu_{0})-J_{\mbox{\tiny{\rm$\chi^{2}$}}}(\mu)\bigr{)},\quad t\geq 0,\\ \mathcal{W}_{2}(\mu_{t},\mu)^{2}&\leq\lambda e^{-\lambda t}\bigl{(}J_{\mbox{\tiny{\rm$\chi^{2}$}}}(\mu_{0})-J_{\mbox{\tiny{\rm$\chi^{2}$}}}(\mu)\bigr{)},\quad t\geq 0.\end{split}

\lVert f\rVert_{\mu}^{2}\leq\frac{1}{\lambda}D^{\mu}(f).

D^{\mu}\bigl(tf_{0}+(1-t)f_{1}\bigr)+t(1-t)D^{\mu}(f_{0}-f_{1})=tD^{\mu}(f_{0})+(1-t)D^{\mu}(f_{1}),

\lvert tv_{0}+(1-t)v_{1}\rvert^{2}+t(1-t)\lvert v_{0}-v_{1}\rvert^{2}=t\lvert v_{0}\rvert^{2}+(1-t)\lvert v_{1}\rvert^{2},\quad\forall v_{0},v_{1}\in V,\ \forall t\in[0,1].

D^{\mu}\bigl{(}tf_{0}+(1-t)f_{1}\bigr{)}+\lambda t(1-t)\lVert f_{0}-f_{1}\rVert_{\mu}^{2}\leq tD^{\mu}(f_{0})+(1-t)D^{\mu}(f_{1}),

\lVert f\rVert_{\mu}^{2}\leq\frac{1}{\lambda}D^{\mu}(f)

f_{0}:=\frac{1}{r}f^{-},\quad f_{1}:=\frac{1}{r}f^{+},\quad t=1/2.

-\Delta_{g}^{\mu}f:=-\frac{1}{Z}\text{div}_{g}\bigl(e^{-\phi-\Psi}\nabla_{g}f\bigr),

\lambda_{2}:=\min_{f\in L^{2}(\mathcal{M},\mu)}\frac{D^{\mu}(f)}{\lVert f-f_{\mu}\rVert_{\mu}^{2}},

Full text

The Bayesian update: variational formulations and gradient flows

Nicolas Garcia Trillos and Daniel Sanz-Alonso

Department of Statistics, University of Wisconsin Madison

Department of Statistics, University of Chicago

Abstract

The Bayesian update can be viewed as a variational problem by characterizing the posterior as the minimizer of a functional. The variational viewpoint is far from new and is at the heart of popular methods for posterior approximation. However, some of its consequences seem largely unexplored. We focus on the following one: defining the posterior as the minimizer of a functional gives a natural path towards the posterior by moving in the direction of steepest descent of the functional. This idea is made precise through the theory of gradient flows, which brings new tools to the study of Bayesian models and algorithms. Since the posterior may be characterized as the minimizer of different functionals, several variational formulations may be considered. We study three of them and their three associated gradient flows. We show that, in all cases, the rate of convergence of the flows to the posterior can be bounded by the geodesic convexity of the functional to be minimized. Each gradient flow naturally suggests a nonlinear diffusion with the posterior as invariant distribution. These diffusions may be discretized to build proposals for Markov chain Monte Carlo (MCMC) algorithms. By construction, the diffusions are guaranteed to satisfy a certain optimality condition, and rates of convergence are given by the convexity of the functionals. We use this observation to propose a criterion for the choice of metric in Riemannian MCMC methods.

MSC classes: 62F15, 49N99

Keywords: Gradient flows, Wasserstein Space, Convexity, Riemannian MCMC

1 Introduction

In this paper we revisit the old idea of viewing the posterior as the minimizer of an energy functional. The use of variational formulations of Bayes rule seems to have been largely focused on one of its methodological benefits: restricting the minimization to a subclass of measures is the backbone of variational Bayes methods for posterior approximation. Our aim is to bring attention to two other theoretical and methodological benefits, and to study in some detail one of these: namely, that each variational formulation suggests a natural path, defined by a gradient flow, towards the posterior. We use this observation to propose a criterion for the choice of metric in Riemannian MCMC methods.

Let us recall informally a variational formulation of Bayes rule. Given a prior $p(u)$ on an unknown parameter $u$ and a likelihood function $L(y|u),$ the posterior $p(u|y)\propto L(y|u)p(u)$ can be characterized Zellner (1988) as the distribution $q^{*}(u)$ that minimizes

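$$J_{\mbox{\tiny{\rm KL}}}\bigl(q(u)\bigr)=D_{\mbox{\tiny{\rm KL}}}\bigl(q(u)\|p(u)\bigr)-\int\log L(y|u)\,q(u)\,du.$$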

Indeed, minimizing $J_{\mbox{\tiny{\rm KL}}}\bigl{(}q(u)\bigr{)}$ is equivalent to minimizing the Kullback-Leibler divergence $D_{\mbox{\tiny{\rm KL}}}(q(u)\|p(u|y)),$ and so clearly the minimizer is $q^{*}(u)=p(u|y).$ Other variational formulations may be considered by minimizing, for instance, other divergences rather than Kullback-Leibler. In this paper we consider three variational formulations of the Bayesian update. The first two characterize the posterior measure as the minimizer of functionals $J_{\mbox{\tiny{\rm KL}}}$ or $J_{\mbox{\tiny{\rm$ \chi^{2} $}}}$ constructed by penalizing deviations from the prior measure in Kullback-Leibler or $\chi^{2}$ divergence —definitions of these functionals and divergences are given in equations (3.2), (3.5), (2.2), (2.3). The third one characterizes the posterior density as the minimizer of the Dirichlet energy $D^{\mu}$ —see (2.4).
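As a quick numerical illustration of this equivalence, the following sketch works on a finite parameter space, where all integrals reduce to sums. The five-point prior and likelihood values are hypothetical and chosen only for illustration, and the helper names `kl` and `j_kl` are ours. It checks that $J_{\mbox{\tiny{\rm KL}}}$ evaluated at the posterior equals $-\log Z$, with $Z$ the evidence, and that randomly drawn distributions never achieve a smaller value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discrete parameter space with 5 states (illustration only).
prior = np.array([0.1, 0.2, 0.4, 0.2, 0.1])          # p(u)
likelihood = np.array([0.05, 0.3, 0.5, 0.1, 0.05])   # L(y|u) for a fixed y

evidence = np.sum(likelihood * prior)                 # Z = sum_u L(y|u) p(u)
posterior = likelihood * prior / evidence             # p(u|y) by Bayes' rule


def kl(q, p):
    """Kullback-Leibler divergence between strictly positive pmfs."""
    return float(np.sum(q * np.log(q / p)))


def j_kl(q):
    """J_KL(q) = D_KL(q || prior) - sum_u log L(y|u) q(u)."""
    return kl(q, prior) - float(np.sum(np.log(likelihood) * q))


# J_KL(q) = D_KL(q || posterior) - log Z, so the gap to the posterior
# value is a KL divergence and hence nonnegative.
candidates = rng.dirichlet(np.ones(5), size=1000)
gaps = np.array([j_kl(q) - j_kl(posterior) for q in candidates])

print("J_KL(posterior):", j_kl(posterior), " -log Z:", -np.log(evidence))
print("smallest gap over random q:", gaps.min())
```

Each entry of `gaps` is the Kullback-Leibler divergence from the candidate to the posterior, which is why none of them can be negative.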

Why is it useful to view the posterior $p(u|y)$ as the minimizer of an energy? We list below three advantages of this viewpoint, the third of which will be the focus of our paper.

1. The variational formulation provides a natural way to approximate the posterior by restricting the minimization problem to distributions $q(u)$ satisfying some computationally desirable property. For instance, variational Bayes methods often restrict the minimization to $q(u)$ with product structure Attias (1999), Wainwright and Jordan (2008), Fox and Roberts (2012). A similar idea is studied in Pinski et al. (2015), where $q(u)$ is restricted to a class of Gaussian distributions. An iterative variational procedure that progressively improves the posterior approximation by enriching the family of distributions was introduced in Guo et al. (2016).

2. If the prior $p_{\varepsilon_{n}}(u)$ or the likelihood $L_{\varepsilon_{n}}(y|u)$ depends on a parameter $\varepsilon_{n},$ then the variational formulation allows one to show large $n$ convergence of the posteriors $p_{\varepsilon_{n}}(u|y)$ by establishing the $\Gamma$-convergence of the associated energies. This method of proof has been employed by the authors in Garcia Trillos and Sanz-Alonso (2018), Garcia Trillos et al. (2017b) to analyze the large-data consistency of graph-based Bayesian semi-supervised learning.

3. Each variational formulation gives a natural path, defined by a gradient flow, towards the posterior. These flows can be thought of as time-parameterized curves in the space of probability measures, converging in the large-time limit towards the posterior.

In this paper we study three gradient flows associated with the variational formulations defined by minimization of the functionals $J_{\mbox{\tiny{\rm KL}}}$, $J_{\mbox{\tiny{\rm$\chi^{2}$}}}$, and $D^{\mu}.$ For intuition, we recall that a gradient flow in Euclidean space defines a curve whose tangent always points in the direction of steepest descent of a given function (see equation (2.5)). In the same fashion, a gradient flow in a more general metric space can be thought of as a curve on said space that always points in the direction of steepest descent of a given functional Ambrosio et al. (2008). In Euclidean space the direction of steepest descent is naturally defined as that in which a Euclidean infinitesimal increment leads to the largest decrease in the value of the function. In a general metric space the direction of steepest descent is the one in which an infinitesimal increment defined in terms of the distance leads to the largest decrease in the value of the functional. In this paper we study:

(i) The gradient flows defined by $J_{\mbox{\tiny{\rm KL}}}$ and $J_{\mbox{\tiny{\rm$\chi^{2}$}}}$ in the space of probability measures with finite second moments endowed with the Wasserstein distance (definitions are given in (2.1)). By construction, these flows give curves of probability measures that evolve following the direction of steepest descent of $J_{\mbox{\tiny{\rm KL}}}$ and $J_{\mbox{\tiny{\rm$\chi^{2}$}}}$ in Wasserstein distance, converging to the posterior measure in the large-time limit.

(ii) The gradient flow defined by the Dirichlet energy $D^{\mu}$ in the space of square integrable densities endowed with the $L^{2}$ distance. By construction, this flow gives a curve of densities in $L^{2}$ that evolves following the direction of steepest descent of $D^{\mu}$ in $L^{2}$ distance, converging to the posterior density in the large-time limit. Interestingly, the curve of measures associated with these densities is exactly the same as the curve defined by the $J_{\mbox{\tiny{\rm KL}}}$ flow on Wasserstein space Jordan et al. (1998).

A question arises: what is the rate of convergence of these flows to the posterior? The answer is, to a large extent, provided by the theory of optimal transport and gradient flows Ambrosio et al. (2008), Villani (2003), Villani (2008), Santambrogio (2015). We will review and provide a unified account of these results in the main body of the paper, section 3. In the remainder of this introduction we discuss how rates of convergence may be studied in terms of convexity of functionals, and how these rates may be used as a guide for the choice of proposals for MCMC methods.

Rates of convergence of the flows hinge on the convexity level of each of the functionals $J_{\mbox{\tiny{\rm KL}}},$ $J_{\mbox{\tiny{\rm$ \chi^{2} $}}},$ and $D^{\mu}.$ Recalling the Euclidean case may be helpful: gradient descent on a highly convex function will lead to fast convergence to the minimizer. What is, however, a sensible notion of convexity for functionals defined over measures or densities? Our presentation highlights that the notion of geodesic (or displacement) convexity McCann (1997) nicely unifies the theory: it guarantees the existence and uniqueness of the three gradient flows and it also provides a bound on their rate of convergence to the posterior. In the $L^{2}$ setting one can show that positive geodesic convexity is equivalent to the posterior satisfying a Poincaré inequality, and also to the existence of a spectral gap —see subsection 3.2.2. On the other hand, the geodesic convexity of $J_{\mbox{\tiny{\rm KL}}}$ and $J_{\mbox{\tiny{\rm$ \chi^{2} $}}}$ in Wasserstein space is determined by the Ricci curvature of the manifold, as well as by the likelihood function and prior density —see (3.2), (3.3), (3.5), (3.6). Typically the three functionals $J_{\mbox{\tiny{\rm KL}}},$ $J_{\mbox{\tiny{\rm$ \chi^{2} $}}},$ and $D^{\mu}$ will have different levels of geodesic convexity, and establishing a sharp bound on each of them may not be equally tractable.
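The Euclidean intuition can be sketched numerically. The example below is an illustration under our own assumptions, not a computation from the paper: it discretizes the gradient flow of a quadratic $E(x)=\frac{1}{2}x^{T}Ax$ with explicit Euler steps and compares the distance to the minimizer $x^{*}=0$ with the bound $e^{-\lambda t}\lvert x_{0}\rvert$, where $\lambda$ is the smallest eigenvalue of the (hypothetical) matrix $A$, i.e. the convexity level of $E$.

```python
import numpy as np

# Hypothetical quadratic energy E(x) = 0.5 * x^T A x with minimizer x* = 0.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
lam = np.linalg.eigvalsh(A).min()        # convexity level lambda > 0


def grad_E(x):
    """Gradient of the quadratic energy."""
    return A @ x


step = 1e-3                              # small enough that I - step * A >= 0
n_steps = 5000
x = np.array([2.0, -1.0])
x0_norm = np.linalg.norm(x)

times, dists = [0.0], [x0_norm]
for k in range(1, n_steps + 1):
    x = x - step * grad_E(x)             # explicit Euler step of x' = -grad E(x)
    times.append(k * step)
    dists.append(np.linalg.norm(x))      # distance to the minimizer

times = np.array(times)
dists = np.array(dists)
bound = np.exp(-lam * times) * x0_norm   # exponential rate set by lambda

print("lambda:", lam)
print("final distance:", dists[-1], " bound:", bound[-1])
```

A larger $\lambda$ tightens the bound, which is the finite-dimensional analogue of the statement that a more convex functional yields faster convergence of the flow.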

The theory of gradient flows and optimal transport gives, for each of the flows, an associated Fokker-Planck partial differential equation (PDE) that governs the evolution of densities Ohta and Takatsu (2011), Santambrogio (2015). Such PDEs are typically costly to discretize if the parameter space is of moderate or high dimension, but they may be used in small-dimensional problems as a way to define tempering schemes. Here we do not explore this idea any further. Instead, we focus on the (nonlinear) diffusion processes associated with the PDEs. These diffusions are Langevin-type stochastic differential equations, whose evolving densities satisfy the Fokker-Planck equations. By construction, the invariant distribution of each of these diffusions is the sought posterior, and a bound on the rate of convergence of the diffusions to the posterior is given by the geodesic convexity of the corresponding functional. The gradient flow perspective automatically gives a sense in which the diffusions are optimal: the associated densities move locally (in Wasserstein or $L^{2}$ sense) in the direction of steepest descent of the functional. From this it immediately follows, for instance, that the law of a standard Langevin diffusion in Euclidean space evolves locally in Wasserstein space in the direction that minimizes Kullback-Leibler, and that it also evolves locally in $L^{2}$ in the direction that minimizes the Dirichlet energy.

The MCMC methodology allows one to use a proposal based on a discretization of a diffusion, combined with an accept-reject mechanism to remove the discretization bias, to produce correlated posterior samples in the large-time limit. Heuristically, the rate of convergence of the un-discretized diffusion may guide the choice of proposal. Proposals based on Langevin diffusions were first suggested in Besag (1994), and the exponential ergodicity of the resulting algorithms was analysed in Roberts and Tweedie (1996). The paper Girolami and Calderhead (2011) considered changing the metric on the parameter space in order to accelerate MCMC algorithms by taking into account the geometric structure that the posterior defines in the parameter space. This led to a new family of Riemannian MCMC algorithms. Our paper is concerned with the study of un-discretized diffusions; the effect of the accept-reject mechanism on rates and ergodicity of MCMC methods will be studied elsewhere. We suggest that a way to guide the choice of metric of Riemannian MCMC methods is to choose the one that leads to a faster rate of convergence of the diffusion under certain constraints. We emphasize that despite working with un-discretized diffusions, our guidance for choice of proposals accounts for the fact that discretization will eventually be needed. Our criterion weeds out choices of metric that lead to diffusions that achieve a fast rate of convergence by merely speeding up the drift. This is crucial, since a larger drift typically leads to a larger discretization error, and therefore to more rejections in the MCMC accept-reject mechanism in order to remove the bias in the discrete chain. This important constraint on the size of the drift seems to have been overlooked in existing continuous-time analyses of MCMC methods.
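A minimal sketch of a Langevin-based proposal with accept-reject correction, assuming a one-dimensional standard Gaussian target in place of a real posterior. The Euler-Maruyama discretization of the Langevin diffusion serves as the proposal, and a Metropolis-Hastings step removes the discretization bias. The target, the step size `h`, and all function names are illustrative assumptions, and the Euclidean metric is used throughout (no Riemannian preconditioning).

```python
import numpy as np


def log_target(u):
    """Unnormalized log density of the target; a standard Gaussian (assumed)."""
    return -0.5 * u**2


def grad_log_target(u):
    """Gradient of the log target, which defines the Langevin drift."""
    return -u


def log_q(u_to, u_from, h):
    """Log density, up to a constant, of the Langevin proposal u_from -> u_to."""
    mean = u_from + h * grad_log_target(u_from)
    return -((u_to - mean) ** 2) / (4.0 * h)   # proposal variance is 2h


def mala(n_samples, h, u0=3.0, seed=0):
    """Metropolis-adjusted Langevin chain started at u0."""
    rng = np.random.default_rng(seed)
    u = u0
    samples = np.empty(n_samples)
    n_accept = 0
    for i in range(n_samples):
        # Euler-Maruyama step of dU = grad log target(U) dt + sqrt(2) dW.
        prop = u + h * grad_log_target(u) + np.sqrt(2.0 * h) * rng.standard_normal()
        # Metropolis-Hastings log acceptance ratio.
        log_alpha = (log_target(prop) + log_q(u, prop, h)
                     - log_target(u) - log_q(prop, u, h))
        if np.log(rng.uniform()) < log_alpha:
            u = prop
            n_accept += 1
        samples[i] = u
    return samples, n_accept / n_samples


samples, acc_rate = mala(n_samples=20000, h=0.5)
print("acceptance rate:", acc_rate)
print("mean and variance after burn-in:", samples[2000:].mean(), samples[2000:].var())
```

Increasing `h` enlarges each move but typically lowers the acceptance rate, which reflects the trade-off between drift size and rejections discussed above.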

In summary, the following points highlight the key elements and common structure of the variational formulations of the Bayesian update and of the study of the associated gradient flows:

• The posterior can be characterized as the minimizer of different functionals on probability measures or densities.

• One can then study the gradient flows of these functionals with respect to a metric on the space of probability measures or densities; the resulting curve is a curve of maximal slope and its endpoint is the posterior.

• The gradient flows are characterized by a Fokker-Planck PDE that governs the evolution of the density of an associated diffusion process.

• By studying the convexity of the functionals (with respect to a given metric) one can obtain rates of convergence of the gradient flows towards the posterior. In particular, the level of convexity determines the speed of convergence of the densities of the associated diffusion process towards the posterior, and hence can be used as a criterion to guide the choice of proposals for MCMC methods; here we emphasize that care must be taken when comparing different diffusions if a higher speed of convergence is at the cost of a more expensive discretization.

The ideas in this paper immediately extend beyond the Bayesian interpretation stressed here to any application (e.g. the study of conditioned diffusions) where a measure of interest is defined in terms of a reference measure and a change of measure. Also, we consider only Kullback-Leibler and $\chi^{2}$ prior penalizations to define the functionals $J_{\mbox{\tiny{\rm KL}}}$ and $J_{\mbox{\tiny{\rm$ \chi^{2} $}}}$ , but it would be possible to extend the analysis to the family of $m$ -divergences introduced in Ohta and Takatsu (2011). Kullback-Leibler and $\chi^{2}$ prior penalization correspond to $m\to 1$ and $m=2$ within this family. In what follows we point out some of the features of the different functionals and gradient flows that we consider in this paper.

1.1 Comparison of Functionals and Flows

We now provide a comparison of the three choices of functionals that we consider.

1. The two gradient flows in Wasserstein space (arising from the functionals $J_{\mbox{\tiny{\rm KL}}}$ and $J_{\mbox{\tiny{\rm$\chi^{2}$}}}$) are fundamentally connected with the variational formulation: these variational formulations can be used to define posterior-type measures via a penalization of deviations from the prior and deviations from the data in situations where establishing the existence of conditional distributions by disintegration of measures is technically demanding. On the other hand, the variational formulation for the Dirichlet energy is less natural and requires previous knowledge of the posterior.

2. The precise level of geodesic convexity of the functionals $J_{\mbox{\tiny{\rm KL}}}$ (and $J_{\mbox{\tiny{\rm$\chi^{2}$}}}$) can be computed from point evaluation of the Ricci tensor (of the parameter space) and derivatives of the densities. In particular, knowledge of the underlying metric suffices to compute these quantities. In contrast, establishing a sharp Poincaré inequality (that is, the level of geodesic convexity of the Dirichlet energy in $L^{2}(\mathcal{M},\mu)$) is in practice infeasible, as it effectively requires solving an infinite dimensional optimization problem. It is for this reason, and because of the explicit dependence of the convexity in Wasserstein space on the geometry induced by the manifold metric tensor, that our analysis of the choice of metric in Riemannian MCMC methods is based on the $J_{\mbox{\tiny{\rm KL}}}$ functional (see section 4, and in particular Theorem 4.1).

3. On the flip side of point 2, a Poincaré inequality for the posterior with a not necessarily optimal constant can be established using only tail information. In particular, even when the functional $J_{\mbox{\tiny{\rm KL}}}$ is not geodesically convex in Wasserstein space, one may still be able to obtain a Poincaré inequality (see subsection 5.2 for an example).

4. In contrast to the diffusions arising from the $J_{\mbox{\tiny{\rm KL}}}$ or Dirichlet flows, the stochastic processes arising from the $J_{\mbox{\tiny{\rm$\chi^{2}$}}}$ formulation are inhomogeneous, and hence simulation seems more challenging unless further structure is assumed on the prior measure and likelihood function. Also, the evolution of densities of the gradient flow of $J_{\mbox{\tiny{\rm$\chi^{2}$}}}$ in Wasserstein space is given by a porous medium PDE.

1.2 Outline

The rest of the paper is organized as follows. Section 2 contains some background material on the Wasserstein space, geodesic convexity of functionals, and gradient flows in metric spaces. The core of the paper is section 3, where we study the geodesic convexity, PDEs, and diffusions associated with each of the three functionals $J_{\mbox{\tiny{\rm KL}}},$ $J_{\mbox{\tiny{\rm$ \chi^{2} $}}}$ , and $D^{\mu}.$ In section 4 we consider an application of the theory to the choice of metric in Riemannian MCMC methods Girolami and Calderhead (2011), and in section 5 we illustrate the main concepts and ideas through examples arising in Bayesian formulations of semi-supervised learning Garcia Trillos and Sanz-Alonso (2018), Garcia Trillos et al. (2017b), Bertozzi et al. (2017). We close in section 6 by summarizing our main contributions and pointing to open directions.

1.3 Set-up and Notation

$(\mathcal{M},g)$ will denote a smooth connected $m$-dimensional Riemannian manifold with metric tensor $g$ representing the parameter space. We will denote by $d$ the associated Riemannian distance, and assume that $(\mathcal{M},d)$ is a complete metric space. By the Hopf-Rinow theorem it follows that $\mathcal{M}$ is a geodesic space; we refer to subsection 2.1 for a discussion on geodesic spaces and their relevance here. We denote by $vol_{g}$ the associated volume form. To emphasize the dependence of differential operators on the metric with which $\mathcal{M}$ is endowed, we write $\nabla_{g}$, $\text{div}_{g},$ $\text{Hess}_{g}$ and $\Delta_{g}$ for the gradient, divergence, Hessian, and Laplace-Beltrami operators on $(\mathcal{M},g).$ The reader not versed in Riemannian geometry may focus on the case $\mathcal{M}=\mathbb{R}^{m}$ with the usual metric tensor, in which case $d$ is the Euclidean distance and $dvol_{g}=dx$ is the Lebesgue measure. However, in section 4, where we discuss applications to Riemannian MCMC, we endow $\mathbb{R}^{m}$ with a general metric tensor $g$, and hence familiarity with some notions from differential geometry is desirable.

We denote by $\mathcal{P}(\mathcal{M})$ the space of probability measures on $\mathcal{M}$ (endowed with the Borel $\sigma$ -algebra). We will be concerned with the update of a prior probability measure $\pi\in\mathcal{P}(\mathcal{M})$ —that represents various degrees of belief on the value of a quantity or parameter of interest— into a posterior probability measure $\mu\in\mathcal{P}(\mathcal{M})$ , based on observed data $y$ . We will assume that the prior is defined as a change of measure from $vol_{g},$ and that the posterior is defined as a change of measure from $\pi$ as follows:

$$\pi=e^{-\Psi}\,vol_{g},\qquad\mu\propto e^{-\phi}\,\pi.$$

The data is incorporated in the Bayesian update through the negative log-likelihood function $\phi(\cdot)=\phi(\cdot;y).$
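It may help to spell out how the normalizing constant enters this set-up. The following is a sketch under assumptions not stated at this point of the text: $\phi$ is finite $\pi$-almost everywhere, all integrals below are well defined, and we write $Z:=\int_{\mathcal{M}}e^{-\phi}d\pi$ so that $\mu=Z^{-1}e^{-\phi}\pi$. For any $\nu\ll\pi$ the chain rule for densities gives $\frac{d\nu}{d\mu}=Ze^{\phi}\frac{d\nu}{d\pi}$, and therefore

$$\begin{aligned}D_{\mbox{\tiny{\rm KL}}}(\nu\|\mu)&=\int_{\mathcal{M}}\log\Bigl(\frac{d\nu}{d\mu}(u)\Bigr)d\nu(u)\\ &=\int_{\mathcal{M}}\log\Bigl(\frac{d\nu}{d\pi}(u)\Bigr)d\nu(u)+\int_{\mathcal{M}}\phi(u)\,d\nu(u)+\log Z\\ &=D_{\mbox{\tiny{\rm KL}}}(\nu\|\pi)+\int_{\mathcal{M}}\phi(u)\,d\nu(u)+\log Z.\end{aligned}$$

Under these assumptions, minimizing $D_{\mbox{\tiny{\rm KL}}}(\cdot\|\mu)$ and minimizing $\nu\mapsto D_{\mbox{\tiny{\rm KL}}}(\nu\|\pi)+\int_{\mathcal{M}}\phi\,d\nu$ differ only by the constant $\log Z$, which is the measure-theoretic counterpart of the informal computation in the introduction.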

2 Preliminaries

In this section we provide some background material. The Wasserstein space, and the notion of $\lambda$ -geodesic convexity of functionals are reviewed in subsection 2.1. Gradient flows in metric spaces are reviewed in subsection 2.2.

2.1 Geodesic Spaces and Geodesic Convexity of Functionals

A geodesic space $(X,d_{X})$ is a metric space with a notion of length of curves that is compatible with the metric, and where every two points in the space can be connected by a curve whose length achieves the distance between the points (see Burago et al. (2001) for more details). Geodesic spaces constitute a large family of metric spaces with a rich theory of gradient flows. Here we consider three geodesic spaces. First, the base space $(\mathcal{M},d)$ , i.e. the manifold $\mathcal{M}$ equipped with its Riemannian distance. Second, the space $\mathcal{P}_{2}(\mathcal{M})$ of square integrable Borel probability measures defined on $\mathcal{M}$ , endowed with the Wasserstein distance $\mathcal{W}_{2}$ . Third, the space of functions $f\in L^{2}(\mathcal{M},\mu)$ , with $\int_{\mathcal{M}}fd\mu=1,$ equipped with the $L^{2}(\mathcal{M},\mu)$ norm.

We spell out the definitions of $\mathcal{P}_{2}(\mathcal{M})$ and $\mathcal{W}_{2}$ :

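$$\begin{split}\mathcal{P}_{2}(\mathcal{M})&:=\Bigl\{\nu\in\mathcal{P}(\mathcal{M})\>:\>\int_{\mathcal{M}}d^{2}(x,x_{0})\,d\nu(x)<\infty\text{ for some }x_{0}\in\mathcal{M}\Bigr\},\\ \mathcal{W}_{2}^{2}(\nu_{1},\nu_{2})&:=\inf_{\alpha}\int_{\mathcal{M}\times\mathcal{M}}d(x,y)^{2}\,d\alpha(x,y),\quad\nu_{1},\nu_{2}\in\mathcal{P}_{2}(\mathcal{M}).\end{split}\tag{2.1}$$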

The infimum in the previous display is taken over all transportation plans between $\nu_{1}$ and $\nu_{2}$ , i.e. over $\alpha\in\mathcal{P}(\mathcal{M}\times\mathcal{M})$ with marginals $\nu_{1}$ and $\nu_{2}$ on the first and second factors. The space $(\mathcal{P}_{2}(\mathcal{M}),\mathcal{W}_{2})$ is indeed a geodesic space: geodesics in $(\mathcal{P}_{2}(\mathcal{M}),\mathcal{W}_{2})$ are induced by those in $(\mathcal{M},d)$ . All it takes to construct a geodesic connecting $\nu_{0}\in\mathcal{P}_{2}(\mathcal{M})$ and $\nu_{1}\in\mathcal{P}_{2}(\mathcal{M})$ is to find an optimal transport plan between $\nu_{0}$ and $\nu_{1}$ to determine source locations and target locations, and then transport the mass along geodesics in $\mathcal{M}$ (see Villani (2003) and Santambrogio (2015)).
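On the real line, with equally many equally weighted atoms, an optimal plan pairs the sorted samples, which makes $\mathcal{W}_{2}$ easy to compute. The sketch below relies on that one-dimensional fact and does not extend to general manifolds; the function name and the test data are hypothetical. A brute-force search over all pairings of five atoms is included as a sanity check.

```python
import numpy as np
from itertools import permutations


def w2_empirical_1d(x, y):
    """W_2 between uniform empirical measures on equally many points of R."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equally many atoms")
    # Sorted (monotone) pairing is optimal for the quadratic cost in 1D.
    return float(np.sqrt(np.mean((x - y) ** 2)))


rng = np.random.default_rng(1)
atoms = rng.normal(size=500)

w2_shift = w2_empirical_1d(atoms, atoms + 2.0)             # translate by 2
w2_self = w2_empirical_1d(atoms, rng.permutation(atoms))   # same measure

# Brute-force check over all pairings of 5 atoms.
small_x = rng.normal(size=5)
small_y = rng.normal(size=5)
brute = min(
    float(np.sqrt(np.mean((small_x - small_y[list(p)]) ** 2)))
    for p in permutations(range(5))
)

print("W2 under translation by 2:", w2_shift)
print("sorted pairing vs brute force:", w2_empirical_1d(small_x, small_y), brute)
```

Translating every atom by a constant moves the measure by exactly that constant in $\mathcal{W}_{2}$, and reordering the atoms leaves the measure, and hence the distance, unchanged.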

The space of functions $f\in L^{2}(\mathcal{M},\mu)$ , with $\int_{\mathcal{M}}fd\mu=1,$ equipped with the $L^{2}(\mathcal{M},\mu)$ norm is also a geodesic space, where a constant speed geodesic connecting $f_{0}$ and $f_{1}$ is given by linear interpolation: $t\in[0,1]\mapsto(1-t)f_{0}+tf_{1}$ .

We will consider several functionals $E:X\to\mathbb{R}\cup\{\infty\}$ throughout the paper. They will all be defined in one of our three geodesic spaces —that is, $X=\mathcal{M}$ , $X=\mathcal{P}_{2}(\mathcal{M})$ or $X=L^{2}(\mathcal{M},\mu)$ . Important examples will be, respectively:

1. Functions $\Psi:\mathcal{M}\to\mathbb{R}\cup\{\infty\}.$

2. The Kullback-Leibler and $\chi^{2}$ divergences $D_{\mbox{\tiny{\rm KL}}}(\cdot\|\pi),$ $D_{\mbox{\tiny{\rm$ \chi^{2} $}}}(\cdot\|\pi):\mathcal{P}(\mathcal{M})\to[0,\infty],$ where $\pi$ is a given (prior) measure and, for $\nu_{1},\nu_{2}\in\mathcal{P}(\mathcal{M}),$

[TABLE]

and the potential-type functional $J:\mathcal{P}(\mathcal{M})\to\mathbb{R}\cup\{\infty\}$ given by

[TABLE]

where $h$ is a given potential function.

3. The Dirichlet energy $D^{\mu}:L^{2}(\mathcal{M},\mu)\rightarrow[0,\infty]$ defined by

[TABLE]

Recall that here and throughout, $\nabla_{g}$ denotes the gradient in $(\mathcal{M},g)$ and $\lVert\cdot\rVert$ is the norm on each tangent space $\mathcal{T}_{x}\mathcal{M}$ .

A crucial unifying concept will be that of $\lambda$ -geodesic convexity of functionals. We recall it here:

Definition 2.1.

Let $(X,d_{X})$ be a geodesic space and let $\lambda\in\mathbb{R}$ . A functional $E:X\to\mathbb{R}\cup\{\infty\}$ is called $\lambda$ -geodesically convex provided that for any $x_{0},x_{1}\in X$ there exists a constant speed geodesic $t\in[0,1]\mapsto\gamma(t)\in X$ such that $\gamma(0)=x_{0},$ $\gamma(1)=x_{1},$ and

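$$E\bigl(\gamma(t)\bigr)\leq(1-t)E(x_{0})+tE(x_{1})-\frac{\lambda}{2}t(1-t)d_{X}(x_{0},x_{1})^{2},\quad\forall t\in[0,1].$$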

The following remark characterizes the $\lambda$ -convexity of functionals when $X=\mathcal{M}$ .

Remark 2.2.

Let $\Psi\in C^{2}(\mathcal{M})$ so that we can define its Hessian at all points in $\mathcal{M}$ (see the proof of Theorem 4.1 in the appendix for the definition). Then the following conditions are equivalent:

(i) $\Psi$ is $\lambda$-geodesically convex.

(ii) $\text{Hess}_{g}\,\Psi_{x}(v,v)\geq\lambda$ for all $x\in\mathcal{M}$ and all unit vectors $v\in T_{x}\mathcal{M}.$

If $(\mathcal{M},d)$ is the Euclidean space, (i) and (ii) are also equivalent to:

(iii) $\Psi-\frac{\lambda}{2}\lvert\cdot\rvert^{2}$ is a convex function.

This latter condition is known in the optimization literature as strong convexity.
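To make Remark 2.2 concrete in the Euclidean case, here is a small numerical sketch for a quadratic $\Psi(x)=\frac{1}{2}\langle x,Ax\rangle$ with $A$ symmetric (an illustrative choice; the helper names are assumptions). Its Hessian is $A$, so condition (ii) holds with $\lambda$ equal to the smallest eigenvalue of $A$, and condition (iii) amounts to $A-\lambda I$ being positive semidefinite. The last helper evaluates the slack in the usual $\lambda$-convexity inequality $E(\gamma(t))\leq(1-t)E(x_{0})+tE(x_{1})-\frac{\lambda}{2}t(1-t)d_{X}(x_{0},x_{1})^{2}$ along a straight segment.

```python
import numpy as np

def convexity_constant(A):
    """Largest lambda with Hess Psi >= lambda * I for Psi(x) = 0.5 * x^T A x."""
    return float(np.linalg.eigvalsh(A).min())

def psi(A, x):
    """Quadratic potential Psi(x) = 0.5 * x^T A x."""
    return 0.5 * float(x @ A @ x)

def convexity_gap(A, x0, x1, t, lam):
    """Right-hand side minus left-hand side of the lambda-convexity inequality
    along the segment from x0 to x1; nonnegative when the inequality holds."""
    xt = (1.0 - t) * x0 + t * x1
    rhs = ((1.0 - t) * psi(A, x0) + t * psi(A, x1)
           - 0.5 * lam * t * (1.0 - t) * float(np.sum((x0 - x1) ** 2)))
    return rhs - psi(A, xt)
```

For a quadratic the gap equals $\frac{1}{2}t(1-t)\bigl(\langle d,Ad\rangle-\lambda\lvert d\rvert^{2}\bigr)$ with $d=x_{0}-x_{1}$, so it becomes negative along the bottom eigenvector as soon as $\lambda$ exceeds the smallest eigenvalue.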

2.2 Gradient Flows in Metric Spaces

In this subsection we review the basic concepts needed to define gradient flows in a metric space $(X,d_{X})$ . We follow Chapter 8 of Santambrogio (2015); a standard technical reference is Ambrosio et al. (2008).

To guide the reader, we first recall the formulation of gradient flows in Euclidean space, where $X=\mathbb{R}^{d}$ and $d_{X}$ is the Euclidean metric. Let $E:\mathbb{R}^{d}\rightarrow\mathbb{R}$ be a differentiable function, and consider the equation

$$\dot{x}(t)=-\nabla E\bigl(x(t)\bigr),\quad t>0,\qquad x(0)=x_{0}.\tag{2.5}$$

Then, the solution $x$ to (2.5) is the gradient flow of $E$ in Euclidean space with initial condition $x_{0}$ ; it is a curve whose tangent vector at every time $t$ is the negative of the gradient of $E$ at the point $x(t)$. In order to generalize the notion of a gradient flow to functionals defined on more general metric spaces, and in particular when the metric space has no differential structure, we reformulate (2.5) in integral form by using that $\frac{d}{dt}E\bigl(x(t)\bigr)=\langle\nabla E\bigl(x(t)\bigr),\dot{x}(t)\rangle=-\frac{1}{2}|\dot{x}(t)|^{2}-\frac{1}{2}|\nabla E\bigl(x(t)\bigr)|^{2}$ as follows:

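$$E(x_{0})=E\bigl(x(t)\bigr)+\frac{1}{2}\int_{0}^{t}|\dot{x}(r)|^{2}dr+\frac{1}{2}\int_{0}^{t}|\nabla E\bigl(x(r)\bigr)|^{2}dr,\quad t>0.\tag{2.6}$$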

This identity, known as the energy dissipation equality, is equivalent to (2.5); see Chapter 8 of Santambrogio (2015) for further details and other possible formulations. Crucially, (2.6) involves notions that can be defined in an arbitrary metric space $(X,d_{X})$ : the metric derivative of a curve $t\mapsto x(t)\in X$ is given by

$$|\dot{x}|(t):=\lim_{s\to t}\frac{d_{X}\bigl(x(s),x(t)\bigr)}{|s-t|},$$

and the slope of a functional $E:X\to\mathbb{R}\cup\{\infty\}$ is defined as the map $|\nabla E|:\{x\in X:E(x)<\infty\}\to\mathbb{R}\cup\{\infty\}$ given by

$$|\nabla E|(x):=\limsup_{y\to x}\frac{\bigl(E(x)-E(y)\bigr)_{+}}{d_{X}(x,y)}.$$

The identity (2.6) is the standard way to introduce gradient flows in arbitrary metric spaces. In this paper we consider gradient flows in $L^{2}$ and Wasserstein spaces, where the notion of tangent vector is available. $L^{2}$ has Hilbert space structure, whereas the Wasserstein space can be seen as an infinite dimensional manifold (see Ambrosio et al. (2008), Santambrogio (2015)).
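The energy dissipation equality can be checked numerically on a simple Euclidean example. The sketch below assumes the quadratic energy $E(x)=\frac{a}{2}\lvert x\rvert^{2}$ on $\mathbb{R}^{d}$ (an illustrative choice, not taken from the paper), whose gradient flow is $x(t)=e^{-at}x_{0}$, and evaluates the residual $E(x_{0})-E(x(t))-\frac{1}{2}\int_{0}^{t}\lvert\dot{x}\rvert^{2}-\frac{1}{2}\int_{0}^{t}\lvert\nabla E(x)\rvert^{2}$ with a trapezoidal rule. The residual should vanish up to quadrature error.

```python
import numpy as np

def dissipation_residual(a, x0, t_end, n=20001):
    """Residual of the energy dissipation equality for E(x) = 0.5 * a * |x|^2,
    whose Euclidean gradient flow is x(t) = exp(-a t) x0."""
    x0 = np.asarray(x0, dtype=float)
    ts = np.linspace(0.0, t_end, n)
    xs = np.exp(-a * ts)[:, None] * x0[None, :]   # samples of the curve x(t)
    speed2 = np.sum((-a * xs) ** 2, axis=1)       # |x'(t)|^2
    slope2 = np.sum((a * xs) ** 2, axis=1)        # |grad E(x(t))|^2

    def trapezoid(f):
        # Composite trapezoidal rule on the grid ts.
        return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(ts)))

    def energy(x):
        return 0.5 * a * float(np.sum(x ** 2))

    return (energy(x0) - energy(xs[-1])
            - 0.5 * trapezoid(speed2) - 0.5 * trapezoid(slope2))
```

Along the gradient flow the two integrands coincide, which reflects the fact that the curve moves with speed equal to the slope of the energy.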

3 Variational Characterizations of the Posterior and Gradient Flows

In this section we lay out the main elements of the theory of variational formulations and gradient flows with regard to the Bayesian update. Subsection 3.1 details three variational formulations defined in terms of the functionals $J_{\mbox{\tiny{\rm KL}}}$ , $J_{\mbox{\tiny{\rm$ \chi^{2} $}}}$ and the Dirichlet energy $D^{\mu}$ . Subsection 3.2 studies the geodesic convexity of $J_{\mbox{\tiny{\rm KL}}}$ and $J_{\mbox{\tiny{\rm$ \chi^{2} $}}}$ in Wasserstein space and of $D^{\mu}$ in $L^{2}$ . Finally, subsection 3.3 collects the PDEs that characterize the gradient flows, as well as the corresponding diffusion processes.

3.1 Variational Formulation of the Bayesian Update

The variational formulations of the posterior as the minimizer of $J_{\mbox{\tiny{\rm KL}}}$ and of $J_{\mbox{\tiny{\rm$ \chi^{2} $}}}$ share the same structure and will be outlined first. The variational formulation in terms of the Dirichlet energy will be given below.

3.1.1 The Functionals $J_{\mbox{\tiny{\rm KL}}}$ and $J_{\mbox{\tiny{\rm$ \chi^{2} $}}}$

In mathematical analysis (Jordan and Kinderlehrer (1996)) and probability theory (Dupuis and Ellis (2011)) it is often useful to note that a probability measure $\mu$ defined by

[TABLE]

is the minimizer of the functional

[TABLE]

where

[TABLE]

and the integral is interpreted as $+\infty$ if $\phi$ is not integrable with respect to $\nu$ . In physical terms, the Kullback-Leibler divergence represents an internal energy, $F_{\mbox{\tiny{\rm KL}}}$ represents a potential energy, and the constant $Z$ is known as the partition function. Here we are concerned with a statistical interpretation of equation (3.1), and view it as defining a posterior measure as a change of measure from a prior measure. In this context, the Kullback-Leibler term $D_{\mbox{\tiny{\rm KL}}}(\cdot\|\pi)$ in (3.2) represents a penalization of deviations from prior beliefs, the term $F_{\mbox{\tiny{\rm KL}}}(\nu;\phi)$ penalizes deviations from the data, and the normalizing constant $Z$ represents the marginal likelihood. For brevity, we will henceforth suppress the data $y$ from the negative log-likelihood function $\phi$ , writing $\phi(u)$ instead of $\phi(u;y).$

We remark that the fact that $\mu$ minimizes $J_{\mbox{\tiny{\rm KL}}}$ follows immediately from the identity

[TABLE]

Minimizing $J_{\mbox{\tiny{\rm KL}}}(\cdot)$ or $D_{\mbox{\tiny{\rm KL}}}(\cdot\|\mu)$ is thus equivalent, but the functional $J_{\mbox{\tiny{\rm KL}}}$ makes apparent the roles of the prior and the likelihood.
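The identity behind this equivalence is easy to verify on a finite state space. The sketch below assumes that $J_{\mbox{\tiny{\rm KL}}}(\nu)=D_{\mbox{\tiny{\rm KL}}}(\nu\|\pi)+\int\phi\,d\nu$ and $\mu\propto e^{-\phi}\pi$ with normalizing constant $Z$, in which case $J_{\mbox{\tiny{\rm KL}}}(\nu)=D_{\mbox{\tiny{\rm KL}}}(\nu\|\mu)-\log Z$. The discrete setting, the helper names, and the numerical values used to exercise them are illustrative assumptions.

```python
import numpy as np

def kl(nu, ref):
    """Kullback-Leibler divergence between probability vectors (nu << ref)."""
    mask = nu > 0
    return float(np.sum(nu[mask] * np.log(nu[mask] / ref[mask])))

def posterior(prior, phi):
    """Posterior weights mu_i proportional to exp(-phi_i) * prior_i, and Z."""
    w = np.exp(-phi) * prior
    Z = float(w.sum())
    return w / Z, Z

def j_kl(nu, prior, phi):
    """J_KL(nu) = D_KL(nu || prior) + sum_i phi_i nu_i on a finite state space."""
    return kl(nu, prior) + float(np.sum(phi * nu))
```

Evaluating `j_kl` at the posterior itself returns $-\log Z$, the negative log marginal likelihood, and any other probability vector gives a larger value.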

The posterior $\mu$ also minimizes the functional

[TABLE]

where

[TABLE]

We refer to Ohta and Takatsu (2011) for details. Note that both $J_{\mbox{\tiny{\rm KL}}}$ and $J_{\mbox{\tiny{\rm$ \chi^{2} $}}}$ are defined in terms of the two starting points of the Bayesian update: the prior $\pi$ and the negative log-likelihood $\phi.$ The associated variational formulations suggest a way to define posterior-type measures based on these two ingredients in scenarios where establishing the existence of conditional distributions via disintegration of measures is technically demanding. This appealing feature of the two variational formulations above is not shared by the one described in the next subsection.

3.1.2 The Dirichlet Energy $D^{\mu}$

Let now the posterior $\mu$ be given, and consider the space $L^{2}(\mathcal{M},\mu)$ of functions defined on $\mathcal{M}$ which are square integrable with respect to $\mu$ . Recall the Dirichlet energy

[TABLE]

introduced in equation (2.4). Now, since the measure $\mu$ can be characterized as the probability measure with density $\rho_{\mu}\equiv 1$ a.s. with respect to $\mu,$ it follows that the posterior density $\rho_{\mu}\equiv 1$ is the minimizer of the Dirichlet energy $D^{\mu}$ over probability densities $\rho\in L^{2}(\mathcal{M},\mu)$ with $\int_{\mathcal{M}}\rho d\mu=1.$

3.2 Geodesic Convexity and Functional Inequalities

In this subsection we study the geodesic convexity of the functionals $J_{\mbox{\tiny{\rm KL}}}$ , $J_{\mbox{\tiny{\rm$ \chi^{2} $}}}$ , and $D^{\mu}$ . The geodesic convexity of $J_{\mbox{\tiny{\rm KL}}}$ and $J_{\mbox{\tiny{\rm$ \chi^{2} $}}}$ in Wasserstein space is considered first, and will be followed by the geodesic convexity of $D^{\mu}$ in $L^{2}$ . We will show that the latter is equivalent to the posterior satisfying a Poincaré inequality.

3.2.1 Geodesic Convexity of $J_{\mbox{\tiny{\rm KL}}}$ and $J_{\mbox{\tiny{\rm$ \chi^{2} $}}}$

The next proposition can be found in von Renesse and Sturm (2005) and Sturm (2006). It shows that the convexity of $J_{\mbox{\tiny{\rm KL}}}$ can be determined by the so-called curvature-dimension condition —a condition that involves the curvature of the manifold and the Hessian of the combined change of measure $\Psi+\phi.$ We recall the notation $\pi=e^{-\Psi}vol_{g}$ and $\mu\propto e^{-\phi}\pi$ .

Proposition 3.1.

Suppose that $\Psi,\phi\in C^{2}(\mathcal{M}).$ Then $J_{\mbox{\tiny{\rm KL}}}$ (or $D_{\mbox{\tiny{\rm KL}}}(\cdot\|\mu)$ ) is $\lambda$ -geodesically convex if, and only if,

[TABLE]

where $\text{Ric}_{g}$ denotes the Ricci curvature tensor.

We recall that the Ricci curvature provides a way to quantify the disagreement between the geometry of a Riemannian manifold and that of ordinary Euclidean space. The Ricci tensor is defined as the trace of a map involving the Riemannian curvature (see do Carmo Valero (1992)).

The following example illustrates the geodesic convexity of $D_{\mbox{\tiny{\rm KL}}}(\cdot\|\mu)$ for Gaussian $\mu.$

Example 1.

Let $\mu=N(\theta,\Sigma)$ be a Gaussian measure in $\mathbb{R}^{m}$ (endowed with the Euclidean metric), with $\Sigma$ positive definite. Then $D_{\mbox{\tiny{\rm KL}}}(\cdot\|\mu)$ is $1/\Lambda_{\max}(\Sigma)$ -geodesically convex, where $\Lambda_{\max}(\Sigma)$ is the largest eigenvalue of $\Sigma.$ This follows immediately from the above, since here $\Psi(x)=\frac{1}{2}\langle x-\theta,\Sigma^{-1}(x-\theta)\rangle,$ and the Euclidean space is flat (its Ricci curvature is identically equal to zero). Note that the level of convexity of the functional depends only on the largest eigenvalue of the covariance, but not on the dimension $m$ of the underlying space.

The $\lambda$ -convexity of $J_{\mbox{\tiny{\rm KL}}}$ guarantees the existence of the gradient flow of $J_{\mbox{\tiny{\rm KL}}}$ in Wasserstein space. Moreover, it determines the rate of convergence towards the posterior $\mu.$ Precisely, if $\mu_{0}$ is absolutely continuous with respect to $\mu,$ and if $\lambda>0$ , then the gradient flow $t\in[0,\infty)\mapsto\mu_{t}$ of $J_{\mbox{\tiny{\rm KL}}}$ with respect to the Wasserstein metric starting at $\mu_{0}$ is well defined and we have:

[TABLE]

The second inequality, known as the Talagrand inequality (Villani (2003)), establishes a comparison between Wasserstein geometry and information geometry. It can be established directly by combining the $\lambda$ -geodesic convexity of $J_{\mbox{\tiny{\rm KL}}}$ (for positive $\lambda$ ) with the first inequality. From (3.7) we see that a higher level of convexity of $J_{\mbox{\tiny{\rm KL}}}$ guarantees a faster rate of convergence towards the posterior distribution $\mu$ .
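A one-dimensional Gaussian example makes these rates explicit. The sketch below assumes the target $\mu=N(0,\sigma^{2})$ on $\mathbb{R}$ (so $\lambda=1/\sigma^{2}$ by Example 1) and a Gaussian initial condition $\mu_{0}=N(m_{0},v_{0})$, for which the flow stays Gaussian with mean and variance given in closed form by the Ornstein-Uhlenbeck dynamics $dX=-\sigma^{-2}X\,dt+\sqrt{2}\,dB$. It also assumes that the two inequalities take their standard form, namely $D_{\mbox{\tiny{\rm KL}}}(\mu_{t}\|\mu)\leq e^{-2\lambda t}D_{\mbox{\tiny{\rm KL}}}(\mu_{0}\|\mu)$ and $\mathcal{W}_{2}^{2}(\mu_{t},\mu)\leq\frac{2}{\lambda}D_{\mbox{\tiny{\rm KL}}}(\mu_{t}\|\mu)$; the helper names and parameter values are hypothetical.

```python
import numpy as np

def ou_gaussian_flow(m0, v0, s2, t):
    """Mean and variance at time t of dX = -(X / s2) dt + sqrt(2) dB, X_0 ~ N(m0, v0)."""
    lam = 1.0 / s2
    m = m0 * np.exp(-lam * t)
    v = s2 + (v0 - s2) * np.exp(-2.0 * lam * t)
    return m, v

def kl_gaussian_1d(m, v, s2):
    """D_KL( N(m, v) || N(0, s2) )."""
    return 0.5 * (v / s2 + m ** 2 / s2 - 1.0 - np.log(v / s2))

def w2_squared_gaussian_1d(m, v, s2):
    """Squared 2-Wasserstein distance between N(m, v) and N(0, s2)."""
    return m ** 2 + (np.sqrt(v) - np.sqrt(s2)) ** 2
```

When $v_{0}=\sigma^{2}$ both inequalities hold with equality in this example, which suggests that the constants cannot be improved in general.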

We now turn to the geodesic convexity properties of $J_{\mbox{\tiny{\rm$ \chi^{2} $}}}.$ We recall that $m$ denotes the dimension of the manifold $\mathcal{M}.$ The following proposition can be found in (Ohta and Takatsu, 2011, Theorem 4.1).

Proposition 3.2.

$J_{\mbox{\tiny{\rm$ \chi^{2} $}}}$ is $\lambda$-geodesically convex if and only if both of the following two properties are satisfied:

1. $\text{Ric}_{g}\,(v,v)+\text{Hess}_{g}\,\Psi(v,v)+\frac{1}{m+1}\langle\nabla_{g}\Psi,v\rangle^{2}\geq 0,\quad\forall x\in\mathcal{M},\quad\forall v\in T_{x}\mathcal{M}.$

2. $\phi$ is $\lambda$-geodesically convex as a real valued function defined on $\mathcal{M}$ .

There are two main conclusions we can extract from the previous proposition. First, that condition 1) is only related to the prior distribution $\pi$ whereas condition 2) is only related to the likelihood; in particular, the convexity properties of $J_{\mbox{\tiny{\rm$ \chi^{2} $}}}$ can indeed be studied by studying separately the prior and the likelihood (notice that the proposition gives an equivalence). Secondly, notice that condition 1) is a qualitative property and if it is not met there is no hope that the functional $J_{\mbox{\tiny{\rm$ \chi^{2} $}}}$ has any level of global convexity even when the likelihood function is a highly convex function. In addition, if 1) is satisfied, the convexity of $\phi$ determines completely the level of convexity of $J_{\mbox{\tiny{\rm$ \chi^{2} $}}}$ . These features are markedly different from the ones observed in the Kullback-Leibler case.

As for the functional $J_{\mbox{\tiny{\rm KL}}}$ , one can establish the following functional inequalities, under the assumption of $\lambda$ -geodesic convexity of $J_{\mbox{\tiny{\rm$ \chi^{2} $}}}$ for $\lambda>0$ :

[TABLE]

The above inequalities exhibit the fact that a higher level of convexity of $J_{\mbox{\tiny{\rm$ \chi^{2} $}}}$ guarantees a faster convergence towards the posterior distribution $\mu$ .

3.2.2 Geodesic Convexity of Dirichlet Energy

We now study the geodesic convexity of the Dirichlet energy functional defined in equation (2.4). In what follows we denote by $\|\cdot\|$ the $L^{2}$ norm with respect to $\mu.$ Let us start by recalling the Poincaré inequality.

Definition 3.3.

We say that a Borel probability measure $\mu$ on $\mathcal{M}$ has a Poincaré inequality with constant $\lambda$ if for every $f\in L^{2}(\mathcal{M},\mu)$ satisfying $\int_{\mathcal{M}}fd\mu=0$ we have

[TABLE]

We now show that Poincaré inequalities are directly related to the geodesic convexity of the functional $D^{\mu}$ in the $L^{2}(\mathcal{M},\mu)$ space.

Proposition 3.4.

Let $\lambda$ be a positive real number and let $\mu$ be a Borel probability measure on $\mathcal{M}$ . Then, the measure $\mu$ has a Poincaré inequality with constant $\lambda$ if and only if the functional $D^{\mu}$ is $2\lambda$ -geodesically convex in the space of functions $f\in L^{2}(\mathcal{M},\mu)$ satisfying $\int fd\mu=1$ .

Proof.

First of all we claim that

[TABLE]

for all $f_{0},f_{1}\in L^{2}(\mathcal{M},\mu)$ and every $t\in[0,1]$ . To see this, it is enough to assume that both $D^{\mu}(f_{0})$ and $D^{\mu}(f_{1})$ are finite and then notice that equality (3.9) follows from the easily verifiable fact that for an arbitrary Hilbert space $V$ with induced norm $|\cdot|$ one has

[TABLE]

Now, suppose that $\mu$ has a Poincaré inequality with constant $\lambda$ and consider two functions $f_{0},f_{1}\in L^{2}(\mathcal{M},\mu)$ satisfying $\int_{\mathcal{M}}f_{0}d\mu=\int_{\mathcal{M}}f_{1}d\mu=1$ . Then, (3.9) combined with the Poincaré inequality (taking $f:=f_{0}-f_{1}$ , which integrates to zero) gives:

[TABLE]

which is precisely the $2\lambda$ -geodesic convexity condition for $D^{\mu}$ .

Conversely, suppose that $D^{\mu}$ is $2\lambda$ -geodesically convex in the space of $L^{2}(\mathcal{M},\mu)$ functions that integrate to one. Let $f\in L^{2}(\mathcal{M},\mu)$ be such that $\int_{\mathcal{M}}fd\mu=0$ and without loss of generality assume that $D^{\mu}(f)<\infty$ and that $\|f\|_{\mu}\not=0$ . Under these conditions, the positive and negative parts of $f$ , $f^{+}$ and $f^{-}$ , satisfy $D^{\mu}(f^{+}),D^{\mu}(f^{-})<\infty$ and $\int_{\mathcal{M}}f^{+}d\mu=r=\int_{\mathcal{M}}f^{-}d\mu$ where $r>0$ . The inequality

[TABLE]

is obtained directly from (3.9) and (3.10) applied to

[TABLE]

∎

Remark 3.5.

It is well known that the best Poincaré constant for a measure $\mu$ is equal to the smallest non-trivial eigenvalue of the operator $-\Delta_{g}^{\mu}$ defined formally as

[TABLE]

where $\text{div}_{g}$ and $\nabla_{g}$ are the divergence and gradient operators in $(\mathcal{M},g)$ . This eigenvalue can be written variationally as

[TABLE]

where

[TABLE]

Remark 3.6.

Spectral gaps are used in the theory of MCMC as a means to bound the asymptotic variance of empirical expectations (Kipnis and Varadhan (1986)).

Let us now consider $t\in(0,\infty)\mapsto\mu_{t}$ the flow of $D^{\mu}$ in $L^{2}(\mathcal{M},\mu)$ with some initial condition $\frac{d\mu_{0}}{d\mu}=\rho_{0}$ . It is well known that this flow coincides with that of the functional $J_{\mbox{\tiny{\rm KL}}}$ in Wasserstein space. However, taking the Dirichlet- $L^{2}$ point of view, one can use a Poincaré inequality (i.e. the geodesic convexity of $D^{\mu}$ ) to deduce the exponential convergence of $\mu_{t}$ towards $\mu$ in the $\chi^{2}$ -sense. Indeed, let

[TABLE]

A standard computation then shows that

[TABLE]

In the second equality we have used that $\frac{\partial\rho}{\partial t}=\Delta^{\mu}_{g}\rho,$ as discussed in subsection 3.3 below. Hence by Gronwall’s inequality, see e.g. Teschl (2012),

[TABLE]

3.3 PDEs and Diffusions

Here we describe the PDEs that govern the evolution of densities of the three gradient flows, and the stochastic processes associated with these PDEs. We consider first the flows defined with the functionals $J_{\mbox{\tiny{\rm KL}}}$ and $D^{\mu}$ and then the flow defined by the functional $J_{\mbox{\tiny{\rm$ \chi^{2} $}}}.$

3.3.1 $J_{\mbox{\tiny{\rm KL}}}$ -Wasserstein and $D^{\mu}$ - $L^{2}(\mathcal{M},\mu)$

It was shown in Jordan et al. (1998) (in the Euclidean setting and in the unweighted case $\pi=dx$ ) that the gradient flow of the Kullback-Leibler functional $D_{\mbox{\tiny{\rm KL}}}(\cdot\|\pi)$ in Wasserstein space produces a solution to the Fokker-Planck equation. More generally, under the convexity conditions guaranteeing the existence of the gradient flow $t\in(0,\infty)\mapsto\mu_{t}$ of $D_{\mbox{\tiny{\rm KL}}}(\cdot\|\mu)$ (equivalently of $J_{\mbox{\tiny{\rm KL}}}$ ) starting from $\mu_{0}\in\mathcal{P}(\mathcal{M})$ , the densities

[TABLE]

satisfy (formally) the following Fokker-Planck equations

[TABLE]

Equation (3.12) can be identified as the evolution of the densities (w.r.t. $dvol_{g}$ ) of the diffusion

[TABLE]

where $B^{g}$ denotes a Brownian motion defined on $(\mathcal{M},g)$ and $\nabla_{g}$ is the gradient on $(\mathcal{M},g)$ . Naturally, the $D^{\mu}$ flow in $L^{2}$ has the same associated Fokker-Planck equation (3.11) and diffusion process (3.13).
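In the Euclidean case, and assuming the diffusion takes the usual Langevin form $dX_{t}=-\nabla(\Psi+\phi)(X_{t})\,dt+\sqrt{2}\,dB_{t}$, it can be simulated with an Euler-Maruyama scheme. The sketch below is a minimal illustration for a one-dimensional Gaussian target $N(0,\sigma^{2})$, where the potential gradient is $x/\sigma^{2}$; the function names, step size, and target are hypothetical choices, not the authors' algorithm. For this target the discretized chain has an explicitly computable stationary variance, which shows the bias introduced by the time step.

```python
import numpy as np

def ula_chain(grad_potential, x0, step, n_steps, rng):
    """Euler-Maruyama discretization of dX = -grad_potential(X) dt + sqrt(2) dB.

    x0 is an array of independent starting points; returns the final states.
    """
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - step * grad_potential(x) + np.sqrt(2.0 * step) * noise
    return x

def ula_stationary_variance(s2, step):
    """Exact stationary variance of the discretized chain for the target N(0, s2)."""
    # Fixed point of v -> (1 - step / s2)^2 * v + 2 * step.
    return s2 / (1.0 - step / (2.0 * s2))
```

A usage example: with `rng = np.random.default_rng(0)`, calling `ula_chain(lambda x: x / 0.5, np.zeros(20000), 0.05, 400, rng)` produces samples whose empirical variance should be close to `ula_stationary_variance(0.5, 0.05)`, slightly above the target variance $0.5$. The gap grows with the product of the step size and the Lipschitz constant of the drift.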

3.3.2 $J_{\mbox{\tiny{\rm$ \chi^{2} $}}}$ -Wasserstein

The PDE satisfied (formally) by the densities

[TABLE]

of the $J_{\mbox{\tiny{\rm$ \chi^{2} $}}}$ -Wasserstein flow $t\in(0,\infty)\mapsto\mu_{t}$ is the (weighted) porous medium equation:

[TABLE]

where the weighted Laplacian and divergence are defined formally as

[TABLE]

Consider now the stochastic process $\{u_{t}\}_{t\geq 0}$ formally defined as the solution to the nonlinear diffusion

[TABLE]

where $\tilde{\rho}$ is the solution to (3.14). Let $\theta_{t}$ be the evolution of the densities (with respect to $dvol_{g}$ ) of the above diffusion. Then a formal computation shows that $\theta$ satisfies the Fokker-Planck equation:

[TABLE]

If we let $\beta=\frac{1}{Z}\exp(-\Psi)\theta$ we see, using (3.15), that

[TABLE]

implying that the distributions of the stochastic process (3.16) are those generated by the gradient flow of $J_{\mbox{\tiny{\rm$ \chi^{2} $}}}$ in Wasserstein space.

Remark 3.7.

In contrast with the Langevin diffusion (3.13), the process (3.16) is defined in terms of the solution of the equation satisfied by its densities. In particular, if one wanted to simulate (3.16) one would need to know the solution of (3.14) beforehand.

4 Application: Sampling and Riemannian MCMC

So far we have treated the Riemannian manifold $(\mathcal{M},g)$ as fixed. In this section we take a different perspective and treat the metric $g$ as a free parameter. Precisely, we will now consider a family of gradient flows of the functional $J_{\mbox{\tiny{\rm KL}}}$ with respect to Wasserstein distances induced by different metrics $g$ on the parameter space. We do this motivated by the so-called Riemannian MCMC methods for sampling, where a change of metric in the base space is introduced in order to produce Langevin-type proposals that are adapted to the geometric features of the target, thereby exploring regions of interest and accelerating the convergence of the chain to the posterior. There are different heuristics regarding the choice of metric (see Girolami and Calderhead (2011)), but no principled way to compare different metrics and rank their performance for sampling purposes. With the developments presented in this paper we propose one such principled criterion, as we describe below. We restrict our attention to the case $\mathcal{M}=\mathbb{R}^{m}$ .

Let $g$ be a Riemannian metric tensor on $\mathbb{R}^{m}$ defined via

[TABLE]

where for every $x\in\mathbb{R}^{m}$ , $G(x)$ is an $m\times m$ positive definite matrix. In what follows we identify $g$ with $G$ and refer to both as ‘the metric’, and we use terms such as $g$ -geodesic, $g$ -Wasserstein distance, etc. to emphasize that the notions considered are being constructed using the metric $g$ . Let $d_{g}$ be the distance induced by the metric tensor $g$ and let $vol_{g}$ be the associated volume form. Notice that in terms of the Lebesgue measure and the metric $G$ , we can write

[TABLE]

We use the canonical basis for $\mathbb{R}^{m}$ as global chart for $\mathbb{R}^{m}$ and consider the canonical vector fields $\frac{\partial}{\partial x_{1}},\dots,\frac{\partial}{\partial x_{m}}$ . The Christoffel symbols associated to the Levi-Civita connection of the Riemannian manifold $(\mathbb{R}^{m},g)$ can be written in terms of derivatives of the metric as

[TABLE]

where on the right-hand side, and in what follows, we use Einstein’s summation convention. The proof of the following result is in the Appendix.

Theorem 4.1.

Let $F\in C^{2}(\mathbb{R}^{m})$ and

[TABLE]

The sharp constant $\lambda$ for which $J_{\mbox{\tiny{\rm KL}}}$ (or $D_{\mbox{\tiny{\rm KL}}}(\cdot\|\mu)$ ) is $\lambda$ -geodesically convex in the $g$ -Wasserstein distance is equal to

[TABLE]

where $\text{\emph{Hess}}\,F$ is the usual (Euclidean) Hessian matrix of $F,$ $B$ is the matrix with coordinates

[TABLE]

and $C$ is the matrix with coordinates

[TABLE]

Moreover, for any $a>0$ ,

[TABLE]

Note that $\lambda_{G}$ is a key quantity in evaluating the quality of a metric $G$ in building geometry-informed Langevin diffusions for sampling purposes, as it gives the exponential rate at which the evolution of probabilities built using the metric $G$ converges towards the posterior: larger $\lambda_{G}$ corresponds to faster convergence. However, in order to establish a fair performance comparison, the metrics need to be scaled appropriately. Indeed a faster rate can be obtained by scaling down the metric (which can be thought of as time-rescaling), as it is clearly seen by the scaling property (4.4) of the functional $\lambda_{G}.$ It is important to note that scaling down the metric leads to a faster diffusion, but also makes its discretization more expensive. Indeed the error of Euler discretizations is largely influenced by the Lipschitz constant of the drift. This motivates that a fair criterion for choosing the metric could be to maximize $\lambda_{G}$ with the constraint

[TABLE]

since $\nabla_{g}F=G^{-1}\nabla F$ (where $\nabla$ denotes the standard Euclidean gradient) is the drift of the diffusion (3.13). Note that the constraint (4.5) ensures that the metric cannot be scaled down arbitrarily while also guaranteeing that the discretizations do not become increasingly expensive. We remark that other constraints involving higher regularity requirements may be useful if higher order discretizations are desired.

Remark 4.2.

The functional $\lambda_{G}$ can be used to determine the optimal metric among a certain subclass of metrics of interest satisfying the condition (4.5). For instance, it may be of interest to find the optimal constant metric $G$ (see Proposition 4.3 below), or to find the best metric within a finite family of metrics. On the other hand the constraint (4.5) forces feasible metrics to induce diffusions that are not expensive to discretize.

To illustrate the previous remark we show that for a Gaussian target measure the optimal preconditioner is, unsurprisingly, given by the Fisher information. More precisely we have the following proposition:

Proposition 4.3.

Let $\mu=N(0,\Sigma).$ Then

[TABLE]

maximizes $\lambda_{G}$ over the class of constant metrics $G$ satisfying $\|G^{-1}\Sigma^{-1}\|\leq 1,$ as in (4.5). Moreover, the maximum value is

[TABLE]

Proof.

Suppose for the sake of contradiction that there exists a constant metric $G$ that satisfies condition (4.5), which in this case reads $\lVert G^{-1}\Sigma^{-1}\rVert\leq 1$ and is such that $\lambda_{G}>\lambda_{G^{*}}.$

Let $u$ be a unit norm eigenvector of $G$ with eigenvalue $\lambda>0$ . Notice that by definition of $\lambda_{G}$ we must have

[TABLE]

The left hand side of the above display can be rewritten as

[TABLE]

and by the Cauchy-Schwarz inequality we see that

[TABLE]

Since $u$ is an eigenvector of $G$ with eigenvalue $\lambda$ , it follows that $u$ is also an eigenvector of $G^{1/2}$ with eigenvalue $\sqrt{\lambda}$ and of $G^{-1/2}$ with eigenvalue $\frac{1}{\sqrt{\lambda}}$ . Therefore the right hand side of the above display is equal to one. This however contradicts (4.7). From this we deduce the optimality of $G^{*}$ among feasible metrics.

∎

Example 2.

Suppose that $F(u)=\frac{1}{2}\langle\Sigma^{-1}u,u\rangle,$ with

[TABLE]

Consider the optimal metric

[TABLE]

given by the previous proposition and the rescaled Euclidean metric $G_{e}$

[TABLE]

where the scalings have been chosen so that

[TABLE]

A calculation then shows that $\lambda_{G^{*}}=1$ while $\lambda_{G_{e}}=\epsilon.$ Note that if the Euclidean metric is not rescaled by $\epsilon^{-1}$ —violating the constraint (4.5)— then the same unit rate of convergence as with the metric $G^{*}$ is achieved. However, the drift of the associated diffusion

[TABLE]

is of order $\epsilon^{-1},$ making the discretization increasingly expensive in the small $\epsilon$ limit. On the other hand, since both $G^{*-1}$ and $G_{e}^{-1}$ are of order $\epsilon$ , the drifts for both associated diffusions are order $1$ . This motivates our choice of constraint in equation (4.5).
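The trade-off in Example 2 can be reproduced numerically under two assumptions that are not taken from the displays above. First, for a constant metric $G$ the linear change of variables $y=G^{1/2}x$ turns $g$ into the Euclidean metric, so for $F(u)=\frac{1}{2}\langle\Sigma^{-1}u,u\rangle$ the convexity constant is taken to be the smallest eigenvalue of $G^{-1/2}\Sigma^{-1}G^{-1/2}$. Second, the covariance is taken to be $\Sigma=\mathrm{diag}(1,\epsilon)$, a hypothetical choice that yields the values $\lambda_{G^{*}}=1$ and $\lambda_{G_{e}}=\epsilon$ quoted in the example. The helper names are illustrative.

```python
import numpy as np

def inv_sqrt_spd(G):
    """Inverse square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(G)
    return (V / np.sqrt(w)) @ V.T

def lambda_constant_metric(G, Sigma):
    """Smallest eigenvalue of G^{-1/2} Sigma^{-1} G^{-1/2} for a constant metric G."""
    R = inv_sqrt_spd(G)
    M = R @ np.linalg.inv(Sigma) @ R
    return float(np.linalg.eigvalsh(0.5 * (M + M.T)).min())

def drift_lipschitz(G, Sigma):
    """Spectral norm of G^{-1} Sigma^{-1}, the Lipschitz constant of the drift."""
    return float(np.linalg.norm(np.linalg.inv(G) @ np.linalg.inv(Sigma), 2))
```

With `Sigma = np.diag([1.0, eps])`, comparing `G = np.linalg.inv(Sigma)`, `G = np.eye(2) / eps`, and `G = np.eye(2)` shows the three regimes discussed in the text: unit rate with unit drift norm, rate $\epsilon$ with unit drift norm, and unit rate with drift norm $\epsilon^{-1}$.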

5 Example: Semi-Supervised Learning

In this section we study the geodesic convexity of functionals arising in the Bayesian formulation of semi-supervised classification. Our purpose is to illustrate the concepts in a tangible setting, and to show that establishing sharp levels of geodesic convexity may be more tractable for some functionals than others.

In semi-supervised classification one is interested in the following task: given a data cloud $X=\{x_{1},\dots,x_{n}\}$ together with (noisy) labels $y_{i}\in\{-1,1\}$ for some of the data points $x_{i},$ $i\in\mathcal{Z}\subset\{1,\dots,n\}$ , classify the unlabeled data points by assigning labels to them. Here we assume access to a weight matrix $W$ quantifying the level of similarity between the points in $X$ . Thus, we focus on the graph-based approach to semi-supervised classification, which boils down to propagating the known labels to the whole cloud, using the geometry of the weighted graph $(X,W)$ . We will investigate the existence and convergence of gradient flows for several Bayesian graph-based classification models proposed in Bertozzi et al. (2017). In the Bayesian approach, the geometric structure that the weighted graph imposes on the data cloud is used to build a prior on a latent space, and the noisy given labels are used to build the likelihood. The Bayesian solution to the classification problem is a measure on the latent space, which is then pushed forward into a measure on the label space $\{-1,1\}^{n}$ . This latter measure contains information on the most likely labels, and also provides a principled way to quantify the remaining uncertainty in the classification process.

Let $(X,W)$ then be a weighted graph, where $X=\{x_{1},\dots,x_{n}\}$ is the set of nodes of the graph and $W$ is the weight matrix between the points in $X$ . All the entries of $W$ are non-negative real numbers and we assume that $W$ is symmetric. Let $L$ be the graph Laplacian matrix defined by

$$L:=D-W,$$

where $D$ is the degree matrix of the weighted graph, i.e., the diagonal matrix with diagonal entries $D_{ii}=\sum_{j=1}^{n}W_{ij}$ . The above corresponds to the unnormalized graph Laplacian, but different normalizations are possible Von Luxburg (2007). The graph-Laplacian will be used in all the models below to favor prior draws of the latent variables that are consistent with the geometry of the data cloud.
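A minimal sketch of the construction of the unnormalized graph Laplacian, $L=D-W$ with $D_{ii}=\sum_{j}W_{ij}$, is given below. The function names and the small weight matrix used to exercise them are illustrative assumptions. The second helper evaluates the quadratic form $\frac{1}{2}\sum_{i,j}W_{ij}(u_{i}-u_{j})^{2}$, which coincides with $\langle u,Lu\rangle$ and explains why priors built from $L$ favor latent variables that vary little across strongly connected nodes.

```python
import numpy as np

def graph_laplacian(W):
    """Unnormalized graph Laplacian L = D - W for a symmetric weight matrix W."""
    W = np.asarray(W, dtype=float)
    if not np.allclose(W, W.T):
        raise ValueError("W must be symmetric")
    D = np.diag(W.sum(axis=1))   # degree matrix
    return D - W

def graph_dirichlet_energy(W, u):
    """0.5 * sum_{i,j} W_ij (u_i - u_j)^2, which equals u^T L u."""
    W = np.asarray(W, dtype=float)
    u = np.asarray(u, dtype=float)
    diff = u[:, None] - u[None, :]
    return 0.5 * float(np.sum(W * diff ** 2))
```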

Remark 5.1.

A special case of a weighted graph $(X,W)$ frequently found in the literature is that in which the points in $X$ are i.i.d. points sampled from some distribution on a manifold $\mathcal{M}$ embedded in $\mathbb{R}^{d}$ , and the similarity matrix $W$ is obtained as

[TABLE]

In the above, $K$ is a compactly supported kernel function, $\lvert x_{i}-x_{j}\rvert$ is the Euclidean distance between the points $x_{i}$ and $x_{j},$ and $r>0$ is a parameter controlling data density. It can be shown (see Burago et al. (2013) and Garcia Trillos et al. (2017a)) that the smallest non-trivial eigenvalue of a rescaled version of the resulting graph Laplacian is close to the smallest non-trivial eigenvalue of a weighted Laplacian on the manifold, provided that $r$ is scaled with $n$ appropriately.

We will now study the probit and logistic models in subsection 5.1, and then the Ginzburg-Landau model in 5.2.

5.1 Probit and Logistic Models

Traditionally, the probit approach to semi-supervised learning is to classify the unlabeled data points by first optimizing the functional $G:\mathbb{R}^{n}\to\mathbb{R}$ given by

[TABLE]

over all $u\in\mathbb{R}^{n}$ satisfying $\sum_{i=1}^{n}u_{i}=0,$ and then thresholding the optimizer with the sign function; the parameter $\alpha>0$ is used to regularize the functions $u$ . The minimizer of the functional $G$ can be interpreted as the MAP (maximum a posteriori estimator) in the Bayesian formulation of probit semi-supervised learning (see Bertozzi et al. (2017)) that we now recall:

Prior: Consider the subspace $U:=\{u\in\mathbb{R}^{n}\>:\>\sum_{i=1}^{n}u_{i}=0\}$ and let $\pi$ be the Gaussian measure on $U$ defined by

[TABLE]

The measure $\pi$ is interpreted as a prior distribution on the space of real valued functions on the point cloud $X$ with average zero. Larger values of $\alpha>0$ force more regularization of the functions $u$ .

Likelihood function: For a fixed $u\in U$ and for $j\in\mathcal{Z}$ define

[TABLE]

where the $\eta_{j}$ are i.i.d. $N(0,\gamma^{2}),$ and $S$ is the sign function. This specifies the distribution of observed labels given the underlying latent variable $u$ . We then define, for given data $y$ , the negative log-density function

[TABLE]

where $H$ is given by (5.1).

Posterior distribution: As shown in Bertozzi et al. (2017), a simple application of Bayes’ rule gives the posterior distribution of $u$ given $y$ (denoted by $\mu^{y}$ ):

[TABLE]

where $\Psi$ is given by (5.2), and $\phi$ is given by (5.3).

From what has been discussed in the previous sections, the posterior $\mu^{y}$ can be characterized as the unique minimizer of the energy

[TABLE]

Let us first consider the gradient flow of $J_{\mbox{\tiny{\rm KL}}}$ with respect to the usual Wasserstein space (i.e. the one induced by the Euclidean distance).

We can study the geodesic convexity of this functional by studying independently the convexity properties of $D_{\mbox{\tiny{\rm KL}}}(\nu\|\pi)$ and of $\phi(\cdot;y)$ . Precisely:

i) Since $\pi$ is a Gaussian measure with covariance $L^{-\alpha}$ , Example 1 shows that $D_{\mbox{\tiny{\rm KL}}}(\nu\|\pi)$ is $(\Lambda_{\min}(L))^{\alpha}$ -geodesically convex in Wasserstein space, where $\Lambda_{\min}(L)$ is the smallest non-trivial eigenvalue of $L.$

ii) The function $\phi(\cdot;y)$ is convex (see the appendix of Bertozzi et al. (2017)). Hence, the functional $F_{\mbox{\tiny{\rm KL}}}(\nu)=\int_{\mathbb{R}^{n}}\phi(u;y)d\nu(u)$ is $0$ -geodesically convex in Wasserstein space.

It then follows from Proposition 3.1 that $J_{\mbox{\tiny{\rm KL}}}$ is $(\Lambda_{\min}(L))^{\alpha}$ -geodesically convex in Wasserstein space. As a consequence, if we consider $t\in[0,\infty)\mapsto\mu_{t}$ , the gradient flow of $J_{\mbox{\tiny{\rm KL}}}$ with respect to the Wasserstein distance starting at $\mu_{0}$ (an absolutely continuous measure with respect to $\mu$ ), geometric inequalities can be immediately obtained from (3.7); such inequalities will not deteriorate with $n$ —see Remark 5.1.
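To make the constant $(\Lambda_{\min}(L))^{\alpha}$ concrete, the following minimal sketch evaluates it on a toy problem. Everything specific here is an assumption made for illustration and is not taken from the paper: the matrix `scaled_path_laplacian(n)` (a path-graph Laplacian scaled by $n^{2}$, mimicking a continuum Laplacian on an interval) is only a hypothetical stand-in for $L$, and the function names are ours.

```python
import numpy as np

def scaled_path_laplacian(n):
    """Hypothetical stand-in for L: path-graph Laplacian scaled by n**2."""
    W = np.zeros((n, n))
    i = np.arange(n - 1)
    W[i, i + 1] = 1.0
    W[i + 1, i] = 1.0
    return n**2 * (np.diag(W.sum(axis=1)) - W)

def convexity_constant(L, alpha):
    """Lambda_min(L)**alpha, with Lambda_min the smallest non-trivial eigenvalue."""
    eigvals = np.linalg.eigvalsh(L)  # ascending; eigvals[0] is ~0 (constant mode)
    return eigvals[1] ** alpha

# In this toy setting the constant stays close to pi**2 (alpha = 1) as n grows.
constants = {n: convexity_constant(scaled_path_laplacian(n), alpha=1.0)
             for n in (50, 100, 200)}
```

In this toy setting the convexity constant is essentially independent of $n$, which is the kind of behaviour the paragraph above refers to.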

However, the diffusion associated to this flow is given by

[TABLE]

and in particular its drift (more precisely the term $L^{\alpha}X_{t}$ ) deteriorates as $n$ gets larger. Notice that if we wanted to control the cost of discretization by rescaling the Euclidean metric (as exhibited in Example 2), the geodesic convexity of the resulting flow would vanish as $n$ gets larger.
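The cost of discretization can be illustrated with an explicit Euler-Maruyama step. The sketch below assumes that the diffusion has the form $dX_{t}=-(L^{\alpha}X_{t}+\nabla\phi(X_{t};y))dt+\sqrt{2}\,dB_{t}$, which is our reading of the text (the drift contains the term $L^{\alpha}X_{t}$). The scaled path-graph Laplacian is again a hypothetical stand-in for $L$, and all function names are ours.

```python
import numpy as np

def scaled_path_laplacian(n):
    """Hypothetical stand-in for L: path-graph Laplacian scaled by n**2."""
    W = np.zeros((n, n))
    i = np.arange(n - 1)
    W[i, i + 1] = 1.0
    W[i + 1, i] = 1.0
    return n**2 * (np.diag(W.sum(axis=1)) - W)

def fractional_power(L, alpha):
    """L**alpha (alpha > 0) for a symmetric PSD matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(L)
    w = np.clip(w, 0.0, None)  # remove tiny negative round-off
    return (V * w**alpha) @ V.T

def euler_maruyama_step(x, L_alpha, grad_phi, dt, rng):
    """One explicit step of dX = -(L^alpha X + grad_phi(X)) dt + sqrt(2) dB (assumed form)."""
    noise = rng.standard_normal(x.shape)
    return x - dt * (L_alpha @ x + grad_phi(x)) + np.sqrt(2.0 * dt) * noise

def max_stable_step(L_alpha):
    """Explicit-Euler stability limit 2 / Lambda_max for the linear part of the drift."""
    return 2.0 / np.linalg.eigvalsh(L_alpha)[-1]

# The admissible step size shrinks roughly like 1 / n**2 in this toy setting.
steps = {n: max_stable_step(fractional_power(scaled_path_laplacian(n), 1.0))
         for n in (10, 100)}
```

The shrinking step size reflects the growth of the largest eigenvalue of $L^{\alpha}$ with $n$, so more steps are needed to simulate the diffusion over a fixed time horizon.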

The previous discussion shows that the flow of $J_{\mbox{\tiny{\rm KL}}}$ in the usual Wasserstein sense does not produce a flow with good convergence properties that at the same time is cheap to discretize (robustly in $n$ ). This motivates considering the gradient flow of $J_{\mbox{\tiny{\rm KL}}}$ with respect to the Wasserstein distance induced by a certain constant metric $g$ . Indeed, inspired by Proposition 4.3, let us consider the constant metric tensor

[TABLE]

Since the metric tensor is constant, in particular its induced volume form $vol_{g}$ is proportional to the Lebesgue measure and hence we can write

[TABLE]

On the other hand, from the discussion in Section 3.3.1 we know that the densities of the stochastic process

[TABLE]

correspond to the gradient flow of the energy $J_{\mbox{\tiny{\rm KL}}}$ with respect to the Wasserstein distance induced by the metric $g$, where $B^{g}$ is a Brownian motion on $(\mathbb{R}^{m},g)$. This diffusion can be rewritten in terms of the standard Euclidean gradient $\nabla$ and Brownian motion $B$ as

[TABLE]

after noticing that

[TABLE]

where for the second identity we have used the fact that $G$ is constant. How convex is the energy $J_{\mbox{\tiny{\rm KL}}}$ with respect to the Wasserstein distance induced by $g$ ? Since the metric tensor $G$ is constant it follows that

[TABLE]

where $F(u):=\phi(u;y)+\frac{1}{2}\langle L^{\alpha}u,u\rangle$ . Finally, due to the convexity of $\phi(u;y)$ we deduce that

[TABLE]

We notice that in (5.6) $L$ appears as $L^{-\alpha}$ . This is a fundamental difference from (5.5) (where $L$ appears as $L^{\alpha}$ ) with computational advantages, given that the eigenvalues of $L$ grow towards infinity.
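The convexity gained by the change of metric can be checked numerically. For a constant metric the Christoffel symbols vanish, so by the change of variables $w:=G^{1/2}v$ used in the proof of Theorem 4.1 the relevant quantity is the smallest eigenvalue of $G^{-1/2}(\text{Hess}\,F)G^{-1/2}$. The sketch below assumes $G=L^{\alpha}$ on $U$ (our reading of the text, since $L$ enters (5.6) as $L^{-\alpha}$), uses a hypothetical scaled path-graph Laplacian in place of $L$, and uses a hypothetical positive semi-definite matrix `hess_phi` in place of the Hessian of $\phi(\cdot;y)$.

```python
import numpy as np

def scaled_path_laplacian(n):
    """Hypothetical stand-in for L: path-graph Laplacian scaled by n**2."""
    W = np.zeros((n, n))
    i = np.arange(n - 1)
    W[i, i + 1] = 1.0
    W[i + 1, i] = 1.0
    return n**2 * (np.diag(W.sum(axis=1)) - W)

def preconditioned_min_eig(L, alpha, hess_phi):
    """Smallest eigenvalue of G^{-1/2} (Hess F) G^{-1/2} on U, with G = L^alpha (assumed)
    and F(u) = phi(u) + 0.5 * <L^alpha u, u>."""
    w, V = np.linalg.eigh(L)
    w, V = w[1:], V[:, 1:]            # drop the constant mode: coordinates on U
    H = V.T @ hess_phi @ V            # Hessian of phi in the eigenbasis of L
    scale = w ** (-alpha / 2.0)
    M = np.eye(len(w)) + scale[:, None] * H * scale[None, :]
    return np.linalg.eigvalsh(M)[0]

n, alpha = 50, 2.0
L = scaled_path_laplacian(n)
hess_phi = np.zeros((n, n))
hess_phi[:5, :5] = 0.25 * np.eye(5)   # hypothetical curvature at 5 labeled nodes
lam_g = preconditioned_min_eig(L, alpha, hess_phi)
```

Because `hess_phi` is positive semi-definite, the matrix is the identity plus a positive semi-definite term, so `lam_g` is at least 1 regardless of $n$ in this sketch.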

Remark 5.2.

A carefully designed discretization of (5.6) yields the so-called Langevin pCN proposal for MCMC (see Cotter et al. (2013)).
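One discretization of this type can be sketched as follows. We assume that (5.6) has the form $dX_{t}=-(X_{t}+C\nabla\phi(X_{t};y))dt+\sqrt{2}\,C^{1/2}dB_{t}$ with $C=L^{-\alpha}$, and we take a Crank-Nicolson-type step of size $\delta$ that is implicit (midpoint) in the linear term and explicit in $\nabla\phi$. The function names, the arguments `C` and `C_sqrt`, and the assumed form of (5.6) are ours; the exact scheme in Cotter et al. (2013) should be checked against that reference.

```python
import numpy as np

def cn_coefficients(delta):
    """Coefficients of the update v = a*u - c*C@grad_phi(u) + b*xi, obtained by solving
    v - u = -delta*((u + v)/2 + C@grad_phi(u)) + sqrt(2*delta)*xi for v."""
    a = (2.0 - delta) / (2.0 + delta)
    b = np.sqrt(8.0 * delta) / (2.0 + delta)
    c = 2.0 * delta / (2.0 + delta)
    return a, b, c

def langevin_cn_proposal(u, C, C_sqrt, grad_phi, delta, rng):
    """Propose v from u; xi = C_sqrt @ z has covariance C when C_sqrt @ C_sqrt.T == C."""
    a, b, c = cn_coefficients(delta)
    xi = C_sqrt @ rng.standard_normal(u.shape)
    return a * u - c * (C @ grad_phi(u)) + b * xi
```

Since $a^{2}+b^{2}=1$, the step leaves $N(0,C)$ invariant when $\nabla\phi\equiv 0$. In an MCMC scheme the proposal would be followed by a Metropolis-Hastings accept/reject step, which is not shown here.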

Remark 5.3.

In the above we have considered a probit model for the likelihood function. The ideas generalize straightforwardly to other settings, notably the logistic model

[TABLE]

where

[TABLE]

The convexity of $\phi$ for the logistic model (5.7) can be established by direct computation of the second derivative of $\sigma.$
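The second-derivative computation can be checked numerically. The sketch assumes that $\sigma(t)=1/(1+e^{-t})$ and that each labeled point contributes a term of the form $-\log\sigma(\cdot)$ to $\phi$. Both are assumptions on our part, since the displays defining the logistic model are not reproduced here. Under these assumptions $\frac{d^{2}}{dt^{2}}\bigl(-\log\sigma(t)\bigr)=\sigma(t)(1-\sigma(t))\in(0,1/4]$.

```python
import numpy as np

def sigma(t):
    """Logistic function (assumed form of sigma)."""
    return 1.0 / (1.0 + np.exp(-t))

def neg_log_sigma(t):
    """-log(sigma(t)) = log(1 + exp(-t)), evaluated stably."""
    return np.logaddexp(0.0, -t)

def neg_log_sigma_second_derivative(t):
    """Closed form of the second derivative: sigma(t) * (1 - sigma(t))."""
    s = sigma(t)
    return s * (1.0 - s)

# Compare the closed form with a central finite difference on a grid.
t = np.linspace(-8.0, 8.0, 161)
h = 1e-3
finite_diff = (neg_log_sigma(t + h) - 2.0 * neg_log_sigma(t) + neg_log_sigma(t - h)) / h**2
closed_form = neg_log_sigma_second_derivative(t)
```

Positivity of `closed_form` on the grid is consistent with the convexity of each summand, and hence of $\phi$, under the stated assumptions.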

5.2 Ginzburg-Landau Model

We now present the Ginzburg-Landau model for semi-supervised learning. This model provides an example of a functional $J_{\mbox{\tiny{\rm KL}}}$ whose geodesic convexity with respect to Wasserstein distance is not positive (and hence one cannot deduce geometric inequalities describing the rate of convergence towards the posterior), but for which one can obtain a positive spectral gap giving the rate of convergence of the flow of the Dirichlet energy in the $L^{2}$ sense.

Let

[TABLE]

We consider the following Bayesian model.

Prior:

[TABLE]

Likelihood function: For $j\in\mathcal{Z},$

[TABLE]

This leads to the following negative log-density function:

[TABLE]

Posterior distribution: Combining the prior and the likelihood via Bayes’ formula gives the posterior distribution

[TABLE]

where $\Psi$ is given by (5.8), and $\phi$ is given by (5.9).

For this model, the negative prior log-density $\Psi$ is not convex, and Wasserstein $\lambda$ -geodesic convexity of the functional $D_{\mbox{\tiny{\rm KL}}}(\cdot\|\pi)$ only holds for negative $\lambda.$ In particular, it is not possible to deduce exponential decay taking the Wasserstein flow point of view. However, in the $L^{2}$ /Dirichlet energy setting we can still show exponential convergence towards the posterior $\mu^{y}$ . Indeed, because the negative log-likelihood of $\mu^{y}$ satisfies:

[TABLE]

there exists some $\lambda>0$ for which $\mu^{y}$ has a Poincaré inequality with constant $\lambda$ (see Chapter 4.5 in Pavliotis (2014)). In this example we can say more, and in particular we are able to find a Poincaré constant that depends explicitly on $\varepsilon$ , the smallest non-trivial eigenvalue of $L,$ and $k:=|\mathcal{Z}|$ .

Let $\psi(u):=\sum_{j\in\mathcal{Z}}W_{\varepsilon}(u_{j})$ and let $\psi_{c}$ be its convex envelope, i.e. let $\psi_{c}$ be the largest convex function that is below $\psi$ . It is straightforward to show that $\psi_{c}(0)=0$ and that

[TABLE]

Consider now the probability measure $\mu_{c}$ with Lebesgue density

[TABLE]

and define $\lambda_{2}$ and $\lambda_{2,c}$ as in Remark 3.5 using $\mu^{y}$ and $\mu_{c}$ instead of $\mu$ . For any given $f\in L^{2}(\mu)$ we then have

[TABLE]

where the first inequality follows from the fact that $f_{\mu}=\mbox{argmin}_{a\in\mathbb{R}}\int|f-a|^{2}d\mu$ and the second inequality follows directly from the fact that $0\geq\psi_{c}-\psi\geq-\frac{k}{\varepsilon}$ . It follows that

[TABLE]

where the last inequality follows from the fact that the negative log-likelihood of $\mu^{y}$ satisfies the Bakry-Emery condition with constant $\Lambda_{\min}(L)$ (see Chapter 4.5 in Pavliotis (2014)). Clearly, the Poincaré constant above is very large for small $\varepsilon$ or for large $k$ (number of labeled data points). We also notice that the cost of discretization of the diffusion associated to this flow increases with $n$ (as in Section 5.1).

Remark 5.4.

A similar analysis can be carried out now using the constant metric

[TABLE]

More precisely, consider the flow of the Dirichlet energy

[TABLE]

with respect to $L^{2}(\mu)$ . How convex is this functional? For every $f\in U$ we have

[TABLE]

from where it follows that

[TABLE]

A remark similar to the one at the end of Section 5.1, regarding the dependence on $L$ of the resulting diffusion, applies here as well.

6 Conclusions and Future Work

The main contribution of this paper is to explore three variational formulations of the Bayesian update and their associated gradient flows. We have shown that, for each of the three variational formulations, the geodesic convexity of the objective functionals gives a bound on the rate of convergence of the flows to the posterior. As an application of the theory, we have suggested a criterion for the optimal choice of metric in Riemannian MCMC schemes. We summarize below some additional outcomes and directions for further work.

• We bring attention to different variational formulations of the Bayesian update. These formulations have the potential of extending the theory of Bayesian inverse problems in function spaces, in particular in cases with infinite dimensional, non-additive, and non-Gaussian observation noise. Moreover, they suggest numerical approximations to the posterior by restricting the space of allowed measures in the minimization, by discretization of the associated gradient flows, or by sampling via simulation of the associated diffusion.

• The variational framework considered in this paper provides a natural setting for the study of robustness of Bayesian models, and for the analysis of convergence of discrete to continuum Bayesian models. Indeed, the authors Garcia Trillos and Sanz-Alonso (2018), Garcia Trillos et al. (2017b) have recently established the consistency of Bayesian semi-supervised learning in the regime with fixed number of labeled data points and growing number of unlabeled data. The analysis relies on the variational formulation based on Kullback-Leibler prior penalization in equation (5.4).

• The results in the paper give new understanding of the ubiquity of Kullback-Leibler penalizations in sampling methodology. In practice Kullback-Leibler is often used for computational and analytical tractability. The results presented in Section 3.3 show that Kullback-Leibler prior penalization leads to a heat-type flow and, therefore, to an easily discretized diffusion process. On the other hand, $\chi^{2}$ prior penalization leads to a nonlinear diffusion process.

Acknowledgments

We are thankful to Matías Delgadino for pointing out to us the reference Ohta and Takatsu (2011) while participating in the CNA Ki-net workshop “Dynamics and Geometry from High Dimensional Data” that took place at Carnegie Mellon University in March 2017. We are also thankful to Sayan Mukherjee for the reference Zellner (1988). Finally, we thank the anonymous editor and referees for their immense help in improving the readability of our manuscript.

Appendix A Proof of Theorem 4.1

Notice that the measure $\mu$ can be rewritten as

[TABLE]

where

[TABLE]

From Proposition 3.1 the sharp constant $\lambda$ for which $D_{\mbox{\tiny{\rm KL}}}(\cdot\|\mu)$ is $\lambda$ -geodesically convex in the $g$ -Wasserstein space is given by

[TABLE]

where $\text{Ric}_{g}$ and $\text{Hess}_{g}$ stand for the Ricci curvature and Hessian in the $g$-metric. To establish the theorem, it suffices to show that for any given $x\in\mathbb{R}^{m}$, $\min_{v:g(v,v)=1}\bigl(\text{Ric}_{g}(v,v)+\text{Hess}_{g}F_{g}(v,v)\bigr)$ is equal to the smallest eigenvalue of the matrix $B+\text{Hess}\,F-C$.

Let us start by recalling that the $g$ -Hessian of a $C^{2}$ function $I$ , denoted by $\text{Hess}_{g}I$ , is the symmetric $(2,0)$ -tensor satisfying

[TABLE]

for every $v\in\mathbb{R}^{m}$ and every constant speed $g$ -geodesic curve with $\gamma(0)=x$ and $\dot{\gamma}(0)=v$ . It is convenient to rewrite $\text{Hess}_{g}I(v,v)$ in terms of the Christoffel symbols of the metric $g$ , the Euclidean inner product, and the regular (Euclidean) gradient and Hessian of the function $I$ . Let $\gamma:(-\varepsilon,\varepsilon)\rightarrow\mathbb{R}^{m}$ be a constant speed $g$ -geodesic with $\gamma(0)=x$ , $\dot{\gamma}(0)=v$ . We can then write:

[TABLE]

where $\nabla I$ and $\text{Hess}\,I$ are the usual gradient and Hessian matrix of $I,$ respectively. The acceleration of the curve $\gamma$ can be written in terms of the Christoffel symbols. Namely, if we write $\gamma$ in coordinates as

[TABLE]

the following system of second order ODEs holds:

[TABLE]

Plugging the geodesic equations back into (A.1), and setting $t=0$ , it follows that

[TABLE]

where $A(I)$ is the matrix with coordinates

[TABLE]

Hence,

[TABLE]

Taking $I:=F_{g}$ we can write the coordinate $ij$ of the matrix $\text{Hess}\,I-A(I)$ as

[TABLE]

Using now the fact that the $g$ -divergence of the vector field $\frac{\partial}{\partial x_{l}}$ can be written in terms of the Christoffel symbols as

[TABLE]

we deduce that

[TABLE]

On the other hand, the Ricci curvature $\text{Ric}_{g}(v,v)$ can be written in terms of the Christoffel symbols and their derivatives (alternatively in terms of the metric and its first and second order derivatives) as

[TABLE]

where $R$ is the (symmetric) matrix with entries

[TABLE]

see do Carmo Valero (1992) for details. After some cancellations using the symmetry of the symbols, we obtain that

[TABLE]

and so

[TABLE]

Using (A.2) and (A.3) we deduce that

[TABLE]

Therefore the variational problem (for every fixed $x\in\mathbb{R}^{m}$ )

[TABLE]

can be rewritten, applying the change of variables $w:=G^{1/2}v,$ as

[TABLE]

In turn this coincides with

[TABLE]

i.e., the smallest eigenvalue of the matrix

[TABLE]

This concludes the proof of the first part of the theorem.

Now we show the scaling property (4.4). Let $\widetilde{G}=aG$ . By definition

[TABLE]

where $\widetilde{B}$ and $\widetilde{C}$ are defined as in (4.2) and (4.3) but in terms of the metric $\widetilde{G}.$ From the expression (4.1) for the Christoffel symbols, it follows that they are invariant under rescaling of the metric and, since $B,\,\widetilde{B}$ and $C,\,\widetilde{C}$ depend on the metric only through the symbols, we deduce that $\widetilde{B}=B$, $\widetilde{C}=C.$ Therefore,

[TABLE]

∎
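The invariance of the Christoffel symbols under a constant rescaling of the metric, used in the last step of the proof, can be verified numerically. The sketch below uses the standard formula $\Gamma^{k}_{ij}=\frac{1}{2}g^{kl}(\partial_{i}g_{jl}+\partial_{j}g_{il}-\partial_{l}g_{ij})$, which we assume agrees with (4.1), and approximates the derivatives by central differences. The example metric (the Euclidean plane in polar coordinates) and all function names are ours.

```python
import numpy as np

def christoffel(metric, x, h=1e-5):
    """Gamma[k, i, j] = 0.5 * g^{kl} (d_i g_{jl} + d_j g_{il} - d_l g_{ij}),
    with partial derivatives of the metric taken by central differences."""
    m = x.size
    dG = np.empty((m, m, m))            # dG[l, i, j] = d g_{ij} / d x_l
    for l in range(m):
        e = np.zeros(m)
        e[l] = h
        dG[l] = (metric(x + e) - metric(x - e)) / (2.0 * h)
    G_inv = np.linalg.inv(metric(x))
    # bracket[l, i, j] = d_i g_{jl} + d_j g_{il} - d_l g_{ij}
    bracket = np.einsum('ijl->lij', dG) + np.einsum('jil->lij', dG) - dG
    return 0.5 * np.einsum('kl,lij->kij', G_inv, bracket)

def polar_metric(x):
    """Euclidean plane in polar coordinates x = (r, theta): g = diag(1, r**2)."""
    return np.diag([1.0, x[0] ** 2])

x0 = np.array([2.0, 0.3])
a = 3.7
gamma = christoffel(polar_metric, x0)
gamma_scaled = christoffel(lambda x: a * polar_metric(x), x0)
```

Here `gamma` and `gamma_scaled` agree up to round-off: rescaling $G$ by $a$ multiplies the derivatives of the metric by $a$ and the inverse metric by $1/a$.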

Bibliography

Ambrosio et al. (2008) Ambrosio, L., Gigli, N., and Savaré, G. (2008). Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media.
Attias (1999) Attias, H. (1999). “A Variational Bayesian Framework for Graphical Models.” In NIPS, volume 12.
Bertozzi et al. (2017) Bertozzi, A. L., Luo, X., Stuart, A. M., and Zygalakis, K. C. (2017). “Uncertainty quantification in the classification of high dimensional data.”
Besag (1994) Besag, J. E. (1994). “Comments on “Representations of knowledge in complex systems” by U. Grenander and M. I. Miller.” J. Roy. Statist. Soc. Ser. B, 56: 591–592.
Burago et al. (2001) Burago, D., Burago, Y., and Ivanov, S. (2001). A course in metric geometry, volume 33 of Graduate Studies in Mathematics. American Mathematical Society, Providence, RI. URL https://doi.org/10.1090/gsm/033
Burago et al. (2013) Burago, D., Ivanov, S., and Kurylev, Y. (2013). “A graph discretization of the Laplace-Beltrami operator.” arXiv preprint arXiv:1301.2222.
Cotter et al. (2013) Cotter, S. L., Roberts, G. O., Stuart, A. M., and White, D. (2013). “MCMC methods for functions: modifying old algorithms to make them faster.” Statistical Science, 28(3): 424–446.
do Carmo Valero (1992) do Carmo Valero, M. P. (1992). Riemannian Geometry.
