The Bayesian update: variational formulations and gradient flows
Nicolas Garcia Trillos, Daniel Sanz-Alonso

TL;DR
This paper explores the variational and gradient flow perspectives of Bayesian updates, introducing new tools for analyzing convergence and proposing novel MCMC proposal strategies based on these insights.
Contribution
It formalizes the connection between Bayesian posteriors, variational functionals, and gradient flows, and introduces a criterion for metric choice in Riemannian MCMC methods.
Findings
Convergence rates are bounded by geodesic convexity of the functionals.
Gradient flows lead to nonlinear diffusions with the posterior as invariant distribution.
Proposed a criterion for metric selection in Riemannian MCMC.
Nicolas Garcia Trillos, Department of Statistics, University of Wisconsin Madison
Daniel Sanz-Alonso, Department of Statistics, University of Chicago
Abstract
The Bayesian update can be viewed as a variational problem by characterizing the posterior as the minimizer of a functional. The variational viewpoint is far from new and is at the heart of popular methods for posterior approximation. However, some of its consequences seem largely unexplored. We focus on the following one: defining the posterior as the minimizer of a functional gives a natural path towards the posterior by moving in the direction of steepest descent of the functional. This idea is made precise through the theory of gradient flows, allowing to bring new tools to the study of Bayesian models and algorithms. Since the posterior may be characterized as the minimizer of different functionals, several variational formulations may be considered. We study three of them and their three associated gradient flows. We show that, in all cases, the rate of convergence of the flows to the posterior can be bounded by the geodesic convexity of the functional to be minimized. Each gradient flow naturally suggests a nonlinear diffusion with the posterior as invariant distribution. These diffusions may be discretized to build proposals for Markov chain Monte Carlo (MCMC) algorithms. By construction, the diffusions are guaranteed to satisfy a certain optimality condition, and rates of convergence are given by the convexity of the functionals. We use this observation to propose a criterion for the choice of metric in Riemannian MCMC methods.
MSC: 62F15, 49N99
Keywords: Gradient flows, Wasserstein space, convexity, Riemannian MCMC
1 Introduction
In this paper we revisit the old idea of viewing the posterior as the minimizer of an energy functional. The use of variational formulations of Bayes rule seems to have been largely focused on one of its methodological benefits: restricting the minimization to a subclass of measures is the backbone of variational Bayes methods for posterior approximation. Our aim is to bring attention to two other theoretical and methodological benefits, and to study in some detail one of these: namely, that each variational formulation suggests a natural path, defined by a gradient flow, towards the posterior. We use this observation to propose a criterion for the choice of metric in Riemannian MCMC methods.
Let us recall informally a variational formulation of Bayes rule. Given a prior $p(u)$ on an unknown parameter $u$ and a likelihood function $L(y|u)$, the posterior $p(u|y)\propto L(y|u)\,p(u)$ can be characterized Zellner (1988) as the distribution $q^{*}(u)$ that minimizes
$$J_{\mbox{\tiny{\rm KL}}}\bigl(q(u)\bigr)=D_{\mbox{\tiny{\rm KL}}}\bigl(q(u)\|p(u)\bigr)-\int\log\bigl(L(y|u)\bigr)\,q(u)\,du.$$
Indeed, minimizing $J_{\mbox{\tiny{\rm KL}}}\bigl(q(u)\bigr)$ is equivalent to minimizing the Kullback-Leibler divergence $D_{\mbox{\tiny{\rm KL}}}\bigl(q(u)\|p(u|y)\bigr)$, because the two differ only by the constant $\log\int L(y|u)\,p(u)\,du$, and so clearly the minimizer is $q^{*}(u)=p(u|y)$. Other variational formulations may be considered by minimizing, for instance, other divergences rather than Kullback-Leibler. In this paper we consider three variational formulations of the Bayesian update. The first two characterize the posterior measure as the minimizer of functionals $J_{\mbox{\tiny{\rm KL}}}$ or $J_{\mbox{\tiny{\rm\chi^{2}}}}$ constructed by penalizing deviations from the prior measure in Kullback-Leibler or $\chi^{2}$ divergence; definitions of these functionals and divergences are given in equations (3.2), (3.5), (2.2), (2.3). The third one characterizes the posterior density as the minimizer of the Dirichlet energy, see (2.4).
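The variational characterization can be checked numerically on a finite parameter space. The following is a minimal sketch, not the authors' code: the three-state prior and likelihood values are illustrative assumptions, and `j_kl` is a hypothetical discrete analogue of the Kullback-Leibler functional (divergence to the prior minus expected log-likelihood).

```python
import numpy as np

# Hypothetical 3-state example; the numbers are illustrative assumptions.
prior = np.array([0.5, 0.3, 0.2])
likelihood = np.array([0.1, 0.6, 0.3])

def kl(q, p):
    """Kullback-Leibler divergence between strictly positive discrete distributions."""
    return float(np.sum(q * np.log(q / p)))

def j_kl(q):
    """Discrete analogue of J_KL: KL to the prior minus the expected log-likelihood."""
    return kl(q, prior) - float(np.sum(q * np.log(likelihood)))

evidence = float(np.sum(prior * likelihood))
posterior = prior * likelihood / evidence   # Bayes rule

# J_KL(q) = KL(q || posterior) - log(evidence), so no candidate beats the posterior.
rng = np.random.default_rng(0)
candidates = rng.dirichlet(np.ones(3), size=1000)
gaps = np.array([j_kl(q) - j_kl(posterior) for q in candidates])
```

Every entry of `gaps` is non-negative, because the gap equals the Kullback-Leibler divergence from the candidate to the posterior.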
Why is it useful to view the posterior as the minimizer of an energy? We list below three advantages of this viewpoint, the third of which will be the focus of our paper.
1. The variational formulation provides a natural way to approximate the posterior by restricting the minimization problem to distributions satisfying some computationally desirable property. For instance, variational Bayes methods often restrict the minimization to $q$ with product structure Attias (1999), Wainwright and Jordan (2008), Fox and Roberts (2012). A similar idea is studied in Pinski et al. (2015), where $q$ is restricted to a class of Gaussian distributions. An iterative variational procedure that progressively improves the posterior approximation by enriching the family of distributions was introduced in Guo et al. (2016).
2. If the prior or the likelihood depend on a parameter $n$, then the variational formulation makes it possible to show large-$n$ convergence of posteriors by establishing the $\Gamma$-convergence of the associated energies. This method of proof has been employed by the authors in Garcia Trillos and Sanz-Alonso (2018), Garcia Trillos et al. (2017b) to analyze the large-data consistency of graph-based Bayesian semi-supervised learning.
3. Each variational formulation gives a natural path, defined by a gradient flow, towards the posterior. These flows can be thought of as time-parameterized curves in the space of probability measures, converging in the large-time limit towards the posterior.
In this paper we study three gradient flows associated with the variational formulations defined by minimization of the functionals $J_{\mbox{\tiny{\rm KL}}}$, $J_{\mbox{\tiny{\rm\chi^{2}}}}$, and the Dirichlet energy. For intuition, we recall that a gradient flow in Euclidean space defines a curve whose tangent always points in the direction of steepest descent of a given function (see equation (2.5)). In the same fashion, a gradient flow in a more general metric space can be thought of as a curve on said space that always points in the direction of steepest descent of a given functional Ambrosio et al. (2008). In Euclidean space the direction of steepest descent is naturally defined as that in which a Euclidean infinitesimal increment leads to the largest decrease in the value of the function. In a general metric space the direction of steepest descent is the one in which an infinitesimal increment defined in terms of the distance leads to the largest decrease in the value of the functional. In this paper we study:
(i) The gradient flows defined by $J_{\mbox{\tiny{\rm KL}}}$ and $J_{\mbox{\tiny{\rm\chi^{2}}}}$ in the space of probability measures with finite second moments endowed with the Wasserstein distance (definitions are given in (2.1)). By construction, these flows give curves of probability measures that evolve following the direction of steepest descent of $J_{\mbox{\tiny{\rm KL}}}$ and $J_{\mbox{\tiny{\rm\chi^{2}}}}$ in Wasserstein distance, converging to the posterior measure in the large-time limit.
(ii) The gradient flow defined by the Dirichlet energy in the space of square integrable densities endowed with the $L^{2}$ distance. By construction, this flow gives a curve of densities in $L^{2}$ that evolves following the direction of steepest descent of the Dirichlet energy in $L^{2}$ distance, converging to the posterior density in the large-time limit. Interestingly, the curve of measures associated with these densities is exactly the same as the curve defined by the $J_{\mbox{\tiny{\rm KL}}}$ flow on Wasserstein space Jordan et al. (1998).
A question arises: what is the rate of convergence of these flows to the posterior? The answer is, to a large extent, provided by the theory of optimal transport and gradient flows Ambrosio et al. (2008), Villani (2003), Villani (2008), Santambrogio (2015). We will review and provide a unified account of these results in the main body of the paper, section 3. In the remainder of this introduction we discuss how rates of convergence may be studied in terms of convexity of functionals, and how these rates may be used as a guide for the choice of proposals for MCMC methods.
Rates of convergence of the flows hinge on the convexity level of each of the functionals $J_{\mbox{\tiny{\rm KL}}}$, $J_{\mbox{\tiny{\rm\chi^{2}}}}$, and the Dirichlet energy. Recalling the Euclidean case may be helpful: gradient descent on a highly convex function will lead to fast convergence to the minimizer. What is, however, a sensible notion of convexity for functionals defined over measures or densities? Our presentation highlights that the notion of geodesic (or displacement) convexity McCann (1997) nicely unifies the theory: it guarantees the existence and uniqueness of the three gradient flows and it also provides a bound on their rate of convergence to the posterior. In the $L^{2}$ setting one can show that positive geodesic convexity is equivalent to the posterior satisfying a Poincaré inequality, and also to the existence of a spectral gap (see subsection 3.2.2). On the other hand, the geodesic convexity of $J_{\mbox{\tiny{\rm KL}}}$ and $J_{\mbox{\tiny{\rm\chi^{2}}}}$ in Wasserstein space is determined by the Ricci curvature of the manifold, as well as by the likelihood function and prior density (see (3.2), (3.3), (3.5), (3.6)). Typically the three functionals $J_{\mbox{\tiny{\rm KL}}}$, $J_{\mbox{\tiny{\rm\chi^{2}}}}$, and the Dirichlet energy will have different levels of geodesic convexity, and establishing a sharp bound on each of them may not be equally tractable.
The theory of gradient flows and optimal transport gives, for each of the flows, an associated Fokker-Planck partial differential equation (PDE) that governs the evolution of densities Ohta and Takatsu (2011), Santambrogio (2015). Such PDEs are typically costly to discretize if the parameter space is of moderate or high dimension, but they may be used in small-dimensional problems as a way to define tempering schemes. Here we do not explore this idea any further. Instead, we focus on the (nonlinear) diffusion processes associated with the PDEs. These diffusions are Langevin-type stochastic differential equations, whose evolving densities satisfy the Fokker-Planck equations. By construction, the invariant distribution of each of these diffusions is the sought posterior, and a bound on the rate of convergence of the diffusions to the posterior is given by the geodesic convexity of the corresponding functional. The gradient flow perspective automatically gives a sense in which the diffusions are optimal: the associated densities move locally (in Wasserstein or $L^{2}$ sense) in the direction of steepest descent of the functional. From this it immediately follows, for instance, that the law of a standard Langevin diffusion in Euclidean space evolves locally in Wasserstein space in the direction that minimizes Kullback-Leibler, and that it also evolves locally in $L^{2}$ in the direction that minimizes the Dirichlet energy.
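To make the role of the Langevin diffusion concrete, here is a minimal sketch of an Euler-Maruyama discretization of a standard Langevin diffusion in one dimension. The Gaussian target, the step size, and the names `grad_log_post`, `mean` and `var` are illustrative assumptions; the time discretization introduces a small bias, so the empirical law only approximates the posterior.

```python
import numpy as np

mean, var = 1.0, 0.5   # hypothetical Gaussian posterior N(mean, var)

def grad_log_post(u):
    """Gradient of the log-density of the assumed Gaussian posterior."""
    return -(u - mean) / var

rng = np.random.default_rng(0)
step = 0.01
# Cloud of particles started far from the posterior.
u = rng.normal(-3.0, 0.1, size=5000)
for _ in range(2000):
    # Euler-Maruyama step for du = grad log p(u) dt + sqrt(2) dW.
    u = u + step * grad_log_post(u) + np.sqrt(2.0 * step) * rng.normal(size=u.shape)
```

After the loop, the empirical mean and variance of `u` are close to those of the assumed posterior, illustrating that the posterior is (approximately, after discretization) invariant for the dynamics.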
The MCMC methodology allows one to use a proposal based on a discretization of a diffusion, combined with an accept-reject mechanism to remove the discretization bias, to produce, in the large-time asymptotic, correlated posterior samples. Heuristically, the rate of convergence of the un-discretized diffusion may guide the choice of proposal. Proposals based on Langevin diffusions were first suggested in Besag (1994), and the exponential ergodicity of the resulting algorithms was analysed in Roberts and Tweedie (1996). The paper Girolami and Calderhead (2011) considered changing the metric on the parameter space in order to accelerate MCMC algorithms by taking into account the geometric structure that the posterior defines in the parameter space. This led to a new family of Riemannian MCMC algorithms. Our paper is concerned with the study of un-discretized diffusions; the effect of the accept-reject mechanism on rates and ergodicity of MCMC methods will be studied elsewhere. We suggest that a way to guide the choice of metric of Riemannian MCMC methods is to choose the one that leads to a faster rate of convergence of the diffusion under certain constraints. We emphasize that despite working with un-discretized diffusions, our guidance for choice of proposals accounts for the fact that discretization will eventually be needed. Our criterion weeds out choices of metric that lead to diffusions that achieve a fast rate of convergence by merely speeding up the drift. This is crucial, since a larger drift typically leads to a larger discretization error, and therefore to more rejections in the MCMC accept-reject mechanism in order to remove the bias in the discrete chain. This important constraint on the size of the drift seems to have been overlooked in existing continuous-time analyses of MCMC methods.
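The combination of a discretized Langevin proposal with an accept-reject step can be sketched as follows. This is a generic Metropolis-adjusted Langevin sampler with the Euclidean metric, not the Riemannian scheme discussed in the paper; the one-dimensional Gaussian prior and likelihood, the step size, and all function names are assumptions made for illustration.

```python
import numpy as np

def log_post(u):
    """Unnormalized log-posterior: N(0, 1) prior times a Gaussian likelihood
    centred at 2 with variance 1/4 (both assumed for illustration)."""
    return -0.5 * u**2 - 2.0 * (u - 2.0) ** 2

def grad_log_post(u):
    return -u - 4.0 * (u - 2.0)

def log_proposal(a, b, step):
    """Log-density (up to a constant) of proposing a from b with one Langevin step."""
    return -((a - b - step * grad_log_post(b)) ** 2) / (4.0 * step)

def mala(n_steps, step, u0, rng):
    """Metropolis-adjusted Langevin chain; returns samples and acceptance rate."""
    u, samples, accepted = u0, np.empty(n_steps), 0
    for i in range(n_steps):
        prop = u + step * grad_log_post(u) + np.sqrt(2.0 * step) * rng.normal()
        log_alpha = (log_post(prop) + log_proposal(u, prop, step)
                     - log_post(u) - log_proposal(prop, u, step))
        if np.log(rng.uniform()) < log_alpha:   # accept-reject removes the bias
            u, accepted = prop, accepted + 1
        samples[i] = u
    return samples, accepted / n_steps
```

For this assumed model the posterior is Gaussian with mean 1.6 and variance 0.2, which gives a simple check, for example via `mala(20000, 0.1, 0.0, np.random.default_rng(0))`. Increasing `step` speeds up the drift but lowers the acceptance rate, which is the trade-off the paragraph above describes.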
In summary, the following points highlight the key elements and common structure of the variational formulations of the Bayesian update and of the study of the associated gradient flows:
- The posterior can be characterized as the minimizer of different functionals on probability measures or densities.
- One can then study the gradient flows of these functionals with respect to a metric on the space of probability measures or densities; the resulting curve is a curve of maximal slope and its endpoint is the posterior.
- The gradient flows are characterized by a Fokker-Planck PDE that governs the evolution of the density of an associated diffusion process.
- By studying the convexity of the functionals (with respect to a given metric) one can obtain rates of convergence of the gradient flows towards the posterior. In particular, the level of convexity determines the speed of convergence of the densities of the associated diffusion process towards the posterior, and hence can be used as a criterion to guide the choice of proposals for MCMC methods; here we emphasize that care must be taken when comparing different diffusions if a higher speed of convergence is at the cost of a more expensive discretization.
The ideas in this paper immediately extend beyond the Bayesian interpretation stressed here to any application (e.g. the study of conditioned diffusions) where a measure of interest is defined in terms of a reference measure and a change of measure. Also, we consider only Kullback-Leibler and $\chi^{2}$ prior penalizations to define the functionals $J_{\mbox{\tiny{\rm KL}}}$ and $J_{\mbox{\tiny{\rm\chi^{2}}}}$, but it would be possible to extend the analysis to the family of $m$-divergences introduced in Ohta and Takatsu (2011). Kullback-Leibler and $\chi^{2}$ prior penalization correspond to $m\to 1$ and $m=2$ within this family. In what follows we point out some of the features of the different functionals and gradient flows that we consider in this paper.
1.1 Comparison of Functionals and Flows
We now provide a comparison of the three choices of functionals that we consider.
1. The two gradient flows in Wasserstein space (arising from the functionals $J_{\mbox{\tiny{\rm KL}}}$ and $J_{\mbox{\tiny{\rm\chi^{2}}}}$) are fundamentally connected with the variational formulation: these variational formulations can be used to define posterior-type measures via a penalization of deviations from the prior and deviations from the data in situations where establishing the existence of conditional distributions by disintegration of measures is technically demanding. On the other hand, the variational formulation for the Dirichlet energy is less natural and requires previous knowledge of the posterior.
2. The precise level of geodesic convexity of the functionals $J_{\mbox{\tiny{\rm KL}}}$ (and $J_{\mbox{\tiny{\rm\chi^{2}}}}$) can be computed from point evaluation of the Ricci tensor (of the parameter space) and derivatives of the densities. In particular, knowledge of the underlying metric suffices to compute these quantities. In contrast, establishing a sharp Poincaré inequality (the level of geodesic convexity of the Dirichlet energy in $L^{2}$) is in practice infeasible, as it effectively requires solving an infinite dimensional optimization problem. It is for this reason, and because of the explicit dependence of the convexity in Wasserstein space on the geometry induced by the manifold metric tensor, that our analysis of the choice of metric in Riemannian MCMC methods is based on the functional $J_{\mbox{\tiny{\rm KL}}}$ (see section 4, and in particular Theorem 4.1).
3. On the flip side of point 2, a Poincaré inequality for the posterior with a not necessarily optimal constant can be established using only tail information. In particular, even when the functional $J_{\mbox{\tiny{\rm KL}}}$ is not geodesically convex in Wasserstein space, one may still be able to obtain a Poincaré inequality (see subsection 5.2 for an example).
4. In contrast to the diffusions arising from the $J_{\mbox{\tiny{\rm KL}}}$ or Dirichlet flows, the stochastic processes arising from the $J_{\mbox{\tiny{\rm\chi^{2}}}}$ formulation are inhomogeneous, and hence simulation seems more challenging unless further structure is assumed on the prior measure and likelihood function. Also, the evolution of densities of the gradient flow of $J_{\mbox{\tiny{\rm\chi^{2}}}}$ in Wasserstein space is given by a porous medium PDE.
1.2 Outline
The rest of the paper is organized as follows. Section 2 contains some background material on the Wasserstein space, geodesic convexity of functionals, and gradient flows in metric spaces. The core of the paper is section 3, where we study the geodesic convexity, PDEs, and diffusions associated with each of the three functionals $J_{\mbox{\tiny{\rm KL}}}$, $J_{\mbox{\tiny{\rm\chi^{2}}}}$, and the Dirichlet energy. In section 4 we consider an application of the theory to the choice of metric in Riemannian MCMC methods Girolami and Calderhead (2011), and in section 5 we illustrate the main concepts and ideas through examples arising in Bayesian formulations of semi-supervised learning Garcia Trillos and Sanz-Alonso (2018), Garcia Trillos et al. (2017b), Bertozzi et al. (2017). We close in section 6 by summarizing our main contributions and pointing to open directions.
1.3 Set-up and Notation
$(\mathcal{M},g)$ will denote a smooth connected finite-dimensional Riemannian manifold with metric tensor $g$, representing the parameter space. We will denote by $d_{g}$ the associated Riemannian distance, and assume that $(\mathcal{M},d_{g})$ is a complete metric space. By the Hopf-Rinow theorem it follows that $(\mathcal{M},d_{g})$ is a geodesic space; we refer to subsection 2.1 for a discussion on geodesic spaces and their relevance here. We denote by $\mathrm{vol}_{g}$ the associated volume form. To emphasize the dependence of differential operators on the metric with which $\mathcal{M}$ is endowed, we write $\nabla_{g}$, $\mathrm{div}_{g}$, $\mathrm{Hess}_{g}$, and $\Delta_{g}$ for the gradient, divergence, Hessian, and Laplace-Beltrami operators on $\mathcal{M}$. The reader not versed in Riemannian geometry may focus on the case where $\mathcal{M}$ is Euclidean space with the usual metric tensor, in which case $d_{g}$ is the Euclidean distance and $\mathrm{vol}_{g}$ is the Lebesgue measure. However, in section 4, where we discuss applications to Riemannian MCMC, we endow the parameter space with a general metric tensor $g$, and hence familiarity with some notions from differential geometry is desirable.
We denote by $\mathcal{P}(\mathcal{M})$ the space of probability measures on $\mathcal{M}$ (endowed with the Borel $\sigma$-algebra). We will be concerned with the update of a prior probability measure $\pi\in\mathcal{P}(\mathcal{M})$, which represents various degrees of belief on the value of a quantity or parameter of interest, into a posterior probability measure $\mu\in\mathcal{P}(\mathcal{M})$, based on observed data $y$. We will assume that the prior is defined as a change of measure from $\mathrm{vol}_{g}$, and that the posterior is defined as a change of measure from $\pi$ as follows:
$$\pi(du)\propto\exp\bigl(-\Psi(u)\bigr)\,\mathrm{vol}_{g}(du),\qquad \mu(du)\propto\exp\bigl(-\phi(u;y)\bigr)\,\pi(du).$$
The data is incorporated in the Bayesian update through the negative log-likelihood function $\phi(u;y)$.
2 Preliminaries
In this section we provide some background material. The Wasserstein space and the notion of $\lambda$-geodesic convexity of functionals are reviewed in subsection 2.1. Gradient flows in metric spaces are reviewed in subsection 2.2.
2.1 Geodesic Spaces and Geodesic Convexity of Functionals
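A geodesic space is a metric space with a notion of length of curves that is compatible with the metric, and where every two points in the space can be connected by a curve whose length achieves the distance between the points (see Burago et al. (2001) for more details). Geodesic spaces constitute a large family of metric spaces with a rich theory of gradient flows. Here we consider three geodesic spaces. First, the base space $(\mathcal{M},d_{g})$, i.e. the manifold equipped with its Riemannian distance. Second, the space $\mathcal{P}_{2}(\mathcal{M})$ of square integrable Borel probability measures defined on $\mathcal{M}$, endowed with the Wasserstein distance $W_{2}$. Third, the space of functions $\rho\in L^{2}(\mathcal{M},\mu)$ with $\int_{\mathcal{M}}\rho\,d\mu=1$, equipped with the $L^{2}(\mathcal{M},\mu)$ norm.
We spell out the definitions of $\mathcal{P}_{2}(\mathcal{M})$ and $W_{2}$:
$$\mathcal{P}_{2}(\mathcal{M}):=\Bigl\{\nu\in\mathcal{P}(\mathcal{M}):\int_{\mathcal{M}}d_{g}^{2}(u,u_{0})\,d\nu(u)<\infty\ \text{for some}\ u_{0}\in\mathcal{M}\Bigr\},$$
$$W_{2}^{2}(\nu_{0},\nu_{1}):=\inf_{\alpha}\int_{\mathcal{M}\times\mathcal{M}}d_{g}^{2}(u,v)\,d\alpha(u,v),\qquad\nu_{0},\nu_{1}\in\mathcal{P}_{2}(\mathcal{M}).\qquad(2.1)$$
The infimum in the previous display is taken over all transportation plans between $\nu_{0}$ and $\nu_{1}$, i.e. over $\alpha\in\mathcal{P}(\mathcal{M}\times\mathcal{M})$ with marginals $\nu_{0}$ and $\nu_{1}$ on the first and second factors. The space $(\mathcal{P}_{2}(\mathcal{M}),W_{2})$ is indeed a geodesic space: geodesics in $(\mathcal{P}_{2}(\mathcal{M}),W_{2})$ are induced by those in $(\mathcal{M},d_{g})$. All it takes to construct a geodesic connecting $\nu_{0}$ and $\nu_{1}$ is to find an optimal transport plan between $\nu_{0}$ and $\nu_{1}$ to determine source locations and target locations, and then transport the mass along geodesics in $(\mathcal{M},d_{g})$ (see Villani (2003) and Santambrogio (2015)).
The space of functions $\rho\in L^{2}(\mathcal{M},\mu)$ with $\int_{\mathcal{M}}\rho\,d\mu=1$, equipped with the $L^{2}(\mathcal{M},\mu)$ norm, is also a geodesic space, where a constant speed geodesic connecting $\rho_{0}$ and $\rho_{1}$ is given by linear interpolation: $t\in[0,1]\mapsto(1-t)\rho_{0}+t\rho_{1}$.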
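The construction of Wasserstein geodesics from optimal transport plans is easiest to see on the real line, where the optimal plan between two empirical measures with equally many atoms matches sorted atoms. The sketch below relies on that one-dimensional fact; the function names are hypothetical and the setting (equal weights, equal sample sizes) is an assumption.

```python
import numpy as np

def w2_empirical_1d(x, y):
    """W_2 between two empirical measures on the real line with equally many,
    equally weighted atoms: the optimal plan pairs the sorted atoms."""
    x, y = np.sort(x), np.sort(y)
    return float(np.sqrt(np.mean((x - y) ** 2)))

def displacement_interpolation(x, y, t):
    """Atoms at time t in [0, 1] of the Wasserstein geodesic from x to y:
    each atom moves along the straight line to its optimal target."""
    return (1.0 - t) * np.sort(x) + t * np.sort(y)
```

Along the interpolation, `w2_empirical_1d(x, displacement_interpolation(x, y, t))` equals `t * w2_empirical_1d(x, y)`, which is the constant speed property of a geodesic.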
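We will consider several functionals throughout the paper. They will all be defined in one of our three geodesic spaces, that is, $\mathcal{M}$, $\mathcal{P}_{2}(\mathcal{M})$ or $L^{2}(\mathcal{M},\mu)$. Important examples will be, respectively:
1. Real-valued functions defined on $\mathcal{M}$.
2. The Kullback-Leibler and $\chi^{2}$ divergences $D_{\mbox{\tiny{\rm KL}}}(\cdot\|\pi),\ D_{\mbox{\tiny{\rm\chi^{2}}}}(\cdot\|\pi):\mathcal{P}(\mathcal{M})\to[0,\infty]$, where $\pi$ is a given (prior) measure and, for $\nu\in\mathcal{P}(\mathcal{M})$,
$$D_{\mbox{\tiny{\rm KL}}}(\nu\|\pi):=\begin{cases}\displaystyle\int_{\mathcal{M}}\frac{d\nu}{d\pi}\log\Bigl(\frac{d\nu}{d\pi}\Bigr)\,d\pi,&\nu\ll\pi,\\ \infty,&\text{otherwise},\end{cases}\qquad(2.2)$$
$$D_{\mbox{\tiny{\rm\chi^{2}}}}(\nu\|\pi):=\begin{cases}\displaystyle\int_{\mathcal{M}}\Bigl(\frac{d\nu}{d\pi}-1\Bigr)^{2}\,d\pi,&\nu\ll\pi,\\ \infty,&\text{otherwise},\end{cases}\qquad(2.3)$$
and the potential-type functional given by
[TABLE]
for a given potential function.
3. The Dirichlet energy defined by
[TABLE]
Recall that here and throughout, $\nabla_{g}$ denotes the gradient in $(\mathcal{M},g)$ and $|\cdot|$ is the norm on each tangent space $T_{x}\mathcal{M}$.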
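For measures on a finite set the two divergences in the list above reduce to finite sums. The following sketch assumes the standard definitions (integral of the density ratio times its logarithm, and integral of the squared deviation of the density ratio from one, both with respect to the reference measure); the helper names are hypothetical.

```python
import numpy as np

def _density_ratio(nu, pi):
    """Return (d nu / d pi, pi) restricted to the support of pi, or (None, None)
    when nu is not absolutely continuous with respect to pi."""
    nu, pi = np.asarray(nu, float), np.asarray(pi, float)
    if np.any((pi == 0) & (nu > 0)):
        return None, None
    mask = pi > 0
    return nu[mask] / pi[mask], pi[mask]

def kl_divergence(nu, pi):
    ratio, weights = _density_ratio(nu, pi)
    if ratio is None:
        return np.inf
    positive = ratio > 0   # convention: 0 * log 0 = 0
    return float(np.sum(weights[positive] * ratio[positive] * np.log(ratio[positive])))

def chi2_divergence(nu, pi):
    ratio, weights = _density_ratio(nu, pi)
    if ratio is None:
        return np.inf
    return float(np.sum(weights * (ratio - 1.0) ** 2))
```

Both functions return zero when the two measures coincide and infinity when absolute continuity fails, matching the conventions in the definitions.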
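A crucial unifying concept will be that of $\lambda$-geodesic convexity of functionals. We recall it here:
Definition 2.1.
Let $(X,d)$ be a geodesic space and let $\lambda\in\mathbb{R}$. A functional $E:X\to\mathbb{R}\cup\{\infty\}$ is called $\lambda$-geodesically convex provided that for any $x_{0},x_{1}\in X$ there exists a constant speed geodesic $t\in[0,1]\mapsto\gamma(t)\in X$ such that $\gamma(0)=x_{0}$, $\gamma(1)=x_{1}$, and
$$E\bigl(\gamma(t)\bigr)\leq(1-t)E(x_{0})+tE(x_{1})-\frac{\lambda}{2}\,t(1-t)\,d(x_{0},x_{1})^{2},\qquad\forall t\in[0,1].$$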
The following remark characterizes the $\lambda$-convexity of functionals when the geodesic space is $(\mathcal{M},d_{g})$.
Remark 2.2.
Let $E\in C^{2}(\mathcal{M})$, so that we can define its Hessian at all points in $\mathcal{M}$ (see the proof of Theorem 4.1 in the appendix for the definition). Then the following conditions are equivalent:
(i) $E$ is $\lambda$-geodesically convex.
(ii) $\mathrm{Hess}_{g}E(x)(v,v)\geq\lambda$ for all $x\in\mathcal{M}$ and all unit vectors $v\in T_{x}\mathcal{M}$.
If $\mathcal{M}$ is the Euclidean space, (i) and (ii) are also equivalent to:
(iii) $x\mapsto E(x)-\frac{\lambda}{2}|x|^{2}$ is a convex function.
This latter condition is known in the optimization literature as strong convexity.
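In Euclidean space the equivalence between a Hessian lower bound and the convexity inequality along straight lines can be verified directly for a quadratic function. The sketch below is only an illustration: the matrix `A` is an arbitrary assumed example, and `convexity_gap` is a hypothetical helper measuring the slack in the standard $\lambda$-convexity inequality with the factor $\lambda/2$.

```python
import numpy as np

A = np.array([[2.0, 0.5], [0.5, 1.0]])   # assumed symmetric positive definite Hessian
lam = float(np.linalg.eigvalsh(A)[0])    # smallest Hessian eigenvalue

def energy(x):
    """Quadratic function with constant Hessian A."""
    return 0.5 * float(x @ A @ x)

def convexity_gap(x0, x1, t):
    """Slack in the lam-convexity inequality along the segment from x0 to x1."""
    chord = (1.0 - t) * energy(x0) + t * energy(x1)
    penalty = 0.5 * lam * t * (1.0 - t) * float(np.sum((x0 - x1) ** 2))
    return chord - penalty - energy((1.0 - t) * x0 + t * x1)
```

The gap is non-negative for every pair of points and every `t` in [0, 1], and it vanishes when `x0 - x1` is an eigenvector for the smallest eigenvalue, so `lam` cannot be improved.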
2.2 Gradient Flows in Metric Spaces
In this subsection we review the basic concepts needed to define gradient flows in a metric space $(X,d)$. We follow Chapter 8 of Santambrogio (2015); a standard technical reference is Ambrosio et al. (2008).
To guide the reader, we first recall the formulation of gradient flows in Euclidean space, where $X$ is Euclidean space and $d$ is the Euclidean metric. Let $E:X\to\mathbb{R}$ be a differentiable function, and consider the equation
$$\dot{x}(t)=-\nabla E\bigl(x(t)\bigr),\quad t>0,\qquad x(0)=x_{0}.\qquad(2.5)$$
Then, the solution to (2.5) is the gradient flow of $E$ in Euclidean space with initial condition $x_{0}$; it is a curve whose tangent vector at every point in time is the negative of the gradient of the function at that time. In order to generalize the notion of a gradient flow to functionals defined on more general metric spaces, and in particular when the metric space has no differential structure, we reformulate (2.5) in integral form by using that $\frac{d}{dt}E\bigl(x(t)\bigr)=\langle\nabla E\bigl(x(t)\bigr),\dot{x}(t)\rangle=-\frac{1}{2}|\dot{x}(t)|^{2}-\frac{1}{2}|\nabla E\bigl(x(t)\bigr)|^{2}$ as follows:
$$E(x_{0})=E\bigl(x(t)\bigr)+\frac{1}{2}\int_{0}^{t}|\dot{x}(r)|^{2}\,dr+\frac{1}{2}\int_{0}^{t}|\nabla E\bigl(x(r)\bigr)|^{2}\,dr,\qquad t>0.\qquad(2.6)$$
This identity, known as energy dissipation equality, is equivalent to (2.5); see Chapter 8 of Santambrogio (2015) for further details and other possible formulations. Crucially, (2.6) involves notions that can be defined in an arbitrary metric space $(X,d)$: the metric derivative of a curve $t\mapsto x(t)\in X$ is given by
$$|\dot{x}|(t):=\lim_{s\to t}\frac{d\bigl(x(s),x(t)\bigr)}{|s-t|},$$
and the slope of a functional $E:X\to\mathbb{R}\cup\{\infty\}$ is defined as the map $x\mapsto|\nabla E|(x)$ given by
$$|\nabla E|(x):=\limsup_{y\to x}\frac{\bigl(E(x)-E(y)\bigr)_{+}}{d(x,y)}.$$
The identity (2.6) is the standard way to introduce gradient flows in arbitrary metric spaces. In this paper we consider gradient flows in $L^{2}$ and Wasserstein spaces, where the notion of tangent vector is available. $L^{2}$ has Hilbert space structure, whereas the Wasserstein space can be seen as an infinite dimensional manifold (see Ambrosio et al. (2008), Santambrogio (2015)).
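The energy dissipation equality can be checked numerically for a Euclidean gradient flow whose solution is known in closed form. The sketch below assumes a diagonal quadratic energy, so that each coordinate decays exponentially; the values of `a` and `x0` and the helper `trapezoid` are illustrative choices rather than anything taken from the paper.

```python
import numpy as np

a = np.array([1.0, 4.0])     # eigenvalues of the assumed quadratic energy
x0 = np.array([2.0, -1.0])   # assumed initial condition

def energy(x):
    """E(x) = 0.5 * sum_i a_i x_i^2, evaluated along the last axis."""
    return 0.5 * np.sum(a * x**2, axis=-1)

t = np.linspace(0.0, 3.0, 20001)
x = x0 * np.exp(-np.outer(t, a))   # exact solution of x'(t) = -grad E(x(t))
velocity = -a * x                  # x'(t)
gradient = a * x                   # grad E(x(t))

def trapezoid(f, grid):
    """Composite trapezoidal rule on a one-dimensional grid."""
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(grid)))

dissipated = (0.5 * trapezoid(np.sum(velocity**2, axis=1), t)
              + 0.5 * trapezoid(np.sum(gradient**2, axis=1), t))
energy_drop = float(energy(x0) - energy(x[-1]))
```

Up to quadrature error, `energy_drop` and `dissipated` agree: the energy lost along the curve is accounted for by the kinetic term and the slope term in equal parts.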
3 Variational Characterizations of the Posterior and Gradient Flows
In this section we lay out the main elements of the theory of variational formulations and gradient flows with regard to the Bayesian update. Subsection 3.1 details three variational formulations defined in terms of the functionals $J_{\mbox{\tiny{\rm KL}}}$, $J_{\mbox{\tiny{\rm\chi^{2}}}}$ and the Dirichlet energy. Subsection 3.2 studies the geodesic convexity of $J_{\mbox{\tiny{\rm KL}}}$ and $J_{\mbox{\tiny{\rm\chi^{2}}}}$ in Wasserstein space and of the Dirichlet energy in $L^{2}$. Finally, subsection 3.3 collects the PDEs that characterize the gradient flows, as well as the corresponding diffusion processes.
3.1 Variational Formulation of the Bayesian Update
The variational formulations of the posterior as the minimizer of $J_{\mbox{\tiny{\rm KL}}}$ and $J_{\mbox{\tiny{\rm\chi^{2}}}}$ share the same structure and will be outlined first. The variational formulation in terms of the Dirichlet energy will be given below.
3.1.1 The Functionals $J_{\mbox{\tiny{\rm KL}}}$ and $J_{\mbox{\tiny{\rm\chi^{2}}}}$
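In mathematical analysis Jordan and Kinderlehrer (1996) and probability theory Dupuis and Ellis (2011) it is often useful to note that a probability measure $\mu$ defined by
$$\mu(du)=\frac{1}{Z}\exp\bigl(-\phi(u;y)\bigr)\,\pi(du)\qquad(3.1)$$
is the minimizer of the functional
$$J_{\mbox{\tiny{\rm KL}}}(\nu):=D_{\mbox{\tiny{\rm KL}}}(\nu\|\pi)+F_{\mbox{\tiny{\rm KL}}}(\nu;\phi),\qquad(3.2)$$
where
$$F_{\mbox{\tiny{\rm KL}}}(\nu;\phi):=\int_{\mathcal{M}}\phi(u;y)\,d\nu(u),\qquad(3.3)$$
and the integral is interpreted as $+\infty$ if $\phi$ is not integrable with respect to $\nu$. In physical terms, the Kullback-Leibler divergence $D_{\mbox{\tiny{\rm KL}}}(\nu\|\pi)$ represents an internal energy, $F_{\mbox{\tiny{\rm KL}}}(\nu;\phi)$ represents a potential energy, and the constant $Z$ is known as the partition function. Here we are concerned with a statistical interpretation of equation (3.1), and view it as defining a posterior measure as a change of measure from a prior measure. In this context, the Kullback-Leibler term in (3.2) represents a penalization of deviations from prior beliefs, the term $F_{\mbox{\tiny{\rm KL}}}(\nu;\phi)$ penalizes deviations from the data, and the normalizing constant $Z$ represents the marginal likelihood. For brevity, we will henceforth suppress the data $y$ from the negative log-likelihood function $\phi$, writing $\phi(u)$ instead of $\phi(u;y)$.
We remark that the fact that $\mu$ minimizes $J_{\mbox{\tiny{\rm KL}}}$ follows immediately from the identity
$$J_{\mbox{\tiny{\rm KL}}}(\nu)=D_{\mbox{\tiny{\rm KL}}}(\nu\|\mu)-\log Z.$$
Minimizing $J_{\mbox{\tiny{\rm KL}}}$ or $D_{\mbox{\tiny{\rm KL}}}(\cdot\|\mu)$ is thus equivalent, but the functional $J_{\mbox{\tiny{\rm KL}}}$ makes apparent the roles of the prior and the likelihood.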
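The identity behind this remark can be verified in a few lines. The derivation below is a sketch under assumed notation: $\pi$ is the prior, $\mu$ the posterior with $d\mu/d\pi=e^{-\phi}/Z$, $Z$ the normalizing constant, and $\nu$ any probability measure that is absolutely continuous with respect to $\pi$ and for which $\phi$ is integrable.

```latex
\begin{align*}
D_{\mathrm{KL}}(\nu\|\mu)
  &= \int \log\Bigl(\frac{d\nu}{d\mu}\Bigr)\,d\nu
   = \int \log\Bigl(\frac{d\nu}{d\pi}\Bigr)\,d\nu
     - \int \log\Bigl(\frac{d\mu}{d\pi}\Bigr)\,d\nu \\
  &= D_{\mathrm{KL}}(\nu\|\pi) + \int \phi\,d\nu + \log Z .
\end{align*}
```

The first two terms on the last line are the Kullback-Leibler functional evaluated at $\nu$, so that functional differs from $D_{\mathrm{KL}}(\nu\|\mu)$ only by the constant $\log Z$, and both are minimized at $\nu=\mu$.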
The posterior also minimizes the functional
[TABLE]
where
[TABLE]
We refer to Ohta and Takatsu (2011) for details. Note that both $J_{\mbox{\tiny{\rm KL}}}$ and $J_{\mbox{\tiny{\rm\chi^{2}}}}$ are defined in terms of the two starting points of the Bayesian update: the prior $\pi$ and the negative log-likelihood $\phi$. The associated variational formulations suggest a way to define posterior-type measures based on these two ingredients in scenarios where establishing the existence of conditional distributions via disintegration of measures is technically demanding. This appealing feature of the two variational formulations above is not shared by the one described in the next subsection.
3.1.2 The Dirichlet Energy
Let now the posterior $\mu$ be given, and consider the space $L^{2}(\mathcal{M},\mu)$ of functions defined on $\mathcal{M}$ which are square integrable with respect to $\mu$. Recall the Dirichlet energy
[TABLE]
introduced in equation (2.4). Now, since the measure $\mu$ can be characterized as the probability measure with density $\rho\equiv 1$ a.s. with respect to $\mu$, it follows that the posterior density is the minimizer of the Dirichlet energy over probability densities $\rho\in L^{2}(\mathcal{M},\mu)$ with $\int_{\mathcal{M}}\rho\,d\mu=1$.
3.2 Geodesic Convexity and Functional Inequalities
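In this section we study the geodesic convexity of the functionals $J_{\mbox{\tiny{\rm KL}}}$, $J_{\mbox{\tiny{\rm\chi^{2}}}}$, and the Dirichlet energy. The geodesic convexity of $J_{\mbox{\tiny{\rm KL}}}$ and $J_{\mbox{\tiny{\rm\chi^{2}}}}$ in Wasserstein space is considered first, and will be followed by the geodesic convexity of the Dirichlet energy in $L^{2}(\mathcal{M},\mu)$. We will show the equivalence of the latter to the posterior satisfying a Poincaré inequality.
3.2.1 Geodesic Convexity of $J_{\mbox{\tiny{\rm KL}}}$ and $J_{\mbox{\tiny{\rm\chi^{2}}}}$
The next proposition can be found in von Renesse and Sturm (2005) and Sturm (2006). It shows that the convexity of $J_{\mbox{\tiny{\rm KL}}}$ can be determined by the so-called curvature-dimension condition, a condition that involves the curvature of the manifold and the Hessian of the combined change of measure $\Psi+\phi$. We recall the notation $\pi\propto e^{-\Psi}\,\mathrm{vol}_{g}$ and $\mu\propto e^{-\phi}\,\pi$.
Proposition 3.1.
Suppose that $\Psi,\phi\in C^{2}(\mathcal{M})$. Then $J_{\mbox{\tiny{\rm KL}}}$ (or $D_{\mbox{\tiny{\rm KL}}}(\cdot\|\mu)$) is $\lambda$-geodesically convex if, and only if,
$$\mathrm{Ric}_{g}(v,v)+\mathrm{Hess}_{g}\Psi(v,v)+\mathrm{Hess}_{g}\phi(v,v)\geq\lambda,\qquad\forall x\in\mathcal{M},\ \forall v\in T_{x}\mathcal{M}\ \text{with}\ |v|=1,$$
where $\mathrm{Ric}_{g}$ denotes the Ricci curvature tensor.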
We recall that the Ricci curvature provides a way to quantify the disagreement between the geometry of a Riemannian manifold and that of ordinary Euclidean space. The Ricci tensor is defined as the trace of a map involving the Riemannian curvature (see do Carmo Valero (1992)).
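The following example illustrates the geodesic convexity of $D_{\mbox{\tiny{\rm KL}}}(\cdot\|\mu)$ for Gaussian $\mu$.
Example 1.
Let $\mu$ be a Gaussian measure with covariance $\Sigma$ in Euclidean space (endowed with the Euclidean metric), with $\Sigma$ positive definite. Then $J_{\mbox{\tiny{\rm KL}}}$ (equivalently, $D_{\mbox{\tiny{\rm KL}}}(\cdot\|\mu)$) is $1/\Lambda_{\max}(\Sigma)$-geodesically convex, where $\Lambda_{\max}(\Sigma)$ is the largest eigenvalue of $\Sigma$. This follows immediately from the above, since here the negative log-density of $\mu$ has constant Hessian $\Sigma^{-1}$, whose smallest eigenvalue is $1/\Lambda_{\max}(\Sigma)$, and the Euclidean space is flat (its Ricci curvature is identically equal to zero). Note that the level of convexity of the functional depends only on the largest eigenvalue of the covariance, but not on the dimension of the underlying space.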
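The convexity level in the Gaussian example is a one-line computation once the covariance is known. The sketch below assumes a randomly generated positive definite covariance purely for illustration; `convexity_level` is a hypothetical helper returning the reciprocal of the largest covariance eigenvalue.

```python
import numpy as np

def convexity_level(sigma):
    """Reciprocal of the largest eigenvalue of a symmetric positive definite matrix."""
    return 1.0 / float(np.linalg.eigvalsh(sigma)[-1])

rng = np.random.default_rng(1)
B = rng.normal(size=(5, 5))
Sigma = B @ B.T + 0.1 * np.eye(5)   # assumed covariance, positive definite by construction

lam = convexity_level(Sigma)
# Eigenvalues of the Hessian of the negative log-density, i.e. of the precision matrix.
hessian_eigenvalues = np.linalg.eigvalsh(np.linalg.inv(Sigma))
```

The smallest entry of `hessian_eigenvalues` coincides with `lam`, which is the lower bound on the Hessian used in the example.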
The -convexity of guarantees the existence of the gradient flow of in Wasserstein space. Moreover, it determines the rate of convergence towards the posterior Precisely, if is absolutely continuous with respect to and if , then the gradient flow of with respect to the Wasserstein metric starting at is well defined and we have:
[TABLE]
The second inequality, known as Talagrand inequality Villani (2003), establishes a comparison between Wasserstein geometry and information geometry. It can be established directly by combining the $\lambda$-geodesic convexity of $J_{\mbox{\tiny{\rm KL}}}$ (for positive $\lambda$) with the first inequality. From (3.7) we see that a higher level of convexity of $J_{\mbox{\tiny{\rm KL}}}$ guarantees a faster rate of convergence towards the posterior distribution $\mu$.
We now turn to the geodesic convexity properties of J_{\mbox{\tiny{\rm\chi^{2}}}}. We recall that denotes the dimension of the manifold The following proposition can be found in (Ohta and Takatsu, 2011, Theorem 4.1).
Proposition 3.2.
J_{\mbox{\tiny{\rm\chi^{2}}}}* is -geodesically convex if and only if both of the following two properties are satisfied:*
** 2. 2.
* is -geodesically convex as a real valued function defined on .*
There are two main conclusions we can extract from the previous proposition. First, that condition 1) is only related to the prior distribution whereas condition 2) is only related to the likelihood; in particular, the convexity properties of J_{\mbox{\tiny{\rm\chi^{2}}}} can indeed be studied by studying separately the prior and the likelihood (notice that the proposition gives an equivalence). Secondly, notice that condition 1) is a qualitative property and if it is not met there is no hope that the functional J_{\mbox{\tiny{\rm\chi^{2}}}} has any level of global convexity even when the likelihood function is a highly convex function. In addition, if 1) is satisfied, the convexity of determines completely the level of convexity of J_{\mbox{\tiny{\rm\chi^{2}}}}. These features are markedly different from the ones observed in the Kullback-Leibler case.
As for the functional , one can establish the following functional inequalities, under the assumption of -geodesic convexity of $J_{\chi^2}$ for :
[TABLE]
The above inequalities show that a higher level of convexity of $J_{\chi^2}$ guarantees a faster convergence towards the posterior distribution.
3.2.2 Geodesic Convexity of Dirichlet Energy
We now study the geodesic convexity of the Dirichlet energy functional defined in equation (2.4). In what follows we denote by the norm with respect to . Let us start by recalling the Poincaré inequality.
Definition 3.3.
We say that a Borel probability measure on has a Poincaré inequality with constant if for every satisfying we have
[TABLE]
We now show that Poincaré inequalities are directly related to the geodesic convexity of the functional in the space.
Proposition 3.4.
Let be a positive real number and let be a Borel probability measure on . Then, the measure has a Poincaré inequality with constant if and only if the functional is -geodesically convex in the space of functions satisfying .
Proof.
First of all we claim that
[TABLE]
for all and every . To see this, it is enough to assume that both and are finite and then notice that equality (3.9) follows from the easily verifiable fact that for an arbitrary Hilbert space with induced norm one has
[TABLE]
Now, suppose that has a Poincaré inequality with constant and consider two functions satisfying . Then, (3.9) combined with the Poincaré inequality (taking ) gives:
[TABLE]
which is precisely the -geodesic convexity condition for .
Conversely, suppose that is -geodesically convex in the space of functions that integrate to one. Let be such that and without loss of generality assume that and that . Under these conditions, the positive and negative parts of , and , satisfy and where . The inequality
[TABLE]
is obtained directly from (3.9) and (3.10) applied to
[TABLE]
∎
Remark 3.5.
It is well known that the best Poincaré constant for a measure is equal to the smallest non-trivial eigenvalue of the operator defined formally as
[TABLE]
where and are the divergence and gradient operators in . This eigenvalue can be written variationally as
[TABLE]
where
[TABLE]
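The variational characterization of the best Poincaré constant can be explored numerically in one dimension. The following is a minimal sketch, assuming a measure with Lebesgue density proportional to exp(-V) on a truncated interval; the function name, the grid parameters and the finite-difference scheme are illustrative choices. It discretizes the Dirichlet form and the weighted L2 norm on a uniform grid and returns the second smallest generalized eigenvalue, which approximates the spectral gap. For the standard Gaussian the exact value is 1.

```python
import numpy as np

def spectral_gap_1d(V, lo=-6.0, hi=6.0, n=601):
    """Approximate the smallest non-trivial eigenvalue of f -> -(1/rho) (rho f')'
    for rho proportional to exp(-V), using finite differences on [lo, hi]."""
    x = np.linspace(lo, hi, n)
    h = x[1] - x[0]
    rho = np.exp(-V(x))                           # density at grid points
    rho_mid = np.exp(-V(0.5 * (x[:-1] + x[1:])))  # density at cell interfaces

    # Stiffness matrix A with f^T A f = sum_i rho_mid[i] * ((f[i+1] - f[i]) / h)^2.
    A = np.zeros((n, n))
    idx = np.arange(n - 1)
    A[idx, idx] += rho_mid
    A[idx + 1, idx + 1] += rho_mid
    A[idx, idx + 1] -= rho_mid
    A[idx + 1, idx] -= rho_mid
    A /= h**2

    # Symmetrize the generalized problem A f = lam * diag(rho) f.
    scale = 1.0 / np.sqrt(rho)
    S = A * scale[:, None] * scale[None, :]
    eigenvalues = np.linalg.eigvalsh(S)  # ascending; eigenvalues[0] is close to 0
    return eigenvalues[1]

gap_standard_gaussian = spectral_gap_1d(lambda x: 0.5 * x**2)
print(gap_standard_gaussian)
```

The zero eigenvalue corresponds to constant functions, which is why the second eigenvalue is reported. Truncating the domain and using natural (no-flux) boundary conditions is a simplification that is harmless when the density is negligible at the endpoints.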
Remark 3.6.
Spectral gaps are used in the theory of MCMC as a means to bound the asymptotic variance of empirical expectations (Kipnis and Varadhan, 1986).
Let us now consider the flow of in with some initial condition . It is well known that this flow coincides with that of the functional in Wasserstein space. However, taking the Dirichlet- point of view, one can use a Poincaré inequality (i.e. the geodesic convexity of ) to deduce the exponential convergence of towards in the -sense. Indeed, let
[TABLE]
A standard computation then shows that
[TABLE]
In the second equality we have used that as discussed in subsection 3.3 below. Hence by Gronwall’s inequality, see e.g. Teschl (2012),
[TABLE]
3.3 PDEs and Diffusions
Here we describe the PDEs that govern the evolution of densities of the three gradient flows, and the stochastic processes associated with these PDEs. We consider first the flows defined with the functionals and and then the flow defined by the functional $J_{\chi^2}$.
3.3.1 -Wasserstein and -
It was shown in Jordan et al. (1998) (in the Euclidean setting and in the unweighted case) that the gradient flow of the Kullback-Leibler functional in Wasserstein space produces a solution to the Fokker-Planck equation. More generally, under the convexity conditions guaranteeing the existence of the gradient flow of (equivalently of ) starting from , the densities
[TABLE]
satisfy (formally) the following Fokker-Planck equations
[TABLE]
[TABLE]
Equation (3.12) can be identified as the evolution of the densities (w.r.t. ) of the diffusion
[TABLE]
where denotes a Brownian motion defined on and is the gradient on . Naturally, the flow in has the same associated Fokker-Planck equation (3.11) and diffusion process (3.13).
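To make the link between the Fokker-Planck equation and the Langevin diffusion concrete, here is a minimal Euler-Maruyama sketch. It assumes the Euclidean setting and the standard overdamped Langevin form dX = -grad Psi(X) dt + sqrt(2) dB, where `grad_potential` (a hypothetical name) stands for the gradient of the negative log-density of the target with respect to Lebesgue measure. The toy target is the standard normal, so the drift is simply -x. A cloud of independent particles started far from the target should end up with empirical mean close to 0 and variance close to 1, up to a small discretization bias.

```python
import numpy as np

def langevin_euler(grad_potential, x0, step, n_steps, rng):
    """Euler-Maruyama discretization of dX = -grad_potential(X) dt + sqrt(2) dB.
    x0 holds independent particles; all of them are advanced in parallel."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - step * grad_potential(x) + np.sqrt(2.0 * step) * noise
    return x

rng = np.random.default_rng(0)
particles = langevin_euler(
    grad_potential=lambda x: x,   # toy target: standard normal
    x0=np.full(20000, 3.0),       # all particles start at 3
    step=0.01,
    n_steps=1000,                 # total time 10
    rng=rng,
)
print(particles.mean(), particles.var())
```

The histogram of the particles at intermediate times approximates the evolving density; without a Metropolis correction the limiting law is only an approximation of the target whose error shrinks with the step size.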
3.3.2 $J_{\chi^2}$-Wasserstein
The PDE satisfied (formally) by the densities
[TABLE]
of the $J_{\chi^2}$-Wasserstein flow is the (weighted) porous medium equation:
[TABLE]
where the weighted Laplacian and divergence are defined formally as
[TABLE]
Consider now the stochastic process formally defined as the solution to the nonlinear diffusion
[TABLE]
where is the solution to (3.14). Let be the evolution of the densities (with respect to ) of the above diffusion. Then a formal computation shows that satisfies the Fokker-Planck equation:
[TABLE]
If we let we see, using (3.15), that
[TABLE]
implying that the distributions of the stochastic process (3.16) are those generated by the gradient flow of $J_{\chi^2}$ in Wasserstein space.
Remark 3.7.
In contrast with the Langevin diffusion (3.13), the process (3.16) is defined in terms of the solution of the equation satisfied by its densities. In particular, if one wanted to simulate (3.16) one would need to know the solution of (3.14) beforehand.
4 Application: Sampling and Riemannian MCMC
So far we have treated the Riemannian manifold as fixed. In this section we take a different perspective and treat the metric as a free parameter. Precisely, we will now consider a family of gradient flows of the functional with respect to Wasserstein distances induced by different metrics on the parameter space. We do this motivated by the so-called Riemannian MCMC methods for sampling, where a change of metric in the base space is introduced in order to produce Langevin-type proposals that are adapted to the geometric features of the target, thereby exploring regions of interest and accelerating the convergence of the chain to the posterior. There are different heuristics regarding the choice of metric (see Girolami and Calderhead (2011)), but no principled way to compare different metrics and rank their performance for sampling purposes. With the developments presented in this paper we propose one such principled criterion, as we describe below. We restrict our attention to the case .
Let be a Riemannian metric tensor on defined via
[TABLE]
where for every , is a positive definite matrix. In what follows we identify with and refer to both as ‘the metric’, and we use terms such as -geodesic, -Wasserstein distance, etc. to emphasize that the notions considered are being constructed using the metric . Let be the distance induced by the metric tensor and let be the associated volume form. Notice that in terms of the Lebesgue measure and the metric , we can write
[TABLE]
We use the canonical basis for as global chart for and consider the canonical vector fields . The Christoffel symbols associated to the Levi-Civita connection of the Riemannian manifold can be written in terms of derivatives of the metric as
[TABLE]
where on the right-hand side, and in what follows, we use Einstein’s summation convention. The proof of the following result is in the Appendix.
Theorem 4.1.
Let and
[TABLE]
The sharp constant for which (or ) is -geodesically convex in the -Wasserstein distance is equal to
[TABLE]
where is the usual (Euclidean) Hessian matrix of is the matrix with coordinates
[TABLE]
and is the matrix with coordinates
[TABLE]
Moreover, for any ,
[TABLE]
Note that is a key quantity in evaluating the quality of a metric in building geometry-informed Langevin diffusions for sampling purposes, as it gives the exponential rate at which the evolution of probabilities built using the metric converges towards the posterior: larger corresponds to faster convergence. However, in order to establish a fair performance comparison, the metrics need to be scaled appropriately. Indeed, a faster rate can be obtained by scaling down the metric (which can be thought of as time-rescaling), as is clearly seen from the scaling property (4.4) of the functional . It is important to note that scaling down the metric leads to a faster diffusion, but also makes its discretization more expensive, since the error of Euler discretizations is largely influenced by the Lipschitz constant of the drift. This motivates the following fair criterion for choosing the metric: maximize subject to the constraint
[TABLE]
since (where denotes the standard Euclidean gradient) is the drift of the diffusion (3.13). Note that the constraint (4.5) ensures that the metric cannot be scaled down arbitrarily while also guaranteeing that the discretizations do not become increasingly expensive. We remark that other constraints involving higher regularity requirements may be useful if higher order discretizations are desired.
Remark 4.2.
The functional can be used to determine the optimal metric among a certain subclass of metrics of interest satisfying the condition (4.5). For instance, it may be of interest to find the optimal constant metric (see Proposition 4.3 below), or to find the best metric within a finite family of metrics. On the other hand the constraint (4.5) forces feasible metrics to induce diffusions that are not expensive to discretize.
To illustrate the previous remark we show that for a Gaussian target measure the optimal preconditioner is, unsurprisingly, given by the Fisher information. More precisely we have the following proposition:
Proposition 4.3.
Let Then
[TABLE]
maximizes over the class of constant metrics satisfying as in (4.5). Moreover, the maximum value is
[TABLE]
Proof.
Suppose for the sake of contradiction that there exists a constant metric that satisfies condition (4.5), which in this case reads and is such that
Let be a unit norm eigenvector of with eigenvalue . Notice that by definition of we must have
[TABLE]
The left hand side of the above display can be rewritten as
[TABLE]
and by the Cauchy-Schwarz inequality we see that
[TABLE]
Since is an eigenvector of with eigenvalue , it follows that is also an eigenvector of with eigenvalue and of with eigenvalue . Therefore the right hand side of the above display is equal to one. This however contradicts (4.7). From this we deduce the optimality of among feasible metrics.
∎
Example 2.
Suppose that with
[TABLE]
Consider the optimal metric
[TABLE]
given by the previous proposition and the rescaled Euclidean metric
[TABLE]
where the scalings have been chosen so that
[TABLE]
A calculation then shows that while . Note that if the Euclidean metric is not rescaled by (violating the constraint (4.5)), then the same unit rate of convergence as with the metric is achieved. However, the drift of the associated diffusion
[TABLE]
is of order , making the discretization increasingly expensive in the small limit. On the other hand, since both and are of order , the drifts for both associated diffusions are of order . This motivates our choice of constraint in equation (4.5).
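The trade-off described in this example can be reproduced numerically for constant metrics. The sketch below assumes a centered Gaussian target with a hypothetical covariance diag(1, eps), and relies on the observation that for a constant metric G the Christoffel symbols vanish, so the convexity constant reduces to the smallest eigenvalue of G^{-1/2} Sigma^{-1} G^{-1/2}, while the drift x -> G^{-1} Sigma^{-1} x has Lipschitz constant equal to the spectral norm of G^{-1} Sigma^{-1}. The function name and the specific covariance are illustrative assumptions, not the values used in the example above.

```python
import numpy as np

def rate_and_drift_lipschitz(G, Sigma):
    """For the target N(0, Sigma) and a constant metric G, return the pair
    (convexity constant, Lipschitz constant of the drift x -> G^{-1} Sigma^{-1} x)."""
    precision = np.linalg.inv(Sigma)
    evals, evecs = np.linalg.eigh(G)
    G_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    rate = np.linalg.eigvalsh(G_inv_sqrt @ precision @ G_inv_sqrt).min()
    lipschitz = np.linalg.norm(np.linalg.solve(G, precision), 2)
    return rate, lipschitz

eps = 1e-2
Sigma = np.diag([1.0, eps])  # hypothetical ill-conditioned covariance

candidates = {
    "precision metric": np.linalg.inv(Sigma),
    "Euclidean, unscaled": np.eye(2),
    "Euclidean, rescaled": np.eye(2) / eps,
}
results = {name: rate_and_drift_lipschitz(G, Sigma) for name, G in candidates.items()}
for name, (rate, lip) in results.items():
    print(f"{name}: rate = {rate:.3g}, drift Lipschitz constant = {lip:.3g}")
```

In this toy setting the precision metric attains unit rate with unit Lipschitz constant, the unscaled Euclidean metric attains unit rate only at the price of a Lipschitz constant of order 1/eps, and the rescaled Euclidean metric satisfies the Lipschitz constraint but its rate degrades to order eps.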
5 Example: Semi-Supervised Learning
In this section we study the geodesic convexity of functionals arising in the Bayesian formulation of semi-supervised classification. Our purpose is to illustrate the concepts in a tangible setting, and to show that establishing sharp levels of geodesic convexity may be more tractable for some functionals than others.
In semi-supervised classification one is interested in the following task: given a data cloud together with (noisy) labels for some of the data points , classify the unlabeled data points by assigning labels to them. Here we assume that we have access to a weight matrix quantifying the level of similarity between the points in . Thus, we focus on the graph-based approach to semi-supervised classification, which boils down to propagating the known labels to the whole cloud, using the geometry of the weighted graph . We will investigate the existence and convergence of gradient flows for several Bayesian graph-based classification models proposed in Bertozzi et al. (2017). In the Bayesian approach, the geometric structure that the weighted graph imposes on the data cloud is used to build a prior on a latent space, and the noisy given labels are used to build the likelihood. The Bayesian solution to the classification problem is a measure on the latent space, which is then pushed forward to a measure on the label space . This latter measure contains information on the most likely labels, and also provides a principled way to quantify the remaining uncertainty in the classification process.
Let then be a weighted graph, where is the set of nodes of the graph and is the weight matrix between the points in . All the entries of are non-negative real numbers and we assume that is symmetric. Let be the graph Laplacian matrix defined by
[TABLE]
where is the degree matrix of the weighted graph, i.e., the diagonal matrix with diagonal entries . The above corresponds to the unnormalized graph Laplacian, but different normalizations are possible (Von Luxburg, 2007). The graph Laplacian will be used in all the models below to favor prior draws of the latent variables that are consistent with the geometry of the data cloud.
Remark 5.1.
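As a small illustration of the construction just described, the sketch below builds the unnormalized graph Laplacian as the degree matrix minus the weight matrix, following the standard convention. The function name and the four-node weight matrix are hypothetical. The checks at the end reflect two basic facts: rows of the Laplacian sum to zero, and for a connected graph the zero eigenvalue is simple, so the smallest non-trivial eigenvalue is strictly positive.

```python
import numpy as np

def unnormalized_graph_laplacian(W):
    """Return D - W, where D is the diagonal degree matrix of the symmetric,
    non-negative weight matrix W."""
    W = np.asarray(W, dtype=float)
    if not np.allclose(W, W.T):
        raise ValueError("the weight matrix must be symmetric")
    if np.any(W < 0):
        raise ValueError("the weights must be non-negative")
    degrees = W.sum(axis=1)
    return np.diag(degrees) - W

# Hypothetical connected graph on four nodes.
W = np.array([
    [0.0, 1.0, 0.5, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 2.0],
    [0.0, 0.0, 2.0, 0.0],
])
L = unnormalized_graph_laplacian(W)
eigenvalues = np.linalg.eigvalsh(L)  # ascending order
print(L.sum(axis=1))                 # rows sum to zero
print(eigenvalues[:2])               # zero, then the smallest non-trivial eigenvalue
```

Because constant vectors lie in the kernel of the Laplacian, Gaussian priors built from it are naturally defined on the subspace of functions with zero average.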
A special case of a weighted graph frequently found in the literature is that in which the points in are i.i.d. points sampled from some distribution on a manifold embedded in , and the similarity matrix is obtained as
[TABLE]
In the above, is a compactly supported kernel function, is the Euclidean distance between the points and and is a parameter controlling data density. It can be shown (see Burago et al. (2013) and Garcia Trillos et al. (2017a)) that the smallest non-trivial eigenvalue of a rescaled version of the resulting graph Laplacian is close to the smallest non-trivial eigenvalue of a weighted Laplacian on the manifold, provided that is scaled with appropriately.
We will now study the probit and logistic models in subsection 5.1, and then the Ginzburg-Landau model in 5.2.
5.1 Probit and Logistic Models
Traditionally, the probit approach to semi-supervised learning is to classify the unlabeled data points by first optimizing the functional given by
[TABLE]
over all satisfying and then thresholding the optimizer with the sign function; the parameter is used to regularize the functions . The minimizer of the functional can be interpreted as the MAP (maximum a posteriori estimator) in the Bayesian formulation of probit semi-supervised learning (see Bertozzi et al. (2017)) that we now recall:
Prior: Consider the subspace and let be the Gaussian measure on defined by
[TABLE]
The measure is interpreted as a prior distribution on the space of real valued functions on the point cloud with average zero. Larger values of force more regularization of the functions .
Likelihood function: For a fixed and for define
[TABLE]
where the are i.i.d. and is the sign function. This specifies the distribution of observed labels given the underlying latent variable . We then define, for given data , the negative log-density function
[TABLE]
where is given by (5.1).
Posterior distribution: As shown in Bertozzi et al. (2017), a simple application of Bayes’ rule gives the posterior distribution of given (denoted by ):
[TABLE]
where is given by (5.2), and is given by (5.3).
From what has been discussed in the previous sections, the posterior can be characterized as the unique minimizer of the energy
[TABLE]
Let us first consider the gradient flow of with respect to the usual Wasserstein space (i.e. the one induced by the Euclidean distance).
We can study the geodesic convexity of this functional by studying independently the convexity properties of and of . Precisely:
- i) Since is a Gaussian measure with covariance , Example 1 shows that is -geodesically convex in Wasserstein space, where is the smallest non-trivial eigenvalue of .
- ii) The function is convex (see the appendix of Bertozzi et al. (2017)). Hence, the functional is $0$-geodesically convex in Wasserstein space.
It then follows from Proposition 3.1 that is -geodesically convex in Wasserstein space. As a consequence, if we consider , the gradient flow of with respect to the Wasserstein distance starting at (an absolutely continuous measure with respect to ), geometric inequalities can be immediately obtained from (3.7); such inequalities will not deteriorate with —see Remark 5.1.
However, the diffusion associated to this flow is given by
[TABLE]
and in particular its drift (more precisely the term ) deteriorates as gets larger. Notice that if we wanted to control the cost of discretization by rescaling the Euclidean metric (as exhibited in Example 2), the geodesic convexity of the resulting flow would vanish as gets larger.
The previous discussion shows that the flow of in the usual Wasserstein sense does not produce a flow with good convergence properties that at the same time is cheap to discretize (robustly in ). This motivates considering the gradient flow of with respect to the Wasserstein distance induced by a certain constant metric . Indeed, inspired by Proposition 4.3, let us consider the constant metric tensor
[TABLE]
Since the metric tensor is constant, in particular its induced volume form is proportional to the Lebesgue measure and hence we can write
[TABLE]
On the other hand, from the discussion in Section 3.3.1 we know that the densities of the stochastic process
[TABLE]
correspond to the gradient flow of the energy with respect to the Wasserstein distance induced by the metric , where is a Brownian motion on . This diffusion can be rewritten in terms of the standard Euclidean gradient and Brownian motion as
[TABLE]
after noticing that
[TABLE]
where for the second identity we have used the fact that is constant. How convex is the energy with respect to the Wasserstein distance induced by ? Since the metric tensor is constant it follows that
[TABLE]
where . Finally, due to the convexity of we deduce that
[TABLE]
We notice that in (5.6) appears as . This is a fundamental difference from (5.5) (where appears as ) with computational advantages, given that the eigenvalues of grow towards infinity.
Remark 5.2.
A carefully designed discretization of (5.6) induces the so-called Langevin pCN proposal for MCMC computing (see Cotter et al. (2013)).
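For orientation, the sketch below implements the plain preconditioned Crank-Nicolson (pCN) sampler of Cotter et al. (2013) for a posterior with Gaussian prior N(0, C) and likelihood proportional to exp(-Phi). This is the gradient-free member of the pCN family, not the Langevin variant mentioned in the remark, and it is swapped in here only because it is the simplest way to show how a prior-preserving proposal leads to an acceptance probability that involves the likelihood alone. All names, the step parameter `beta` and the one-dimensional toy model are assumptions made for the example.

```python
import numpy as np

def pcn_sampler(neg_log_likelihood, prior_cov_sqrt, u0, beta, n_steps, rng):
    """Plain pCN for a target proportional to exp(-Phi(u)) N(0, C)(du),
    where prior_cov_sqrt is a square root of the prior covariance C."""
    u = np.array(u0, dtype=float)
    phi_u = neg_log_likelihood(u)
    samples = np.empty((n_steps, u.size))
    accepted = 0
    for k in range(n_steps):
        xi = prior_cov_sqrt @ rng.standard_normal(u.size)  # draw from the prior
        v = np.sqrt(1.0 - beta**2) * u + beta * xi          # prior-reversible proposal
        phi_v = neg_log_likelihood(v)
        # Accept with probability min(1, exp(Phi(u) - Phi(v))).
        if np.log(rng.uniform()) < phi_u - phi_v:
            u, phi_u = v, phi_v
            accepted += 1
        samples[k] = u
    return samples, accepted / n_steps

# Toy model: prior N(0, 1), Phi(u) = (u - 1)^2 / 2, so the posterior is N(1/2, 1/2).
rng = np.random.default_rng(1)
samples, acceptance_rate = pcn_sampler(
    neg_log_likelihood=lambda u: 0.5 * float((u[0] - 1.0) ** 2),
    prior_cov_sqrt=np.eye(1),
    u0=np.zeros(1),
    beta=0.5,
    n_steps=40000,
    rng=rng,
)
posterior_draws = samples[4000:, 0]  # discard burn-in
print(posterior_draws.mean(), posterior_draws.var(), acceptance_rate)
```

Since the proposal leaves the prior invariant, the prior covariance never enters the accept-reject step, which is the discrete counterpart of the observation that in the preconditioned diffusion the graph Laplacian appears only through its inverse.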
Remark 5.3.
In the above we have considered a probit model for the likelihood function. The ideas generalize straightforwardly to other settings, notably the logistic model
[TABLE]
where
[TABLE]
The convexity of for the logistic model (5.7) can be established by direct computation of the second derivative of
5.2 Ginzburg-Landau Model
We now present the Ginzburg-Landau model for semi-supervised learning. This model will provide us with an example of a functional whose geodesic convexity with respect to Wasserstein distance is not positive (and hence one cannot deduce geometric inequalities describing the rate of convergence towards the posterior), but for which one can obtain a positive spectral gap giving the rate of convergence of the flow of Dirichlet energy in the sense.
Let
[TABLE]
We consider the following Bayesian model.
Prior:
[TABLE]
Likelihood function: For
[TABLE]
This leads to the following negative log-density function:
[TABLE]
Posterior distribution: Combining the prior and the likelihood via Bayes’ formula gives the posterior distribution
[TABLE]
where is given by (5.8), and is given by (5.9).
For this model, the negative prior log-density is not convex, and Wasserstein -geodesic convexity of the functional only holds for negative . In particular, it is not possible to deduce exponential decay taking the Wasserstein flow point of view. However, in the /Dirichlet energy setting we can still show exponential convergence towards the posterior . Indeed, because the negative log-likelihood of satisfies:
[TABLE]
there exists some for which has a Poincaré inequality with constant (see Chapter 4.5 in Pavliotis (2014)). In this example we can say more, and in particular we are able to find a Poincaré constant that depends explicitly on , the smallest non-trivial eigenvalue of and .
Let and let be its convex envelope, i.e. let be the largest convex function that is below . It is straightforward to show that and that
[TABLE]
Consider now the probability measure with Lebesgue density
[TABLE]
and define and as in Remark 3.5 using and instead of . For any given we then have
[TABLE]
where the first inequality follows from the fact that and the second inequality follows directly from the fact that . It follows that
[TABLE]
where the last inequality follows from the fact that the negative log-likelihood of satisfies the Bakry-Emery condition with constant (see Chapter 4.5 in Pavliotis (2014)). Clearly, the Poincaré constant above is very large for small or for large (number of labeled data points). We also notice that the cost of discretization of the diffusion associated to this flow increases with (as in Section 5.1).
Remark 5.4.
A similar analysis can be carried out now using the constant metric
[TABLE]
More precisely, consider the flow of the Dirichlet energy
[TABLE]
with respect to . How convex is this functional? For every we have
[TABLE]
from where it follows that
[TABLE]
A similar remark to the one at the end of section 5.1 regarding the dependence in of the resulting diffusion applies here as well.
6 Conclusions and Future Work
The main contribution of this paper is to explore three variational formulations of the Bayesian update and their associated gradient flows. We have shown that, for each of the three variational formulations, the geodesic convexity of the objective functionals gives a bound on the rate of convergence of the flows to the posterior. As an application of the theory, we have suggested a criterion for the optimal choice of metric in Riemannian MCMC schemes. We summarize below some additional outcomes and directions for further work.
- We bring attention to different variational formulations of the Bayesian update. These formulations have the potential of extending the theory of Bayesian inverse problems in function spaces, in particular in cases with infinite dimensional, non-additive, and non-Gaussian observation noise. Moreover, they suggest numerical approximations to the posterior by restricting the space of allowed measures in the minimization, by discretization of the associated gradient flows, or by sampling via simulation of the associated diffusion.
- The variational framework considered in this paper provides a natural setting for the study of robustness of Bayesian models, and for the analysis of convergence of discrete to continuum Bayesian models. Indeed, the authors Garcia Trillos and Sanz-Alonso (2018), Garcia Trillos et al. (2017b) have recently established the consistency of Bayesian semi-supervised learning in the regime with fixed number of labeled data points and growing number of unlabeled data. The analysis relies on the variational formulation based on Kullback-Leibler prior penalization in equation (5.4).
- The results in the paper give new understanding of the ubiquity of Kullback-Leibler penalizations in sampling methodology. In practice Kullback-Leibler is often used for computational and analytical tractability. The results presented in section 3.3 show that Kullback-Leibler prior penalization leads to a heat-type flow and, therefore, to an easily discretized diffusion process. On the other hand, $\chi^2$ prior penalization leads to a nonlinear diffusion process.
Acknowledgments
We are thankful to Matías Delgadino for pointing out to us the reference Ohta and Takatsu (2011) while participating in the CNA Ki-net workshop “Dynamics and Geometry from High Dimensional Data” that took place at Carnegie Mellon University in March 2017. We are also thankful to Sayan Mukherjee for the reference Zellner (1988). Finally, we thank the anonymous editor and referees for their immense help in improving the readability of our manuscript.
Appendix A Proof of Theorem 4.1
Notice that the measure can be rewritten as
[TABLE]
where
[TABLE]
From Proposition 3.1 the sharp constant for which is -geodesically convex in the -Wasserstein space is given by
[TABLE]
where and stand for the Ricci curvature and Hessian in the -metric. To establish the proposition, it suffices to show that for any given , is equal to the smallest eigenvalue of the matrix .
Let us start by recalling that the -Hessian of a function , denoted by , is the symmetric -tensor satisfying
[TABLE]
for every and every constant speed -geodesic curve with and . It is convenient to rewrite in terms of the Christoffel symbols of the metric , the Euclidean inner product, and the regular (Euclidean) gradient and Hessian of the function . Let be a constant speed -geodesic with , . We can then write:
[TABLE]
where and are the usual gradient and Hessian matrix of respectively. The acceleration of the curve can be written in terms of the Christoffel symbols. Namely, if we write in coordinates as
[TABLE]
the following system of second order ODEs holds:
[TABLE]
Plugging the geodesic equations back into (A.1), and setting , it follows that
[TABLE]
where is the matrix with coordinates
[TABLE]
Hence,
[TABLE]
Taking we can write the coordinate of the matrix as
[TABLE]
Using now the fact that the -divergence of the vector field can be written in terms of the Christoffel symbols as
[TABLE]
we deduce that
[TABLE]
On the other hand, the Ricci curvature can be written in terms of the Christoffel symbols and their derivatives (alternatively in terms of the metric and its first and second order derivatives) as
[TABLE]
where is the (symmetric) matrix with entries
[TABLE]
see do Carmo Valero (1992) for details. After some cancellations using the symmetry of the symbols, we obtain that
[TABLE]
and so
[TABLE]
Using (A.2) and (A.3) we deduce that
[TABLE]
Therefore the variational problem (for every fixed )
[TABLE]
can be rewritten, applying the change of variables as
[TABLE]
In turn this coincides with
[TABLE]
i.e., the smallest eigenvalue of the matrix
[TABLE]
This concludes the proof of the first part of the theorem.
Now we show the scaling property (4.4). Let . By definition
[TABLE]
where and are defined as in (4.2) and (4.3) but in terms of the metric From the expression (4.1) for the Christoffel symbols, it follows that they are invariant under rescaling of the metric and, since and depend on the metric only through the symbols, we deduce that , Therefore,
[TABLE]
∎
- Ambrosio, L., Gigli, N., and Savaré, G. (2008). Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media.
- Attias, H. (1999). “A Variational Bayesian Framework for Graphical Models.” In NIPS, volume 12.
- Bertozzi, A. L., Luo, X., Stuart, A. M., and Zygalakis, K. C. (2017). “Uncertainty quantification in the classification of high dimensional data.”
- Besag, J. E. (1994). “Comments on “Representations of knowledge in complex systems” by U. Grenander and M. I. Miller.” J. Roy. Statist. Soc. Ser. B, 56: 591–592.
- Burago, D., Burago, Y., and Ivanov, S. (2001). A course in metric geometry, volume 33 of Graduate Studies in Mathematics. American Mathematical Society, Providence, RI. URL https://doi.org/10.1090/gsm/033
- Burago, D., Ivanov, S., and Kurylev, Y. (2013). “A graph discretization of the Laplace-Beltrami operator.” arXiv preprint arXiv:1301.2222.
- Cotter, S. L., Roberts, G. O., Stuart, A. M., and White, D. (2013). “MCMC methods for functions: modifying old algorithms to make them faster.” Statistical Science, 28(3): 424–446.
- do Carmo Valero, M. P. (1992). Riemannian Geometry.
