Sampling with Barriers: Faster Mixing via Lewis Weights
Khashayar Gatmiry, Jonathan Kelner, Santosh S. Vempala

TL;DR
This paper improves the mixing rate bounds for Riemannian Hamiltonian Monte Carlo sampling of polytopes by introducing a hybrid barrier, leveraging new geometric analysis and extending self-concordance concepts.
Contribution
It introduces a hybrid Lewis weights and log barrier for RHMC, achieving faster mixing bounds and developing new geometric analysis tools for Markov chains on manifolds.
Findings
Mixing rate improved to O(m^{1/3} n^{4/3})
Developed a framework for analyzing Hamiltonian curves on Riemannian manifolds
Extended self-concordance to the infinity norm for sharper bounds
Khashayar Gatmiry (MIT; part of this work was done while visiting Georgia Tech and supported by NSF award CCF-2007443), Jonathan Kelner (MIT), Santosh S. Vempala (Georgia Tech; supported in part by NSF awards CCF-2007443 and CCF-2106444).
Abstract
We analyze Riemannian Hamiltonian Monte Carlo (RHMC) for sampling a polytope defined by $m$ inequalities in $\mathbb{R}^n$ endowed with the metric defined by the Hessian of a convex barrier function. The advantage of RHMC over Euclidean methods such as the ball walk, hit-and-run and the Dikin walk is in its ability to take longer steps. However, in all previous work, the mixing rate has a linear dependence on the number of inequalities. We introduce a hybrid of the Lewis weights barrier and the standard logarithmic barrier and prove that the mixing rate for the corresponding RHMC is bounded by $\tilde{O}(m^{1/3} n^{4/3})$, improving on the previous best bound of $\tilde{O}(m n^{2/3})$ (based on the log barrier). This continues the general parallels between optimization and sampling, with the latter typically leading to new tools and more refined analysis. To prove our main results, we have to overcome several challenges relating to the smoothness of Hamiltonian curves and the self-concordance properties of the barrier. In the process, we give a general framework for the analysis of Markov chains on Riemannian manifolds, derive new smoothness bounds on Hamiltonian curves, a central topic of comparison geometry, and extend self-concordance to the infinity norm, which gives sharper bounds; these properties appear to be of independent interest.
1 Introduction
Generating nearly uniform random samples from a high-dimensional polytope is a fundamental algorithmic problem with a rich history and powerful applications, notably including the only known fully polynomial-time approximation schemes for computing a polytope's volume. All efficient algorithms known for this problem work by designing a Markov chain whose stationary distribution is uniform over the polytope and showing that it mixes in a small number of steps.
In this paper, our main result is that we can construct such a Markov chain with an improved bound on its mixing time. For a polytope given by $m$ linear inequalities in $\mathbb{R}^n$, we describe a chain that mixes in $\tilde{O}(m^{1/3} n^{4/3})$ steps, improving on the best previous bound of $\tilde{O}(m n^{2/3})$. This allows us to approximate the volume within relative error using steps, which is a similar improvement over the best existing bound of .
1.1 Background and Related Work
In their seminal work [10], Dyer, Frieze and Kannan gave the first polynomial-time algorithm for this problem, as well as for the more general problem of sampling from a convex body specified by a membership oracle. The Markov chain in their algorithm was a grid walk, which takes steps along the edges of the graph obtained by intersecting the convex body with a discrete grid supported on for some . This graph is heavily dependent on the coordinate system: its diameter is proportional to the diameter of the convex body, and its conductance can be arbitrarily small if the convex body is scaled so that it is very long in some directions but short in others. However, they showed that, if one changes to a basis in which the convex body is appropriately "well-rounded," the grid walk mixes in polynomial time and that one can use a random sample from the grid to obtain one from the convex body.
The polynomial for the mixing time in [10] was quite large, and a sequence of later papers improved this by modifying the Markov chains and refining the analysis. Because one often wants to draw many samples from the body, these papers typically provide two bounds on the number of steps required: a bound when starting from an arbitrary point and including the cost of any preprocessing; and a bound when given a warm start, where the preprocessing has already been performed and the starting point is drawn from a distribution that is not too far from uniform.
In [13], Kannan, Lovász, and Simonovits showed that a ball walk whose steps are chosen uniformly from a Euclidean ball around the current point mixes in steps from a warm start and steps from an arbitrary starting point and including preprocessing. Later, Lovász and Vempala [24] studied the "hit-and-run" walk, which chooses a line in a random direction from the current point and then picks the next point randomly from the intersection of this line with the body, and they showed it also mixed in steps from a warm start but needed only steps for the first sample and preprocessing. These algorithms work on general convex bodies presented by oracles, but like the grid walk, they are strongly coordinate dependent, and they thus require strong additional assumptions about the coordinate system. In particular, analyses of these algorithms typically assume that the body is close to isotropic, i.e., that the covariance matrix of a random sample from the body is approximately the identity, and applying these algorithms to more general bodies requires costly preprocessing.
The dependence on the coordinate system in the aforementioned Markov chains comes from the dependence of the transition probabilities on the extrinsic geometry of the ambient Euclidean space. The impact of this extends beyond the overhead from the isotropy requirements. The geometry of the ambient space does not incorporate any information about how close a point is to the boundary, which typically leads to difficulties making progress with steps near the boundary. For example, if one is running a ball walk with step radius in an -dimensional cube, and the current point is some distance from one of the corners, a random point from the radius ball will lie outside the cube with probability exponentially close to 1, so naively trying random points until obtaining one in the cube would take a large number of tries. Moreover, even if one could sample a random point in the intersection of the ball with the cube, restricting the step to points inside the cube would distort the stationary distribution, and it would no longer be uniform. Remedying such difficulties typically involves (depending on the paper) some combination of taking smaller steps, enlarging the convex body (and failing if the walk ends up at a point outside the original body), and employing rejection sampling or a Metropolis filter to correct the stationary probabilities, all of which increase the required number of steps.
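The corner effect described above can be checked numerically. The following is a minimal sketch, assuming a unit cube $[0,1]^n$ and hypothetical values for the dimension, the step radius, and the distance to the corner (none of these numbers come from the paper). It estimates the fraction of ball-walk proposals that stay inside the cube, first from a point near a corner and then from the center.

```python
import numpy as np

def sample_ball(center, radius, rng, size):
    """Draw `size` points uniformly from the Euclidean ball around `center`."""
    n = center.shape[0]
    directions = rng.standard_normal((size, n))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Radii distributed so that the points are uniform in volume.
    radii = radius * rng.random(size) ** (1.0 / n)
    return center + directions * radii[:, None]

def inside_fraction(center, radius, rng, size=20000):
    """Fraction of ball proposals that land inside the cube [0, 1]^n."""
    points = sample_ball(center, radius, rng, size)
    inside = np.all((points >= 0.0) & (points <= 1.0), axis=1)
    return inside.mean()

rng = np.random.default_rng(0)
n = 50                            # hypothetical dimension
radius = 0.5                      # hypothetical step radius
near_corner = np.full(n, 0.01)    # hypothetical point close to the corner at the origin
cube_center = np.full(n, 0.5)

frac_corner = inside_fraction(near_corner, radius, rng)
frac_center = inside_fraction(cube_center, radius, rng)
print(f"near corner: {frac_corner:.4f}, center: {frac_center:.4f}")
```

Near the corner, a proposal survives only if almost every coordinate of the random direction has the right sign, which is why the acceptance fraction collapses as the dimension grows.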
For polytopes specified by an explicit collection of linear constraints, one can use the barrier functions employed by interior point methods to design random walks whose steps depend only on the intrinsic geometry of the polytope and are independent of the basis chosen for the ambient space. The idea behind these random walks is to use the Hessian of the barrier function to define a local norm/Riemannian metric on the interior of the polytope and specify the steps in terms of the resulting geometry. This mitigates some of the problems described above and has led to Markov chains whose mixing times grow with the number of constraints but depend more mildly on the dimension.
In the first such work, Kannan and Narayanan [14] introduced the Dikin walk and gave a mixing time bound of from a warm start for a polytope with facets in . This walk is similar to the ball walk, but it chooses its steps from Dikin ellipsoids, which are balls with respect to the Hessian of the standard logarithmic barrier function on the polytope. In [16], Laddha, Lee, and Vempala studied the analogous walk with respect to any self-concordant barrier and showed that it mixes in steps, where is a parameter they called the barrier parameter. By bounding this parameter for a different barrier function (a variant of a barrier due to Lee and Sidford [18]), they obtained an improved mixing rate bound of .
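The Dikin ellipsoid proposal described above can be sketched as follows. This is an illustrative sketch only, assuming the polytope is written as $\{y : Ay \le b\}$ (a sign convention chosen here) and using a hypothetical radius parameter. It omits the Metropolis filter that a complete Dikin walk uses to obtain a uniform stationary distribution.

```python
import numpy as np

def slacks(A, b, x):
    """Distances-to-constraint s_i(x) = b_i - a_i^T x (assumed convention A y <= b)."""
    return b - A @ x

def log_barrier_hessian(A, b, x):
    """Hessian of the log barrier -sum_i log s_i(x), i.e. A^T S^{-2} A."""
    s = slacks(A, b, x)
    A_scaled = A / s[:, None]
    return A_scaled.T @ A_scaled

def dikin_proposal(A, b, x, radius, rng):
    """Uniform point from the ellipsoid {y : (y - x)^T H(x) (y - x) <= radius^2}."""
    n = x.shape[0]
    H = log_barrier_hessian(A, b, x)
    L = np.linalg.cholesky(H)                 # H = L L^T
    u = rng.standard_normal(n)
    u *= rng.random() ** (1.0 / n) / np.linalg.norm(u)   # uniform in the unit ball
    # Solving L^T d = radius * u gives d^T H d = radius^2 * |u|^2.
    d = np.linalg.solve(L.T, radius * u)
    return x + d

# Hypothetical example: the cube [-1, 1]^3 and a point close to one facet.
A = np.vstack([np.eye(3), -np.eye(3)])
b = np.ones(6)
x = np.array([0.9, -0.5, 0.0])
rng = np.random.default_rng(1)
y = dikin_proposal(A, b, x, radius=0.9, rng=rng)
print("proposal:", y, "min slack:", slacks(A, b, y).min())
```

Because each relative slack change $|a_i^\top d| / s_i(x)$ is bounded by the local norm of $d$, any radius below 1 keeps the proposal strictly inside the polytope, so no proposals are wasted on infeasible points.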
In 2017, Lee and Vempala [20] reduced the mixing rate to using a process they called the geodesic walk. As in the Dikin walk, the steps are constructed using the Hessian of a barrier function. However, instead of using this to define a Euclidean ellipse, they use it to define a Riemannian metric, and they then solve a differential equation in each step to follow geodesics on the resulting manifold. These geodesics tend to curve away from the polytope's boundary, which lets them take longer steps in each iteration.
In 2018, Lee and Vempala [21] improved this to using Riemannian Hamiltonian Monte Carlo (RHMC) [11], which is the class of processes we'll use in this paper. While there is a large literature on using RHMC and related methods to sample smooth densities [7, 9, 5, 29, 22, 4], there are relatively few provable results about applying it in constrained non-smooth settings like polytope sampling. Roughly speaking, this improvement over the geodesic walk came from RHMC's ability to avoid the use of a Metropolis filter, which the geodesic walk requires in order to obtain the correct stationary distribution (even when the target distribution is uniform). RHMC chooses its trajectories according to a different differential equation that, remarkably, yields a reversible random walk with the desired stationary distribution, thus eliminating the need for a Metropolis filter and allowing greater progress in each step.
Advances in self-concordant barriers in the past decade as well as the improvement in the analysis of the Dikin walk suggest that a smaller dependence on $m$, the number of inequalities, which can be much higher than the dimension, should be possible. Nevertheless, improving on the bound of $\tilde{O}(m n^{2/3})$ has been a major open problem for the past 5 years. Moreover, the new techniques developed as a result of progress on non-Euclidean algorithms suggest that this is a fertile area for further TCS research.
1.2 Background on Riemannian Hamiltonian Monte Carlo
The motivation for RHMC comes from the Hamiltonian formulation of classical Newtonian mechanics. Hamiltonian mechanics parameterizes a physical system in terms of a position vector and a corresponding momentum vector (which is also referred to as "velocity" in some prior work on sampling polytopes with RHMC). The physics of the system are encoded in its Hamiltonian , which is simply the energy of the system written as a function of and , and its time evolution is determined by Hamilton's equations:
[TABLE]
With the appropriate choice of , these reproduce Newton's laws of motion, but they also generalize quite broadly, including to Riemannian manifolds.
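For reference, Hamilton's equations in their textbook form are written out below, using $x$ for the position, $v$ for the momentum, and $H(x, v)$ for the Hamiltonian (this notation is an assumption made here for illustration). The second display checks the statement about Newton's laws for the hypothetical choice $H(x, v) = \|v\|^2 / (2\mu) + U(x)$, with a mass $\mu$ and a potential $U$ that are likewise illustrative.

```latex
% Hamilton's equations (standard form; symbols x, v, H assumed for illustration)
\begin{align*}
  \frac{dx}{dt} = \frac{\partial H}{\partial v}(x, v),
  \qquad
  \frac{dv}{dt} = -\frac{\partial H}{\partial x}(x, v).
\end{align*}
% Example: H(x, v) = \|v\|^2 / (2\mu) + U(x)
\begin{align*}
  \frac{dx}{dt} = \frac{v}{\mu},
  \qquad
  \frac{dv}{dt} = -\nabla U(x)
  \quad\Longrightarrow\quad
  \mu \, \frac{d^2 x}{dt^2} = -\nabla U(x).
\end{align*}
```

The last implication is Newton's second law for a particle of mass $\mu$ subject to the force $-\nabla U(x)$.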
In RHMC, one defines a Markov chain by choosing a Hamiltonian that appropriately encodes the target distribution. At each step, the Markov chain chooses a random momentum vector and then finds the next point by numerically solving a differential equation to follow the trajectory given by Hamilton's equations.
One can show that the value of the Hamiltonian (i.e., the energy) and the volume element in the space of pairs are conserved along the trajectory, which can be used to show that the trajectories are preserved by time reversal (i.e., running time backwards). One can then use this to show that, if one uses the Hamiltonian defined below, the marginal distribution of will converge to the desired target distribution without requiring a Metropolis filter. (See [11] for the derivation for general RHMC and [21] for the specific class of Hamiltonians given below.)
More precisely, let the Hamiltonian at a point for a vector be defined as
[TABLE]
where is a positive definite matrix defining a Riemannian metric at each point as , and the target density to be sampled is proportional to restricted to the support of . One step of RHMC consists of the following: first pick from the Gaussian . Then for time follow the Hamiltonian curve jointly on :
[TABLE]
The final at time is the sampled point from the Markov Kernel. A natural choice for the metric turns out to be the Hessian of a self-concordant barrier function inside the polytope . The standard logarithmic barrier, , was used in [21] to prove that the resulting RHMC mixes in steps. Improving on this bound is our motivating open problem.
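To make one RHMC step concrete, here is a one-dimensional sketch on the interval $(-1, 1)$ with the uniform target. It assumes the Hamiltonian has the form $H(x, v) = f(x) + \tfrac{1}{2} v^\top g(x)^{-1} v + \tfrac{1}{2} \log\det g(x)$ with $f = 0$, which is the form commonly used in the RHMC literature and is stated here as an assumption. The metric is the second derivative of the log barrier $-\log(1 - x) - \log(1 + x)$. The integration time, the step size, and the RK4 integrator are hypothetical choices for illustration, not the paper's.

```python
import numpy as np

def metric(x):
    """g(x) = phi''(x) for the log barrier phi(x) = -log(1 - x) - log(1 + x)."""
    return 1.0 / (1.0 - x) ** 2 + 1.0 / (1.0 + x) ** 2

def metric_derivative(x):
    """g'(x), the derivative of the metric."""
    return 2.0 / (1.0 - x) ** 3 - 2.0 / (1.0 + x) ** 3

def hamiltonian(x, v):
    """Assumed form with f = 0: H = v^2 / (2 g(x)) + (1/2) log g(x)."""
    g = metric(x)
    return 0.5 * v * v / g + 0.5 * np.log(g)

def vector_field(state):
    """Right-hand side of Hamilton's equations: (dH/dv, -dH/dx)."""
    x, v = state
    g, dg = metric(x), metric_derivative(x)
    dx = v / g
    dv = 0.5 * v * v * dg / g ** 2 - 0.5 * dg / g
    return np.array([dx, dv])

def rk4_step(state, dt):
    """One classical Runge-Kutta step for the Hamiltonian ODE."""
    k1 = vector_field(state)
    k2 = vector_field(state + 0.5 * dt * k1)
    k3 = vector_field(state + 0.5 * dt * k2)
    k4 = vector_field(state + dt * k3)
    return state + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def rhmc_step(x, rng, total_time=0.5, dt=2e-3):
    """Resample v ~ N(0, g(x)), follow the Hamiltonian curve, return (new x, energy drift)."""
    v = np.sqrt(metric(x)) * rng.standard_normal()
    state = np.array([x, v])
    energy_start = hamiltonian(x, v)
    for _ in range(int(round(total_time / dt))):
        state = rk4_step(state, dt)
    drift = abs(hamiltonian(state[0], state[1]) - energy_start)
    return state[0], drift

rng = np.random.default_rng(0)
x, points, drifts = 0.3, [], []
for _ in range(50):
    x, drift = rhmc_step(x, rng)
    points.append(x)
    drifts.append(drift)
print("max |x|:", max(abs(p) for p in points), "max energy drift:", max(drifts))
```

The $\tfrac{1}{2}\log g(x)$ term grows without bound near the endpoints, so energy conservation keeps the trajectory inside the interval, and the small energy drift is a quick check that the numerical integrator is tracking the Hamiltonian curve.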
Using the log barrier implies that the mixing rate has a linear dependence on , the number of inequalities. So we have to look for a "better" barrier, and what exactly this entails will become clear presently. As we will see below, the barrier parameter of the self-concordant function, which is for the logarithmic barrier, plays an important role in the mixing time of this Markov chain. Given that there are efficiently-computable barriers for which this parameter is [18], one might hope to obtain faster mixing by simply replacing the logarithmic barrier with one of these. However, it turns out that just bounding the barrier parameter is insufficient, and we need to choose a barrier that also possesses certain stronger smoothness and stability properties. One of our primary technical challenges will be to define a notion that is stringent enough to guarantee the stronger properties required while still admitting a construction that improves upon the logarithmic barrier.
1.3 Results
In this paper, we use a hybrid barrier based on the Lewis weight barrier defined as
[TABLE]
where is a diagonal matrix whose diagonal entries are the -Lewis weights of the rescaled matrix and is the diagonal matrix whose entries are the slacks at point , i.e., .
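As a computational illustration of the weights that enter this barrier, the sketch below computes $\ell_p$ Lewis weights of a matrix with a simple fixed-point iteration, $w_i \leftarrow \big(a_i^\top (A^\top W^{1-2/p} A)^{-1} a_i\big)^{p/2}$, which is known to converge for $p < 4$. The text does not say how the weights are computed, so this iteration is a swapped-in standard technique, and the slack convention $s(x) = b - Ax$ and the example polytope are assumptions.

```python
import numpy as np

def lewis_weights(A, p, iterations=200):
    """l_p Lewis weights of the rows of A via fixed-point iteration (assumes p < 4)."""
    m, n = A.shape
    w = np.full(m, n / m)
    for _ in range(iterations):
        # M = A^T W^{1 - 2/p} A
        M = A.T @ (A * (w ** (1.0 - 2.0 / p))[:, None])
        # quad_i = a_i^T M^{-1} a_i for every row a_i
        quad = np.einsum("ij,ij->i", A @ np.linalg.inv(M), A)
        w = quad ** (p / 2.0)
    return w

def lewis_weights_at(A, b, x, p):
    """Lewis weights of the rescaled matrix S_x^{-1} A, with slacks s = b - A x (assumed)."""
    s = b - A @ x
    return lewis_weights(A / s[:, None], p)

# Hypothetical example: the cube [-1, 1]^3 at an off-center point.
A = np.vstack([np.eye(3), -np.eye(3)])
b = np.ones(6)
x = np.array([0.5, -0.2, 0.0])
w = lewis_weights_at(A, b, x, p=3.0)
print("weights:", w, "sum:", w.sum())
```

Two sanity checks follow from the definition: the weights sum to the rank of the matrix, and for $p = 2$ they coincide with the ordinary leverage scores.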
We define a hybrid barrier for a polytope as follows.
Definition 1 (Hybrid barrier).
We define the hybrid barrier inside a polytope as
[TABLE]
where are the slacks at point . We denote the normalizing factor of by .
For background on Lewis weights see Section 2. Our main theorem is a bound on the mixing rate of RHMC with this hybrid barrier.
Theorem 1.1 (Mixing).
Given a polytope , let be the distribution with density proportional to over the open set inside . Then, RHMC with stationary distribution on the manifold of the open set inside equipped with metric defined by the Hessian of the hybrid barrier with has mixing rate bounded by
[TABLE]
In particular, for the uniform distribution over (with ), the mixing rate is
[TABLE]
More specifically, the Markov chain starting at reaches with TV-distance at most to the target after
[TABLE]
steps, where and hide factors.
Note that without a warm start, the dependence in Theorem 1.1 could contribute another factor of to the mixing time. However, applying the Gaussian Cooling framework [6] extended to manifolds [21] lets us sample from for any without a warm start penalty, and also allows us to compute the volume of the polytope without a significant overhead.
Corollary 1.1.1 (Any start; Volume).
For the manifold Gaussian Cooling scheme in [21] with the hybrid barrier (4) applied to sample from the density inside a given polytope starting from , the total number of RHMC steps for any is bounded by
[TABLE]
Moreover, to compute the integral of in the polytope and in particular the volume of the polytope up to multiplicative error , the total number of RHMC steps is bounded by .
This improves on the previous best bound of due to [21] based on the standard logarithmic barrier. The proof of Theorem 1.1 requires the development of several technical ingredients. We summarize a few that are likely to be of independent interest.
The first is a new isoperimetric inequality for this hybrid barrier (see Section 2.2 for the definition of isoperimetry).
Theorem 1.2 (Isoperimetry of Hybrid Barrier).
Let be a metric corresponding to Hessian of the hybrid barrier, with support given by a polytope defined by inequalities in .
Then for , the distribution with density proportional to has isoperimetric constant at least
[TABLE]
As part of the proof, we develop stronger self-concordance properties of the Lewis weight barrier. The usual self-concordance [25] for barrier implies a control on the third order derivative of by its second derivative, which can be seen as a property of the metric ,
[TABLE]
where is the directional derivative of along direction . We will need to extend this self-concordance to third-order derivatives of . These types of estimates for the derivatives of the metric are known as Calabi estimates in the Differential Geometry literature [27, 30].
Lemma 1.3 (Manifold self-concordance of Hybrid barrier).
The hybrid barrier is third-order self-concordant with respect to the manifoldās metric , namely
[TABLE]
Here is the Löwner ordering between matrices ignoring logarithmic factors. The Calabi-type estimates in Lemma 1.3 turn out to be insufficient to improve the mixing rate. Hence, as one of our main contributions, we develop a new type of self-concordance, where instead of the local norm , we measure the spectral change of the metric in a different local norm . An intuitive description of is via its unit ball; namely, is the unique norm whose unit ball is the symmetrized polytope around , as illustrated in Figure 1(a). ( is the reflection of around .)
Lemma 1.4 (Infinity norm Third-order Self-concordance of Hybrid barrier).
The hybrid barrier, defined in (4), is third-order self-concordant with respect to the local infinity norm . Namely,
[TABLE]
In fact, the norm measures the ratio of the change of the distance to the th facet after taking step divided by the distance to facet , and then takes the maximum of this ratio over all facets. These estimates will allow us to prove important smoothness properties of certain quantities on the manifold that we are interested in. In the following, we sometimes refer to our notion of strong third-order self-concordance as infinity norm self-concordance, as it involves the local norm .
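The verbal description of the local infinity norm translates directly into code. The sketch below assumes the polytope is written as $\{y : Ay \le b\}$ with slacks $s(x) = b - Ax$ (a convention chosen here), and compares the local infinity norm with the local norm of the log-barrier Hessian for a random direction. The random polytope and the use of the log-barrier metric instead of the hybrid barrier are simplifications for illustration.

```python
import numpy as np

def local_inf_norm(A, b, x, v):
    """max_i |a_i^T v| / s_i(x): the largest relative change in any slack."""
    s = b - A @ x
    return np.max(np.abs(A @ v) / s)

def local_log_barrier_norm(A, b, x, v):
    """sqrt(v^T H(x) v) for the log-barrier Hessian H(x) = A^T S^{-2} A."""
    s = b - A @ x
    return np.linalg.norm((A @ v) / s)

rng = np.random.default_rng(0)
m, n = 200, 20                      # hypothetical sizes
A = rng.standard_normal((m, n))
b = np.ones(m)                      # hypothetical polytope {y : A y <= 1}
x = np.zeros(n)                     # strictly inside, since every slack equals 1

# Gaussian direction with covariance H(x)^{-1} (illustrative choice of metric).
s = b - A @ x
H = (A / s[:, None]).T @ (A / s[:, None])
v = np.linalg.solve(np.linalg.cholesky(H).T, rng.standard_normal(n))

inf_norm = local_inf_norm(A, b, x, v)
hess_norm = local_log_barrier_norm(A, b, x, v)
print("local infinity norm:", inf_norm, "local Hessian norm:", hess_norm)
```

Rescaling `v` to have local infinity norm exactly 1 places one of `x + v` or `x - v` on the boundary of the polytope, which matches the description of the unit ball as the symmetrized polytope around the current point.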
1.4 Technical overview
Mixing and Conductance.
Our general approach to bounding the mixing rate is based on bounding the conductance [23]. The standard approach to bounding the conductance of geometric walks of this type is to show an isoperimetric inequality for the underlying metric space and then prove that steps of the random walk behave well with respect to the underlying metric. Formally, we show two properties for the manifold obtained by equipping the interior of the polytope with the metric :
- Isoperimetry. The target density has a good isoperimetry constant on .
- One-step Coupling. The one-step distributions of the Markov chain given two close-by points on the manifold are close in TV-distance. Namely, for some parameter , after excluding a tiny set , given any two points with we show
[TABLE]
where denotes the Markov kernel starting from .
Isoperimetry.
The log barrier metric gives an isoperimetric coefficient of , which leads to a factor of in the conductance. In principle, this can be improved to by using a barrier with barrier parameter , as the general bound on the isoperimetry is for any strongly self-concordant barrier with barrier parameter [17]. While the universal and entropic barriers have , they are expensive to compute. The LS barrier [18] has while being efficient to compute. However, as we will see in more detail, as far as we know, the metric and its derivatives are not "smooth" enough in most of the directions in the tangent space, which means we would have to take rather small steps while running RHMC.
We will prove that the hybrid barrier has significantly better isoperimetry (Thm. 1.2) than the log barrier while maintaining sufficient smoothness.
Smoothness of Hamiltonian Curves and Comparison Geometry.
The starting point of our analysis is the fact that one can look at the ordinary differential equation of RHMC in Equation (2) as a second-order ODE on the manifold of the open set inside the polytope with metric . We will introduce this alternative form shortly. Looking at the Markov kernel of RHMC for a fixed point , the randomness to define this kernel comes from the initial velocity , which can be viewed as a vector on the tangent space of on the manifold distributed as a standard Gaussian with respect to the local metric, namely in the Euclidean chart. In order to show the One-step Coupling (Lemma 6) for the Markov kernel of RHMC, we bound the difference between the densities and at a given point on the manifold. These densities are the pushforwards of the Gaussian density in the tangent space of and respectively, onto the manifold through the Hamiltonian map for some fixed time , which maps the initial velocity to the solution of the ODE at time . The key to bounding the change of density is to understand how the Hamiltonian curves vary as we change the initial point from to for a fixed destination , given the particular geometry imposed by our hybrid barrier inside a polytope. In fact, understanding the extremal scenarios of the behavior of geometric quantities on a certain class of manifolds is the topic of Comparison Geometry [3, 26, 2]. In particular, to argue that the Hamiltonian curve changes sufficiently slowly, we need the metric of the manifold and its derivatives to be "stable". The simplest form of stability of the metric is the so-called self-concordance property, namely, is self-concordant if the derivative of in a unit direction in the tangent space is controlled by itself. This type of self-concordance for the first derivative of the metric is already known for the -Lewis weights barrier [19]. However, this notion of stability is too weak for our use since a typical Gaussian vector in the tangent space of has norm of order .
Nonetheless, one can hope to obtain estimates for with respect to a different norm whose value is typically much smaller than the norm. We show that self-concordance of the metric of the -Lewis weights barrier for with respect to the infinity norm of a re-parameterized version of is effective for characterizing the stability of Hamiltonian curves. This local infinity norm, which we denote by , can be regarded as the maximum ratio of the length of projected onto the normal of a facet divided by the distance of from that facet; its unit ball is the symmetrized polytope around . Importantly, one can see that for a typical Gaussian vector , is of order instead of . In fact, the norm of the tangent vector to the RHMC curve remains small for all times with high probability. This is favorable as we need a bound on the rate of change of the density only for typical values of and can ignore sets with small probability in bounding the conductance. An important part of our contribution is to derive self-concordance estimates for the derivatives of the metric of the -Lewis weights for up to third order, with respect to this local norm. We introduce our approach up to second order self-concordance in Section 3 and defer the third-order self-concordance to Appendix C. Although the number of terms that are created from differentiating the Lewis weights metric up to third order grows quite large, many subtensors are common, which enables us to treat them in a similar fashion. To avoid repetition, we gather the common Löwner inequalities that we use for various matrices in Section D, which we reuse to prove the self-concordance of the Lewis weights barrier. The infinity norm third-order self-concordance of the hybrid barrier follows from combining the infinity norm third-order self-concordance of the -Lewis weights barrier and the log barrier (see Section 3).
The threshold is essential to obtain our estimates. In particular, we can still control the derivative of the metric with respect to for the LS barrier, which is a Lewis weights barrier for polylogarithmically large , but it is an overestimate of the norm with high probability for a Gaussian vector in the tangent space of . Nonetheless, for small 's the ellipsoid of the -Lewis weights does not approximate the symmetrized polytope as well as for larger 's; in particular a large portion of the ellipsoid lies outside the symmetrized polytope. This means that we need to scale down the unit norm ellipsoid so that it fits inside the polytope, which then means we have to scale it up by a larger constant to make it contain the symmetrized polytope. As a result, the barrier parameter is large (see [16] for the definition of the barrier parameter), which in turn results in a poor isoperimetric constant.
We would like to have an ellipsoid at each point inside the polytope that approximates the symmetrized polytope around more accurately and is also stable as moves in random directions. For this, we go back to an idea of Vaidya from optimization and use a hybrid barrier by "regularizing" the -Lewis weight barrier for with the standard log barrier. We can give a better bound on the barrier parameter of this hybrid barrier compared to the log barrier, which implies that the corresponding metric has better isoperimetry. Moreover, the regularization does not harm the stability of the metric as the log barrier already enjoys stability with respect to the local infinity norm . In particular, we show that our hybrid barrier has stable higher-order derivatives in arbitrary directions based on the local norm . The particular choice of our barrier is essential to simultaneously prove third order infinity-norm self-concordance and good isoperimetry.
Hamiltonian curves and variations.
To see the high-level idea of how we show the one-step coupling of the Markov kernel, consider the shortest path between two points and , which is a geodesic on the manifold. Geodesics are a generalization of straight lines in the Euclidean space to arbitrary manifolds and naturally define the curve with the smallest possible length between two points on the manifold. Let the curve , parameterized by , be a length-minimizing geodesic connecting to with distance . Suppose that running the Hamiltonian ODE with initial location and initial velocity up to time takes us to a point on the manifold. As we start moving toward on the geodesic, parameterized by , we consider the variation of the initial Hamiltonian curve; namely a family of Hamiltonian curves parameterized by , where the -curve starts from point , perhaps with a different initial velocity , but ends up at the same destination at time . The geodesic from to and the corresponding Hamiltonian curves are illustrated in Figure 2.
Looking at the value of the density at point after taking one step of the Markov chain starting from , we observe it depends on two major components: (1) the Gaussian density of the initial velocity which is proportional to , and (2) the determinant of the Jacobian or the differential of the map from the initial velocity to the destination point , denoted by . Therefore, to study how quickly the density changes from to , we need to study the rate of change of the initial velocities and the Jacobians ; the latter will depend on the rate of change of the Ricci tensor on the manifold. To study the variation of the Hamiltonian curve, we start by defining these manifold concepts.
As we mentioned earlier, one can identify the location variable in the Hamiltonian ODE (2) as a point on the manifold with metric , and the velocity variable as a vector in the tangent space of , . Then, one can write the Hamiltonian ODE in Equation (2) as a second-order ODE on the manifold using the covariant derivative of , illustrated in Lemma 1.5. For background on Riemannian geometry and covariant differentiation, we refer the reader to Appendix A.
Lemma 1.5.
The Hamiltonian ODE in Equation (2) can be written using the covariant derivative of the manifold in a simplified form:
[TABLE]
Above, is the covariant derivative and is the bias (drift) vector field of the Hamiltonian curve, defined as
[TABLE]
In the above notation, is a vector whose th entry is . See Appendix B for a proof of Lemma 1.5. The above ODE (7) for Hamiltonian curves is similar to the second order ODE for geodesics; for the latter the bias vector is zero, i.e., the geodesic equation is given by [8]
[TABLE]
In physics, the Hamiltonian ODE in Equation (7) is important as it models the motion of a particle on a manifold moving under a force field determined by . Next, we define the notion of a family of Hamiltonian curves.
Definition 2 (Family of Hamiltonian curves).
We say $\big(\gamma_s(t)\big)$ is a family of Hamiltonian curves ending at some fixed whose starting point varies from to if for every fixed time , is a Hamiltonian curve in , and as a function of is a geodesic on from to . Unless specified otherwise, whenever we talk about the curve we mean the curve as a function of for a fixed . We write to refer to the derivative of the curve with respect to .
Before studying the variations of Hamiltonian fields, to give some high-level intuition, we start with variations of geodesics. More precisely, suppose is a variation of geodesics, i.e. is a geodesic in for every fixed (recall that the curve in parameter is also a geodesic from to ). For brevity, we sometimes refer to the curve by . To see how fast the geodesics change as a function of at time , for a fixed we take the derivative of with respect to at time ; this gives us a vector field along :
[TABLE]
This vector field, called a Jacobi field, is a fundamental object in studying the variations of geodesics. Importantly, one can write a second-order ODE to describe how evolves along the geodesic given initial conditions
[TABLE]
where the second derivative is the covariant derivative on the manifold with respect to , i.e., , and is the Riemann tensor. We will provide some intuition on the Riemann tensor and its role in the behavior of geodesics presently. An important point to observe here is that the covariant derivative of at is equal to the covariant derivative of the initial velocity of the geodesic, namely , with respect to (see Lemma A.4 for a proof):
[TABLE]
So the initial values that uniquely specify the Jacobi field are , which specifies how fast we change the starting point of the geodesic, and , which is how fast we change the initial velocity of the geodesic. This means that one can study the Jacobi field ODE to obtain estimates on how fast the initial velocity should change along the geodesic from to , for this family of Hamiltonian curves with the same destination . Now consider a direction perpendicular to the velocity of the geodesic at time , i.e., . Looking at the dot product of the vector on the right hand side of the Jacobi field ODE in (10) to itself, the quantity is intuitively measuring how much the Jacobi field is growing or shrinking in direction , meaning whether the geodesics parameterized by are converging or diverging in direction at time . This quantity is known as the sectional curvature of the plane spanned by and . Now consider a unit orthonormal parallelepiped at time , denoted by a set of orthonormal vectors in the tangent space of , where , and look at the evolution of its volume along the geodesic when each evolves according to the Jacobi equation; in each direction , the parallelepiped is either expanding or contracting, depending on whether the geodesics are converging or diverging in that direction, which in turn depends on the sign of the sectional curvature . Indeed, one can characterize the rate of change of this parallelepiped along the geodesic by summing the sectional curvatures for all ; this is the Ricci curvature of the manifold at in the direction :
[TABLE]
Note that the Ricci curvature is nothing but the trace of the Riemann tensor . On the other hand, the determinant of the Jacobian of the Hamiltonian map, a quantity of our interest to bound the change of density from to , can be characterized by the ratio of the volume of this parallelepiped at the beginning and the ending time . Indeed, we see later on that the log determinant of can be written as a time-weighted integral of the Ricci curvature along the geodesic.
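As a concrete illustration of how the sign of the sectional curvature governs whether nearby geodesics converge or diverge, the following minimal sketch integrates the scalar Jacobi equation J'' = -K J for a normal Jacobi field on a space of constant sectional curvature K. The constant-curvature setting, the function name `jacobi_norm`, and the RK4 integrator are illustrative assumptions and are not part of the paper's analysis.

```python
import numpy as np

def jacobi_norm(K, t_end, steps=2000):
    """Integrate J'' = -K * J with J(0) = 0, J'(0) = 1 using classical RK4.

    K is an assumed constant sectional curvature (illustrative setting only).
    Returns J(t_end), the size of the normal Jacobi field at time t_end.
    """
    h = t_end / steps
    y = np.array([0.0, 1.0])  # state (J, J')

    def rhs(state):
        return np.array([state[1], -K * state[0]])

    for _ in range(steps):
        k1 = rhs(y)
        k2 = rhs(y + 0.5 * h * k1)
        k3 = rhs(y + 0.5 * h * k2)
        k4 = rhs(y + h * k3)
        y = y + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return y[0]

t = 1.0
sphere = jacobi_norm(K=1.0, t_end=t)       # closed form: sin(t), geodesics converge
flat = jacobi_norm(K=0.0, t_end=t)         # closed form: t
hyperbolic = jacobi_norm(K=-1.0, t_end=t)  # closed form: sinh(t), geodesics diverge
```

With positive curvature the field grows more slowly than in flat space, and with negative curvature it grows faster, which matches the expanding or contracting parallelepiped picture above.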
One can extend these arguments to variations of Hamiltonian curves instead of geodesics. As a result, instead of the Riemann tensor in the Jacobi fields Equation (10), we end up with a slightly different operator which can be decomposed into a "geometric part," the Riemann tensor, and a "bias part," , which comes from the derivative of the Hamiltonian bias , defined in Equation (8). We define this fundamental operator rigorously.
Definition 3** (Operators and ).**
At any point , we define the operator as
[TABLE]
where is the covariant derivative on the manifold and is the Hamiltonian bias. Given the Hamiltonian curve , we define the operator on the tangent space as
[TABLE]
where is the Riemann tensor.
Similar to Jacobi fields, for a given family of Hamiltonian curves , one can write a second order ODE for the variational vector field along the Hamiltonian curve, which depends on operator (for the proof see Appendix B):
Lemma 1.6** (ODE for Hamiltonian fields).**
Given a family of Hamiltonian curves $\big(\gamma_{s}(t)\big)$, the vector field $\tilde{J}(t)\triangleq\partial_{s}\gamma_{s}(t)\big|_{s=0}$ is characterized by the following second order ODE:
[TABLE]
where is defined in Definition 3. We refer to as a Hamiltonian field.
The difference between the ODE of Hamiltonian fields (12) and that of Jacobi fields (10) comes from the fact that the primary Hamiltonian Equation (7) includes an additional bias vector compared to the geodesic Equation (9).
Now similar to the case of variations of geodesics, for variation of Hamiltonian curves, the log determinant of the Jacobian of the Hamiltonian map can be characterized by a weighted integral of the trace of instead of the Ricci tensor. Therefore, to study the rate of change of as we move from to , we need to study the rate of change of along the variation of Hamiltonian curves , which in turn depends on the rate of change of the Ricci tensor and the trace of operator , the two parts of the operator . These ideas are formalized as the -normality of the Hamiltonian curve in the definition below.
Definition 4**.**
We say a Hamiltonian curve is -normal up to time if, for all , it satisfies the following conditions:
- Bound on the Frobenius norm of (with respect to the metric ):
[TABLE]
- For all times and unit direction in the tangent space of :
[TABLE]
- For defined as the parallel transport of along the curve:
[TABLE]
Parallel transport of a vector on the manifold is a generalization of shifting vectors in Euclidean space, using the covariant derivative of the manifold (see Appendix A for the rigorous definition). In order to show the -normal property for the family of Hamiltonian curves, we need to define a more fundamental regularity condition for the Hamiltonian curves, which states that both the and norms remain small for the tangent vector along the Hamiltonian curve.
Definition 5** (Nice Hamiltonian curve).**
We say a Hamiltonian curve is -nice if for :
[TABLE]
In order to show the closeness of one-step distributions between and , we need the -normality for the family of Hamiltonian curves for all , as defined in Definition 4. Therefore, we need to show that the -niceness property is stable for our hybrid barrier. We show this in Lemma 1.7, proved in Section 6. Our -niceness framework is simpler and more general, and it avoids the technical machinery of auxiliary functions on curves used in [21], which requires bounding additional parameters.
Lemma 1.7** (Stability of norms).**
In the same setting as Theorem 1.8, given a family of Hamiltonian curves for which is -nice for
[TABLE]
then is a -nice family of Hamiltonian curves in the interval .
A major part of our contribution is that we relate this abstract notion of -normality to (a generalized notion of) metric self-concordance or Calabi-type estimates, which (1) crucially uses a different notion of norm to bound the derivatives of the metric and (2) needs to be satisfied for higher derivatives of the metric up to third order. Our framework can potentially be reused on other manifolds and distributions.
Theorem 1.8** (Smoothness).**
Given a Hessian manifold defined by the metric for our hybrid barrier (see Definition 1) for , define a Hamiltonian curve by the ODE in Equation (7) with target log density . Assume that is -nice (see the definition of niceness in Definition 5); then it is also -normal with parameters
[TABLE]
Proof.
The result follows from the key Lemmas 5.1, 5.5, and 5.15. ∎
To understand the effect of self-concordance on the density of the push-forward measure, note that the more slowly the metric changes, the more slowly the geodesics will converge or diverge from one another, so we have smaller scalar and Ricci curvatures. As an example, one can see that the Ricci curvature can be written formally using the metric and its first derivative on Hessian manifolds (see Equation (90)). As a result, the rate of change of the Ricci tensor, which corresponds to the parameter in Definition 1.8, depends on the derivatives of the metric up to second order, and in particular can be bounded efficiently given that the metric satisfies some form of second-order self-concordance. In this regard, a question that comes up is the following: in which norm should we measure the self-concordance of the metric?
A key point to notice here is that in measuring the change of , the Ricci tensor itself involves the change of the metric in a random direction as we can show that , the tangent of the Hamiltonian curve, is distributed as a Gaussian. Now if one uses the conventional framework of self-concordance in optimization, which measures the derivative of the metric in direction with respect to its local norm , then the typical value of the quantity is of order . This is a major reason we choose to measure self-concordance in the norm, which is for a typical Gaussian vector . Importantly, we use our third-order infinity norm self-concordance in Lemma 1.4 in a black-box manner to show the -normality of the Hamiltonian curve. On the other hand, even though the log barrier satisfies this type of self-concordance with respect to , it does not approximate the local geometry of the polytope well, which results in poor isoperimetry and slow mixing. For this reason, we develop infinity-norm self-concordance for the -Lewis weights barrier, whose local ellipsoids are better approximations of the symmetrized polytope. Our approach to developing the infinity norm self-concordance estimates crucially depends on . Therefore, to further enhance the isoperimetry of the metric, we regularize the Lewis weights barrier with the log barrier, which results in our final hybrid barrier in Equation (4).
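The gap between the two norms for a Gaussian direction can be checked numerically. The sketch below is a simplified illustration under stated assumptions: it uses the log-barrier metric rather than the hybrid barrier, writes the polytope as {y : A y >= b}, and draws v from a Gaussian with covariance equal to the inverse metric. In that setting the rescaled vector s = A_x v has covariance equal to an orthogonal projection of rank n, so its squared 2-norm has mean n while every coordinate has variance at most one. The variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 400, 10
A = rng.standard_normal((m, n))
x = np.zeros(n)
b = A @ x - rng.uniform(0.5, 2.0, size=m)  # guarantees slack A x - b > 0

slack = A @ x - b
A_x = A / slack[:, None]                   # rows rescaled by the slacks
g = A_x.T @ A_x                            # log-barrier Hessian (assumed metric)

# Sample v ~ N(0, g^{-1}): with g = L L^T, take v = L^{-T} z for standard normal z.
L = np.linalg.cholesky(g)
z = rng.standard_normal((n, 2000))
v = np.linalg.solve(L.T, z)
s = A_x @ v                                # each column is a rescaled direction

two_norms = np.linalg.norm(s, axis=0)      # squared values concentrate around n
inf_norms = np.abs(s).max(axis=0)          # maximum of m coordinates, variance <= 1 each

# Covariance of s is the projection A_x g^{-1} A_x^T, whose trace equals n.
cov_s = A_x @ np.linalg.solve(g, A_x.T)
print(np.mean(two_norms), np.mean(inf_norms), np.trace(cov_s))
```

Measuring derivatives of the metric against the infinity norm of the rescaled direction therefore costs far less than measuring them against its 2-norm when the direction is a typical Gaussian sample.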
Structure of the paper.
The rest of the paper is organized as follows. In Section 2 we discuss the basic tools and notation that we use throughout the paper. In Section 3, we give our proof of second-order infinity norm self-concordance estimates for the Lewis weights barrier (we defer the proof of strong third-order self-concordance to Appendix C). In Section 4, we bound the mixing time by combining multiple components, namely the stability of the Hamiltonian curves, the isoperimetry of the stationary distribution with respect to the chosen metric, and the smoothness of the manifold with our hybrid barrier. We relate the change of density of the Markov kernel between two points to the smoothness of the manifold. In Section 5, we show how we use infinity-norm third-order self-concordance to control the smoothness of the metric. Namely, we bound the norm of an important operator related to the Riemann tensor and the Hamiltonian potential, which appears in the ODE of variations of Hamiltonian curves (parameter ). To bound the determinant of the Jacobian of the RHMC map, which is a component in the pushforward density of the Gaussian distribution in the tangent space onto the manifold, we bound the rate of change of the trace of , which includes the Ricci tensor (parameter ) and another component originating from the Hamiltonian bias . Finally, we bound the norm of applied to the initial velocity of the Hamiltonian curve parallel transported along the curve. In Section 6, we prove the stability of the smoothness properties of the Hamiltonian curves as we start varying the initial location and velocity of the curve. In Section 7, we prove an isoperimetric inequality on the Riemannian manifold equipped with metric , the Hessian of our hybrid barrier. In Appendix A, we give some background on differential geometry. In Appendix B, we describe how to derive the second order Hamiltonian ODE based on the covariant derivative on the manifold.
In Appendix C, we show the infinity-norm third-order self-concordance of the metric for our hybrid barrier (4). Appendix D is devoted to obtaining spectral bounds for the derivatives of our metric, which involves the Lewis weights and their derivatives, which we use in our self-concordance arguments. Finally, in Appendix E we include missing proofs.
2 Preliminaries
To work with the metric imposed by our hybrid barrier , it is convenient to rescale the rows of the LP matrix by the slack variables, namely we define
[TABLE]
In our equations we treat the Hadamard product of matrices with higher priority, namely is equivalent to . We refer to the -Lewis weights vector of by and its diagonal matrix version by $\mathbf{W}_{x}\triangleq\texttt{Diag}\big(w_{x}\big)$. To work with a vector in the tangent space of , there is an important reparameterization of defined as
[TABLE]
Define the log barrier by :
[TABLE]
We denote the Hessian of the log barrier by . We see as a metric inside the polytope, such that for it defines a local metric . It is easy to check that the norm of a vector with respect to , i.e. , is given by the norm of the reparameterized vector defined in Equation (13).
[TABLE]
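The identity between the local norm and the norm of the reparameterized vector can be verified numerically. The following sketch assumes the polytope is written as {y : A y >= b} with slack A x - b, so that the rescaled matrix is A_x = Diag(slack)^{-1} A and the reparameterized vector is A_x v; these conventions and all variable names are assumptions made for illustration. It also compares the closed-form Hessian with a finite-difference second derivative of the log barrier.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 30, 5
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)
b = A @ x - rng.uniform(0.5, 2.0, size=m)  # x is strictly feasible for A y >= b

def log_barrier(y):
    """Standard logarithmic barrier of the polytope {y : A y >= b}."""
    return -np.sum(np.log(A @ y - b))

slack = A @ x - b
A_x = A / slack[:, None]                   # rows of A rescaled by the slacks
hessian = A_x.T @ A_x                      # closed-form Hessian of the log barrier

v = rng.standard_normal(n)
s_xv = A_x @ v                             # reparameterized tangent vector
local_norm_sq = v @ hessian @ v            # squared local norm of v

# Central finite difference for the second directional derivative along v.
h = 1e-4
fd = (log_barrier(x + h * v) - 2 * log_barrier(x) + log_barrier(x - h * v)) / h**2
print(local_norm_sq, s_xv @ s_xv, fd)
```

The three printed values should agree up to finite-difference error.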
For a given point inside polytope , we define the symmetrized polytope around as follows: we reflect around and intersect it with the namely , as illustrated in Figure 1(a). The approximation of the symmetrized body by the ellipsoids corresponding to the Hessian of the barrier function plays a key role in bounding the isoperimetry constant, as we describe in Section 7.
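A small numerical sketch can make the symmetrized polytope concrete. Assuming the polytope is {y : A y >= b} and that symmetrizing means intersecting it with its reflection through x, a point y belongs to the symmetrized body exactly when the slack-rescaled displacement A_x (y - x) has infinity norm at most one. The conventions and the helper names below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 40, 3
A = rng.standard_normal((m, n))
x = np.zeros(n)
b = A @ x - rng.uniform(0.5, 1.5, size=m)  # x is strictly inside {y : A y >= b}
slack = A @ x - b
A_x = A / slack[:, None]                   # rows rescaled by the slacks at x

def in_polytope(y):
    return bool(np.all(A @ y >= b))

def in_symmetrized(y):
    # y lies in the polytope and so does its reflection 2x - y through x
    return in_polytope(y) and in_polytope(2 * x - y)

def in_unit_inf_ball(y):
    # equivalent description: infinity norm of the rescaled displacement is at most 1
    return bool(np.max(np.abs(A_x @ (y - x))) <= 1.0)

points = x + 0.6 * rng.standard_normal((500, n))
agree = [in_symmetrized(y) == in_unit_inf_ball(y) for y in points]
print(all(agree))
```

This description is one way to see why the infinity norm of the rescaled direction is a natural quantity when comparing barrier ellipsoids with the symmetrized body.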
2.1 John Ellipsoid and Lewis weights
Proving good isoperimetry for a specific barrier can be reduced to how well the ellipsoids corresponding to the Hessian of the barrier at each point inside the polytope approximate the symmetrized polytope around . A natural way to approximate a symmetric polytope is via its John Ellipsoid, i.e. the ellipsoid of maximum volume contained in the polytope. Parametrizing the John ellipsoid as for a positive diagonal matrix , i.e., a weighted sum of the outer product of the rows of , the weights are characterized by the following optimization problem:
[TABLE]
where $\mathrm{W}=\texttt{Diag}\big(w\big)$ is the diagonal matrix corresponding to the vector . The John ellipsoid approximates the symmetrized polytope in the sense that (1) it is contained in the symmetrized polytope and (2) scaling it up by makes it contain the symmetrized polytope.
On the other hand, in order to prove smoothness of the HMC curves, we need to pick a barrier whose Hessian does not change too fast as a function of . Unfortunately the John ellipsoid is not stable. In particular, the weights which maximize (14) are not even continuous with respect to . An alternative is to use the -Lewis weights to define the ellipsoid, obtained as the solution to a relaxation of the program in (14):
[TABLE]
where $\mathrm{W}=\texttt{Diag}\big(w\big)$. Moreover, the optimal value of the program in (15) is denoted by the Lewis weights barrier at as defined next.
Definition 6** (Lewis weights barrier).**
The -Lewis weights barrier can be defined as the solution of the following optimization problem:
[TABLE]
Let be the metric defined by the Hessian of the Lewis weights barrier. It is known (Lemma 31 in [19]) that the ellipsoid corresponding to is roughly the same as the one defined by the Lewis weights, i.e. .
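For readers who want to experiment with Lewis weights, the sketch below computes them for a generic matrix with a simple fixed-point iteration. This is not taken from the paper: it is the iteration w_i <- (a_i^T (A^T W^{1-2/p} A)^{-1} a_i)^{p/2} known from the Lewis weights literature, which is a contraction for p < 4. The function name `lewis_weights`, the choice p = 3, and the iteration count are illustrative assumptions. At the fixed point, the weights equal the leverage scores of the reweighted matrix W^{1/2-1/p} A and therefore sum to the number of columns.

```python
import numpy as np

def lewis_weights(A, p, iters=200):
    """Approximate p-Lewis weights of the rows of A (sketch, assumes 0 < p < 4).

    Repeats w_i <- (a_i^T (A^T W^{1-2/p} A)^{-1} a_i)^{p/2} from a uniform start.
    """
    m, n = A.shape
    w = np.full(m, n / m)
    for _ in range(iters):
        M = A.T @ (w[:, None] ** (1.0 - 2.0 / p) * A)
        quad = np.einsum("ij,ij->i", A @ np.linalg.inv(M), A)  # a_i^T M^{-1} a_i
        w = quad ** (p / 2.0)
    return w

rng = np.random.default_rng(3)
m, n, p = 25, 4, 3.0
A = rng.standard_normal((m, n))
w = lewis_weights(A, p)

# Fixed-point check: leverage scores of W^{1/2 - 1/p} A should reproduce w.
B = w[:, None] ** (0.5 - 1.0 / p) * A
leverage = np.einsum("ij,ij->i", B @ np.linalg.inv(B.T @ B), B)
print(w.sum(), np.max(np.abs(leverage - w)))
```

For p = 2 the exponent 1 - 2/p vanishes and a single iteration returns the ordinary leverage scores.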
Lemma 2.1** (Lewis weights metric).**
For the Lewis weight barrier we can bound the local norm of its Hessian as
[TABLE]
where for a vector ,
[TABLE]
Equivalently
[TABLE]
Next, we define another important local norm at a point inside the polytope:
[TABLE]
This norm plays a key role in our definition of strong self-concordance in Equation (5). For any point inside the polytope, we define to be the projection matrix of reweighted by
Definition 7** (Projection matrix).**
We define the projection matrix , implicitly depending on , as
[TABLE]
where is the -Lewis weights calculated at . Moreover, we denote the Hadamard square of the projection matrix by :
[TABLE]
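The algebraic properties of a reweighted projection matrix and its Hadamard square can be checked with a short sketch. It assumes the projection has the form W^{1/2-1/p} A_x (A_x^T W^{1-2/p} A_x)^{-1} A_x^T W^{1/2-1/p}, which is the usual form in the Lewis weights literature but is an assumption here since the displayed formula is not reproduced above. The weights are arbitrary positive placeholders rather than true Lewis weights, so only properties that hold for any positive weights are verified.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, p = 12, 4, 3.0
A_x = rng.standard_normal((m, n))          # stands in for the slack-rescaled matrix
w = rng.uniform(0.1, 1.0, size=m)          # placeholder positive weights

B = w[:, None] ** (0.5 - 1.0 / p) * A_x    # reweighted matrix W^{1/2 - 1/p} A_x
P = B @ np.linalg.solve(B.T @ B, B.T)      # orthogonal projection onto the column space of B
P2 = P * P                                 # Hadamard (entrywise) square of P

print(np.trace(P), np.max(np.abs(P2.sum(axis=1) - np.diag(P))))
```

Since P is symmetric and idempotent, each row sum of its Hadamard square equals the corresponding diagonal entry of P, and the trace of P equals the rank n.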
To show the estimates in Lemma 1.4 for the -Lewis-weights barrier , we need to calculate the derivatives of the Lewis weights. The following lemma presents the form of the Jacobian of the Lewis weights as a function of , by taking its directional derivative in direction .
Lemma 2.2** (Derivative of the Lewis weights).**
For arbitrary direction , the directional derivative can be calculated as
[TABLE]
where we define
[TABLE]
Due to the importance and repetition of the vector in our calculations later on, we give it a separate notation
[TABLE]
Then, the derivative of can be written as
[TABLE]
In the above Lemma, note that , , , and are all functions of the location variable , but we drop for clarity in our calculations. Furthermore, when is clear from the context, we denote in short by . Next, we calculate the derivative of the projection matrix onto the column space of which is appropriately reweighted by the Lewis weights, as defined in Definition 7.
Lemma 2.3** (Derivative of the projection matrix).**
The derivative of the projection matrix in direction is given by
[TABLE]
where is defined in Equation (19). When is clear from the context, we refer to by for brevity. Moreover, controlling the spectral norm of the diagonal matrix ${\mathrm{R}_{x,v}}=\texttt{Diag}\big(\mathbf{G}_{x}^{-1}\mathbf{W}_{x}s_{x,v}\big)$ by the infinity norm of is one of the key ideas that allows us to break the mixing time.
To reduce notation, in the proof we also make the dependence of on implicit and drop the index .
We denote the target probability distribution inside the polytope by . We use for the Hessian of our hybrid barrier . We refer to the Hessian of the Lewis-p-weight before rescaling by , and the Hessian of scaled log barrier by , i.e.
[TABLE]
Throughout the proof, we use the notation to denote an inequality that ignores logarithmic factors. We use for the Euclidean derivative and and for covariant differentiation with respect to the metric structure on the manifold. Moreover, we use to denote Löwner inequalities up to universal constants.
2.2 Markov chains
For a Markov chain with state space , stationary distribution and next step distribution for any , the conductance of the Markov chain is defined as
[TABLE]
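For a finite-state chain the conductance can be computed by brute force, which gives a feel for the definition. The sketch below assumes the standard normalization, namely the minimum over subsets S with stationary mass at most 1/2 of the probability flow out of S divided by the mass of S; the function name `conductance` and the 4-cycle example are illustrative and not part of the paper.

```python
import itertools
import numpy as np

def conductance(P, pi):
    """Brute-force conductance of a finite chain with transition matrix P.

    Minimizes flow(S -> complement) / pi(S) over subsets with 0 < pi(S) <= 1/2.
    Exponential in the number of states, so only suitable for tiny examples.
    """
    k = len(pi)
    best = np.inf
    for size in range(1, k):
        for S in itertools.combinations(range(k), size):
            S = list(S)
            mass = pi[S].sum()
            if mass > 0.5 + 1e-12:
                continue
            Sc = [j for j in range(k) if j not in S]
            flow = sum(pi[i] * P[i, Sc].sum() for i in S)
            best = min(best, flow / mass)
    return best

# Lazy random walk on a 4-cycle: stay with probability 1/2, move to each neighbor with 1/4.
P = 0.5 * np.eye(4)
for i in range(4):
    P[i, (i + 1) % 4] += 0.25
    P[i, (i - 1) % 4] += 0.25
pi = np.full(4, 0.25)                      # uniform stationary distribution
phi = conductance(P, pi)
print(phi)  # 0.25, attained by a pair of adjacent vertices
```

In the continuous setting of this paper the minimum over subsets cannot be enumerated, which is why the conductance is bounded indirectly through isoperimetry and one-step coupling.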
The conductance of an ergodic Markov chain allows us to bound its mixing time, i.e., the rate of convergence to its stationary distribution, e.g., via the following theorem of Lovász and Simonovits.
Theorem 2.4**.**
Let be the distribution of the current point after steps of a Markov chain with stationary distribution and conductance at least , starting from initial distribution . For any ,
[TABLE]
To bound the conductance, we will reduce it to geometric isoperimetry.
Definition 8**.**
The isoperimetry of a metric space with target distribution is
[TABLE]
where is the shortest path distance in .
For a proof of the following lemma, see e.g. [28].
Lemma 2.5**.**
Given a metric space and a time-reversible Markov chain on with stationary distribution , fix any and suppose that for any with , we have . Then, the conductance of the Markov chain is .
We will need a more refined notion of -conductance, to be able to ignore small subsets when proving isoperimetry.
Definition 9** (-conductance).**
Consider a Markov chain with a state space , a transition distribution and stationary distribution . For any , the -conductance of the Markov chain is defined by
[TABLE]
A lower bound on the -conductance of a Markov chain leads to an upper bound on its mixing rate.
Lemma 2.6**.**
[23]** Let be the distribution of the points obtained after steps of a lazy reversible Markov chain with the stationary distribution . For and , it follows that
[TABLE]
The following theorem (see [15]) illustrates how one-step coupling together with isoperimetry leads to a lower bound on the -conductance. Its proof is similar to that of Lemma 13 in [21] and can be found in full detail in Appendix E.1.
Theorem 2.7**.**
For a Riemannian manifold , let be the stationary distribution of a reversible Markov chain on with a transition distribution . Let be a subset with for some . We assume the following one-step coupling: if for , then . Then for any and given , the -conductance is bounded below by
[TABLE]
3 Hybrid barrier metric and second-order self-concordance
The goal of this section is to prove the strong self-concordance properties of our hybrid barrier as defined in Lemma 1.3. We start by developing some basic properties of Lewis weights, the corresponding metric, and their derivatives, which we exploit throughout the proof. For the sake of clarity in the calculations, we denote the matrix regarding vector , which will appear a number of times, by . Here we show the infinity norm self-concordance for the first- and second-order derivatives of the metric as a warm-up. For the proof of our third-order self-concordance, we refer the reader to Appendix C. In this section, for the sake of brevity and clarity of the proof, we do not track the constants (which depend on ) and all of our inequalities are up to log factors.
The following lemma is proved in Appendix E.2.
Lemma 3.1** (-Lewis-weight metric).**
The p-Lewis weight metric can be written in the following form
[TABLE]
or alternatively
[TABLE]
In the following lemma we state a vital norm bound for the matrix , which enables us to obtain Löwner inequalities by pulling out the norm of , the direction of the derivative. Note that the condition is vital for this norm bound.
Lemma 3.2** (Operator infinity norm bound).**
For , given any vector and , we have
[TABLE]
Proof.
The proof can be found in Appendix D.1. ∎
Next, we state a lemma regarding the expansion of the directional derivative of the Lewis weights metric .
Lemma 3.3** (Derivative of the -Lewis weights metric).**
Given arbitrary direction , we have
[TABLE]
We have numbered the terms above by to refer to them later on.
In order to show the first-, second-, and third-order self-concordance of our metric, we need to control the terms above as well as their first and second derivatives. We give the proof for the first- and second-order self-concordance in this section and defer the proof of third-order self-concordance to Appendix C. Here, we start with a lemma which illustrates the calculation of the derivative of the term above. Ultimately we derive spectral bounds for each of the terms in these derivatives. We do not track constants and factors of in these calculations (note that with the choice these factors are at most polylogarithmic), and we ignore them to simplify the calculations.
Lemma 3.4**.**
The derivative of the term in Equation (22) in direction is given by (up to constants)
[TABLE]
where in the last term we are considering as a fixed vector (i.e. the derivative in direction does not hit ).
Proof.
Follows from ordinary differentiation and applying Lemma 2.3. ∎
In order to get a handle on these matrices via Löwner ordering, we derive various stability Lemmas for the derivatives of the Lewis weights and their related matrices , , etc and the stability of their derivatives. For example, we show the following third order self-concordance type property for Lewis weights themselves. The following Lemma is proved in Appendix D.2 in Lemma D.12.
Lemma 3.5** (Third derivative bound for Lewis weights).**
We have
[TABLE]
where recall .
Recall that the symbol means Löwner order up to a constant factor. For more details and the proofs, we refer the reader to Appendix D. Next, we proceed to show our first- and second-order strong self-concordance for the Lewis weight barrier. Note that strong self-concordance is easily checked for the log barrier, so the major remaining challenge is to prove it for the Lewis weights barrier. The general theme of the proof is that we pull out the infinity norm of the directional derivative vectors from the tensors that are generated as a result of differentiation. This requires us to develop estimates on various fundamental matrix quantities that we defined in section 2, namely at any point inside the polytope. Importantly, we develop these estimates with respect to the norm instead of the usual metric norm , which crucially requires . This constraint on has its root in controlling the norm of the matrix in Lemma D.1.
Lemma 3.6** (First order infinity norm self-concordance).**
For a direction we have
[TABLE]
Proof.
Direct consequence of Lemmas D.4 and D.13. ∎
In the rest of this section, we give the proof of the second order strong self-concordance of our metric.
Lemma 3.7** (Second order infinity norm self-concordance).**
The second derivatives of the metric of our hybrid barrier are bounded as
[TABLE]
Proof.
The goal is to look at the quadratic form of on arbitrary vector , i.e. and control it with . First, we consider each of the subterms as a result of differentiating in Lemma 3.3, in direction . This derivative is expanded in Lemma 3.4. Regarding the term of this expansion in Lemma 3.4, we have
[TABLE]
Next, for the term in Lemma 3.4:
[TABLE]
The first part is handled similarly to term in Equation (23). For the second part:
[TABLE]
For the term in Lemma 3.4:
[TABLE]
where we used Lemmas D.7 and D.3. Next, for term :
[TABLE]
Terms and are similar. For term , for the first term , note that
[TABLE]
which can be dealt with similarly to term using Lemma D.1. The second term in is also similar to . For the last term in , note that
[TABLE]
which implies
[TABLE]
As a result,
[TABLE]
The bound for term in Lemma 3.4 follows similarly, using Lemma D.14:
[TABLE]
Next, we move on to bound the directional derivative of term in Lemma 3.3, in direction . This derivative is calculated in Lemma E.3 in the Appendix. For subterm of defined in Lemma E.3, using Lemma D.13:
[TABLE]
For subterm of defined in Lemma D.13, we have using Lemmas D.15 and D.3:
[TABLE]
For subterm of , using Lemmas D.3, D.4, and D.13:
[TABLE]
Subterm of is similar to and subterm is similar to subterm .
Now considering the second formulation of the metric presented in Lemma 3.1, in Equation (21), above we handled the case where one of the directional derivatives, with respect to either or , hits the part in the last term of the metric in Equation (21). Hence, regarding this last term, the remaining terms in its derivative are the ones for which the derivative with respect to both of and hit either the matrix or the matrix, i.e.
[TABLE]
All of the terms in (25) can be bounded by . For terms in the first line of Equation (25) we use Lemmas D.3 and D.9. For the second line we use Lemmas D.3, D.4, and D.1. The bound on the rest of the terms in Equation (25) follows from Lemmas D.3 and D.1 as well. Hence, overall we have shown for the last term in Equation (21):
[TABLE]
On the other hand, the derivatives of the initial terms , , in Equation (21) are similarly handled using Lemmas D.12, D.9, D.3, and D.1. This completes the proof of the second order strong self-concordance for .
∎
Next, we move on to the third-order self-concordance. Here the number of terms grows quite large, but fortunately bounding them uses a similar approach. Hence, to give the essential ideas and derivations, we omit the proofs for the similar terms and only illustrate with the directional derivative of the term in Equation (3.3), which is the most complicated to handle. We state our final result for the directional derivatives of in Lemma 3.8 below (for the proof, see Appendix C).
Lemma 3.8** (Second derivative of ).**
Let be the symmetrized version of the term in Lemma 3.3:
[TABLE]
where recall . Then the second derivative of in directions and can be spectrally controlled by the metric norm as follows:
[TABLE]
Finally, it is not hard to see that the log barrier also satisfies the infinity norm strong self-concordance. For completeness, we state this in the following lemma, proved in Appendix D.6.
Lemma 3.9** (Infinity self-concordance of the log barrier).**
The metric regarding the log barrier in the polytope satisfies infinity norm third order strong self-concordance:
[TABLE]
Combining Lemma 3.9 with the infinity norm self-concordance of the Lewis weights metric proves the infinity self-concordance of the metric regarding our hybrid barrier.
Proof of Lemmas 1.4 and 1.3.
The proof of Lemma 1.4 is a direct consequence of Lemmas 3.6, 3.7, C.1, and 3.9. The proof of Lemma 1.3 follows from Lemma 1.4 and the fact that the norm can be upper bounded by the norm according to Lemma 7.4. ∎
4 Bounding conductance and mixing time
The goal of this section is to illustrate how we combine the different pieces to prove Theorem 1.1. To this end, we prove a general-purpose mixing time bound on a manifold in Theorem 4.1. The key to showing Theorem 4.1 is Lemma 4.6, whose proof we defer to later. We start by defining the important concept of a "nice set," which links the initial velocity to the normality.
Definition 10** (Nice set).**
Given , we say a set is -nice if for , we have
1. .
2. for every with , the Hamiltonian family of curves between and ending at is -normal.
Theorem 4.1**.**
Suppose we want to sample from some distribution on the manifold , starting from distribution with . Suppose there exists a set with , such that for every there exists an -nice set . Moreover, let be the isoperimetric constant of the pair . Then, for any satisfying , , , the mixing time to reach a distribution within TV distance of is bounded by
[TABLE]
Proof.
Now with this choice of , Lemma 4.6, which given a nice set for shows a bound on the closeness of the one-step distributions, implies for every and every within distance :
[TABLE]
Using Theorem 2.7, for we get a lower bound on the -conductance for :
[TABLE]
Now using Lemma 2.6 with the same choice of ,
[TABLE]
where we used the fact that (recall the definition of ) and the fact that we pick of the order as . The proof is complete.
∎
What remains to show is Lemma 4.6 regarding the closeness of the one-step distributions of the Markov chain, which is the main content of this section. This is vital in proving Theorem 4.1 as it is one of the main building blocks, in addition to the isoperimetry of the target measure, for bounding the conductance of the chain.
To prove Lemma 4.6, we start with some definitions. The overall plan is that we approximate the density of a Hamiltonian step as written in Equation (26) as in Equation (27) and bound its change going from to for most of the vectors within a nice set in the tangent space of .
Definition 11**.**
Consider a family of Hamiltonian curves for time interval all ending at , where , and . Define the local push-forward density of onto by
[TABLE]
where is the inverse Jacobian of the Hamiltonian after time , sending to , which we denoted by . We consider the Jacobian as an operator between the tangent spaces. The push-forward density at with respect to the manifold measure is given by
[TABLE]
Note that refers to the manifold measure. Define the approximate local push-forward density of as
[TABLE]
Lemma 4.2** (Lemma 22 in [21]).**
For an -normal Hamiltonian curve, for we have
[TABLE]
Lemma 4.3** (Lemma 32 in [21]).**
In the setting of Lemma 4.4, for an normal , denoting by , we have
[TABLE]
Lemma 4.4** (Change of the pushforward density).**
Consider the family of smooth Hamiltonian curves up to time from to pointing towards , namely , , and regarding a point along the geodesic between and whose tangent to the geodesic is . Then, given that is normal for and , we have
[TABLE]
Proof.
Simply differentiating Equation (27):
[TABLE]
where we used Lemma 4.3. Furthermore, using Lemma 5.5 and noting our assumption :
[TABLE]
∎
Lemma 4.5** (Change in probability of events under approximate density).**
Let be a nice set in the tangent space of and let be an arbitrary point on the geodesic between and . For a vector in the tangent space of with we can consider the family of Hamiltonian curves between and with for all . Now let be the finite measure obtained by restricting the normal distribution in the tangent space of to vectors for which the corresponding . For a point , let be the approximate pushforward density of onto , defined as
[TABLE]
where is defined in (27). We define to be the corresponding finite measure. Now given a fixed event with probability , we have
[TABLE]
and for all :
[TABLE]
Note that depends on , and we are fixing the set in the tangent space of .
Proof.
Let be the density of further restricting to 's for which where recall , and be such that . Note that
[TABLE]
But note that for the first term
[TABLE]
To see why the second line holds, note that the Hamiltonian curve from to is normal from our assumption for time . The second line follows from Lemma 4.4. The third line follows simply by the choice .
Similarly for the second term
[TABLE]
where we used . Combining these and putting back in (31) implies
[TABLE]
To show case (30), using the fact that the densities regarding and are within a constant factor of one another (28):
[TABLE]
which follows from assumption on while
[TABLE]
which follows because is a low-probability event, using a Gaussian tail bound. This completes the proof. ∎
Using the bounds on smoothness, we will show that one-step distributions of RHMC from two nearby points will have large overlap (and hence TV distance less than ).
Lemma 4.6** (One-step coupling for RHMC).**
Consider two points and and suppose is a -nice set in the tangent space of . Now given step size such that and close by point such that , where is the distance on the manifold, the total variation distance between and is bounded by .
Proof.
Similar to (29), we define
[TABLE]
First, note that for any event , we have using Lemma 6.7
[TABLE]
Suppose is a set for which
[TABLE]
This means , and in particular from (78)
[TABLE]
which also implies
[TABLE]
Now from (28) we have . But now using the assumptions on and and plugging them into Equation (30) in Lemma 4.5 we can state
[TABLE]
which implies at time we have
[TABLE]
or in other words
[TABLE]
Now again applying the constant boundedness of the ratio between and , we obtain
[TABLE]
By picking small enough constants, Equation (33) implies
[TABLE]
This further implies from (78):
[TABLE]
which contradicts Equation (32). This completes the proof. ∎
Finally, combining Theorems 4.1 and 1.8 and Lemma 1.7, we prove the main Theorem 1.1.
Proof of Theorem 1.1.
Given a fixed parameter , using Lemma 6.7, there exists a high probability set ,
[TABLE]
such that every has a corresponding nice set .
(Recall is the distribution supported on the polytope with density .)
Now for the same arbitrary we considered above, we wish to satisfy the conditions in Theorem 4.1 on , namely , , (we use this notation to emphasize that are functions of ). According to Theorem 1.8, these parameters can be set as:
[TABLE]
and Lemma 1.7 imposes the following condition:
[TABLE]
Hence, the conditions on translate into
[TABLE]
Note that a sufficient condition on which satisfies all of the above constraints is (assuming )
[TABLE]
Now to satisfy the condition in Theorem 4.1, noting Equation (34), we set
[TABLE]
On the other hand, from Theorem 1.2, we see that for the choice of converging to from below ( is a small constant), the square of the isoperimetry constant is . Now plugging this and from (35) into Theorem 4.1 and noting the choice of we get the following mixing bound:
[TABLE]
But it is easy to check that picking only adds a factor to . Note that with this choice of , we have , hence the mixing time becomes
[TABLE]
But note that if or , then . Hence, the mixing time boils down to
[TABLE]
∎
5 On the Geometry and Stability of Hessian Manifolds
In this section, we prove the smoothness of the operator , namely we show with that a nice Hamiltonian curve is normal. Our proof does not open up the definition of the metric and its derivatives for our hybrid barrier. Instead, we exploit the strong self-concordance property in Lemma 1.4 to show the desired smoothness bounds, hence our framework can potentially be applied in other settings. Interestingly, in order to bound the trace of certain operators that arise from bounding the smoothness of the Hamiltonian curves on the manifold, it turns out that writing them as the average of random low-rank tensors enables us to apply our strong self-concordance estimates more efficiently and provides bounds sufficient to improve the mixing time.
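The idea of writing a trace as an average of random low-rank terms can be illustrated with the standard identity tr(M) = E[z^T M z] for a standard Gaussian vector z. The following is a minimal sketch of that identity only, not the paper's argument; the matrix `M`, the function names, and the number of probes are hypothetical choices made for illustration.

```python
import random


def quadratic_form(M, z):
    """Return z^T M z for a square matrix M given as a list of rows."""
    n = len(z)
    return sum(z[i] * M[i][j] * z[j] for i in range(n) for j in range(n))


def trace_via_gaussian_probes(M, num_probes=20000, seed=0):
    """Monte Carlo estimate of tr(M) using tr(M) = E[z^T M z], z ~ N(0, I)."""
    rng = random.Random(seed)
    n = len(M)
    total = 0.0
    for _ in range(num_probes):
        # Each probe contributes the rank-one term <M, z z^T>.
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += quadratic_form(M, z)
    return total / num_probes


# Hypothetical symmetric positive definite matrix (not from the paper).
M = [[2.0, 0.5, 0.0],
     [0.5, 1.0, 0.3],
     [0.0, 0.3, 3.0]]
exact_trace = sum(M[i][i] for i in range(len(M)))
estimate = trace_via_gaussian_probes(M)
print(exact_trace, estimate)
```

The estimate fluctuates around the exact trace with a standard deviation proportional to the Frobenius norm of `M` divided by the square root of the number of probes, which is one reason Frobenius-norm control of the operators is useful in this kind of argument.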
5.1 Bounding
Lemma 5.1.
For the parameter regarding the Frobenius norm bound of , given the control over the infinity norm of , (note that the vector is inherent in the definition of ), then we have
[TABLE]
Proof.
This follows directly from Lemmas 5.16 and 5.17. ∎
First, recall the definition of the Frobenius norm:
[TABLE]
To bound , i.e. the Frobenius norm of , note that
[TABLE]
where is the Riemann tensor and is obtained from the bias vector . In particular, we have
[TABLE]
We start from the Riemann tensor. The proof of this bound follows directly from the infinity norm second-order self-concordance of .
Lemma 5.2 (Frobenius norm of random Riemann tensor).
Assuming , we have
[TABLE]
Proof.
For the first term of as written in (36):
[TABLE]
For the second term of the Riemann tensor:
[TABLE]
∎
Lemma 5.2 states as an upper bound on the Frobenius norm of given that the curve is nice.
Next, we prove a lemma regarding the expansion of the operator , applying the covariant derivative.
Lemma 5.3 (Subterms for operator ).
We have the following expansion for the subterms of operator :
[TABLE]
where
[TABLE]
Moreover,
[TABLE]
Proof.
By differentiating the first term:
[TABLE]
But noting that , the first and third terms are the same and we get the result. For the second term:
[TABLE]
Finally, for the second claim of the lemma:
[TABLE]
∎
Next, we bound the Frobenius norm of the part in the following lemma, again only using infinity norm second-order self-concordance of to bound each of the four terms.
Lemma 5.4 (Frobenius norm of operator ).
We have
[TABLE]
Proof.
To bound the Frobenius norm of the first part of the first term of operator stated in LemmaĀ 5.3:
[TABLE]
where in the second line we are rewriting as , which is true due to the symmetry of the derivatives of the metric on Hessian manifolds, i.e. . Furthermore, we used Lemma 5.8 in the last line. For the second part of the first term of , note that , so the Frobenius norm is at most automatically. Next, for the first part of the second term of , again based on Lemma 5.3
[TABLE]
where in the last line we used Lemma 5.11. For the second part of the second term of , from Lemma 5.3:
[TABLE]
for the first part
[TABLE]
For the second part:
[TABLE]
∎
Combining Lemmas 5.4 and 5.2 concludes
[TABLE]
5.2 Bounding
Here we state the bound on .
Lemma 5.5.
For a point on a -nice Hamiltonian curve with , namely that and along the curve up to time , suppose now we move in the unit direction parameterized by . Then, the change in the trace of the operator can be bounded as
[TABLE]
Proof.
This follows directly from Lemmas 5.6 and 5.14. ∎
In Sections 5.2.1 and 5.2.2, we bound the change in the part and the Ricci part of , respectively.
5.2.1 Bounding the change in Operator
Given a distribution that we want to sample from, we study the properties of the derivatives of the corresponding operator which is defined as
[TABLE]
where
[TABLE]
Recall from Lemma 5.3:
[TABLE]
where we defined matrices and . Here we introduce the main lemma of this section which bounds the derivative of the trace of :
Lemma 5.6 (Bound on the change of operator ).
For operator defined in (40) for any unit direction we have
[TABLE]
Proof.
To prove Lemma 5.6, we bound the derivative of and in direction separately in Lemmas 5.7 and 5.9. As a result, the proof of Lemma 5.6 directly follows from Lemmas 5.7 and 5.9. ∎
We start from in the following Lemma.
Lemma 5.7 (Trace of ).
Regarding the operator , we have
[TABLE]
Proof.
Note that from Lemma 5.3:
[TABLE]
For the second part, note that . Hence
[TABLE]
So we only need to handle the derivative of the first part. First, we bound the -norm of the vector in the following helper lemma.
Lemma 5.8.
For the gradient of the potential we have
[TABLE]
Proof.
We decompose the potential as for
[TABLE]
where are the -Lewis weights.
Now using Lemma D.28, we have
[TABLE]
[TABLE]
where is the projection matrix regarding the reweighted matrix by the Lewis weights . Note that we are using Lemma 2.1 to conclude that . For the log barrier part, similarly:
[TABLE]
which completes the proof. Now we handle the first term of the operator, namely the first term in (41), using the helper lemmas. ∎
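For readers unfamiliar with Lewis weights, the following sketch computes them for a small matrix. It assumes the standard characterization that the p-Lewis weights w of a matrix with rows a_i satisfy w_i^{2/p} = a_i^T (A^T W^{1-2/p} A)^{-1} a_i, and uses the well-known fixed-point iteration, which contracts for p < 4. The matrix `A_rows`, the value `p = 3`, and all function names are hypothetical; in the paper the weights are taken for the slack-rescaled constraint matrix, which this sketch does not model.

```python
def gram_inverse_2x2(rows, scales):
    """Inverse of sum_i scales[i] * a_i a_i^T for rows a_i in R^2."""
    g00 = sum(s * a[0] * a[0] for a, s in zip(rows, scales))
    g01 = sum(s * a[0] * a[1] for a, s in zip(rows, scales))
    g11 = sum(s * a[1] * a[1] for a, s in zip(rows, scales))
    det = g00 * g11 - g01 * g01
    return [[g11 / det, -g01 / det], [-g01 / det, g00 / det]]


def lewis_weights(rows, p, num_iters=100):
    """Fixed-point iteration w_i <- (a_i^T (A^T W^{1-2/p} A)^{-1} a_i)^{p/2}."""
    w = [1.0] * len(rows)
    for _ in range(num_iters):
        inv = gram_inverse_2x2(rows, [wi ** (1.0 - 2.0 / p) for wi in w])
        w = [
            (a[0] * (inv[0][0] * a[0] + inv[0][1] * a[1])
             + a[1] * (inv[1][0] * a[0] + inv[1][1] * a[1])) ** (p / 2.0)
            for a in rows
        ]
    return w


# Hypothetical 4 x 2 matrix (rows a_i), chosen only for illustration.
A_rows = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, -2.0)]
weights = lewis_weights(A_rows, p=3.0)
print(weights, sum(weights))
```

At the fixed point each weight equals a leverage score of the reweighted matrix, so the weights lie in (0, 1] and sum to the number of columns, here 2.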
Now we go back to bounding the first term in (42), which we can expand as
[TABLE]
For the first term in (43), according to Lemma 5.8:
[TABLE]
where we used Lemma D.27 to bound and used Lemma 7.4. For the second term in (43), we follow a similar reasoning:
[TABLE]
Therefore, bounding boils down to bounding . Focusing on the subterm of regarding , namely
[TABLE]
where we used Lemma 5.8 and Lemma 7.4. Similarly for :
[TABLE]
where we used Lemma 7.4. Combining the above with the inequality
[TABLE]
and plugging back into Equation (45), we obtain the following bound on the second term in Equation (43):
[TABLE]
For the third term in (43), we reduce it to the first group of terms. Note that
[TABLE]
which is the same upper bound obtained in Equations (44) and (48). Hence, combining Equations (44), (48), and (49) we conclude
[TABLE]
∎
Next, we focus on the second term in (41) and bound the derivative of the trace of the operator .
Lemma 5.9 (Trace of ).
For operator as defined in Equation (41) we have
[TABLE]
Let
[TABLE]
Proof.
From Lemma 5.3, we have
[TABLE]
We bound the derivatives of the two terms in Equation (50) separately in Lemmas 5.10 and 5.13. Hence, the proof of Lemma 5.9 directly follows from these Lemmas. ∎
We start from bounding the derivative of the first term in Equation (50), i.e. we wish to bound .
Lemma 5.10.
Regarding the first quadratic form in Equation (50), we can bound its trace as
[TABLE]
Proof.
To this end, we repeat an argument similar to the one we used in Equation (43) for bounding
[TABLE]
In particular, our argument regarding in Equations (46) and (47) depends only on the bound on and . We show a similar bound for . As a warm-up, we start by bounding the norm , then we move on to bounding .
Lemma 5.11.
We have
[TABLE]
Proof.
We have
[TABLE]
where is a vector with its th entry equal to . The first inequality above is due to Cauchy-Schwarz, and the second one is due to Lemma D.27. ∎
Furthermore, we have the following bound on :
Lemma 5.12.
For the derivative of in direction we have
[TABLE]
Proof.
Note that
[TABLE]
For the first term above,
[TABLE]
following our argument in (51):
[TABLE]
For the second term, we write the second within the trace as an expectation , i.e.
[TABLE]
Therefore, using independent normal vectors , we can rewrite the second term as
[TABLE]
where the first inequality follows from Cauchy-Schwarz and the second one follows from Lemma D.29 and the fact that . For the third term similarly
[TABLE]
Combining all three bounds similar to our argument for we conclude
[TABLE]
∎
According to Lemma 5.12, similar to our bound for by substituting with in Lemma D.31 we get
[TABLE]
Moreover, according to Lemma D.32 and Lemma 5.11:
[TABLE]
Further, using Lemma D.31 combined with Lemma 5.11:
[TABLE]
Hence, combining Equations (53), (54), and (55),
[TABLE]
which completes the bound for the trace of the first part of the operator in Equation (50). ∎
Finally, we move on to bounding the derivative of the trace of the second operator in Equation (50), namely .
Lemma 5.13.
We can bound the derivative of the trace of the second operator in Equation (50) as
[TABLE]
Proof.
Recall from Lemma 5.3:
[TABLE]
Now we wish to calculate the derivative of the trace of this operator, namely
[TABLE]
We separate the case when the derivative w.r.t. is taken with respect to the outer in (58). First, we calculate the derivative with respect to the outer regarding the term :
[TABLE]
Note that
[TABLE]
Note that this 2-form is symmetric and PSD since
[TABLE]
Moreover, note that
[TABLE]
Hence, Equation (59) can further be upper bounded as
[TABLE]
But we have already bounded the operator norm of in Lemma D.30 by , which implies its trace can be at most . Taking expectation, we have
[TABLE]
Hence, we conclude
[TABLE]
On the other hand, note that for the second term in Equation (57), there is a symmetry between the inner and outer :
[TABLE]
Hence, it suffices to bound the case where the derivative hits one of them, namely the inner .
Therefore, we move on to taking derivative with respect to the part of . For this, we can again use the trick of writing as :
[TABLE]
But from Equation (57), we have
[TABLE]
Now taking derivative with respect to :
[TABLE]
But for the first term in (62), we can write:
[TABLE]
For the second term in (62):
[TABLE]
where we used the third-order self-concordance property of with respect to the infinity norm, as shown in Section C, and also Lemma 7.4. Combining Equations (60), (63), and (64) completes the proof of Lemma 5.13. ∎
5.2.2 Bounding the change in the Ricci Tensor
First, we state the main result of this section, which is a bound on the change of the Ricci tensor.
Lemma 5.14 (Bound on the change of Ricci tensor).
Given the assumptions of Lemma 5.5, we have
[TABLE]
Note that in the above, is implicitly a function of as well.
Proof.
According to Lemma A.5, has two terms. We start by analyzing the first term:
term
Taking derivative of this subterm of Ricci tensor in direction :
[TABLE]
Now we use Lemmas 3.7 and 3.6 to bound these terms:
[TABLE]
Similarly
[TABLE]
Terms in the derivative of that involve the derivative of
Differentiating with respect to , we get
[TABLE]
where we used Lemma D.25 to bound . ∎
Second part of the Ricci Tensor.
We should take the derivative of in direction , which is the second term in the Ricci tensor according to Lemma A.5. As a warm-up, we first bound the value of this term before taking the derivative:
Before taking derivative w.r.t
Note that the second part of the Ricci tensor is
[TABLE]
Hence, we only need to bound one of the RHS terms with high probability. We have
[TABLE]
Now to bound the derivative of this part of the Ricci tensor, first we pretend that is fixed. Then
[TABLE]
which we further bound as
[TABLE]
Next, we take derivative in direction from the second term of the Ricci tensor.
Taking derivative in direction .
First, we differentiate the inner term in :
[TABLE]
For the remaining derivatives we can substitute the inner by . Now for the remaining derivatives which do not involve differentiating :
[TABLE]
Finally we have to check when differentiates :
[TABLE]
where we used Lemma D.25 to bound .
5.3 Bounding
Here we bound the parameter which is defined as the maximum possible value of the norm of , where is the parallel transport of the initial velocity. The idea is to bound the infinity norm of along the Hamiltonian curve, then show a more efficient bound compared to the naive operator norm of which works with both of the norms and .
Recall the definition of the parameter :
[TABLE]
where is the parallel transport of along the Hamiltonian curve .
Lemma 5.15 (Bound on ).
Given that is -nice, we have
[TABLE]
up to time .
Proof.
From the definition of niceness, we have an upper bound on the infinity norm . Using that, we can apply Lemma 5.18 to obtain
[TABLE]
Finally, combining this with Lemmas 5.16 and 5.17:
[TABLE]
∎
Here we show a norm bound for which we used to bound . To this end, we show bounds on the Riemann tensor and operator separately in Lemmas 5.16 and 5.17.
Lemma 5.16 (Operator norm of random Riemann tensor).
Assuming , we have
[TABLE]
Proof.
Similar to Lemma 5.2, using the form of Riemann expansion in Equation (36):
[TABLE]
∎
Next, we state a similar mixed-norm bound for operator .
Lemma 5.17 (Operator norm of ).
We have
[TABLE]
Proof.
Recall from Lemma 5.3:
[TABLE]
Starting from the first part of the term :
[TABLE]
Note that for the second part, , hence the corresponding operator is the identity and has operator norm one.
Next, we move on to the second term of in (37). For the first part of it from Equation (50), we have:
[TABLE]
where we used Lemma 5.12. For the second part, note that from Equation (38):
[TABLE]
Starting from the first part, now we rewrite this term in a better way as
[TABLE]
Now due to Lemma D.30 the norm of the corresponding operator is one:
[TABLE]
For the second part in (65), we write it as
[TABLE]
Hence, the operator norm is bounded as
[TABLE]
∎
Next, we show a bound on the derivative of the infinity norm of the parallel transported vector given that we know the infinity norm of is constant (randomness + stability).
Lemma 5.18 (Infinity norm of the parallel transport).
Given and a -nice Hamiltonian curve , we have for :
[TABLE]
where is the parallel transport of along the curve.
Proof.
As is the parallel transport vector, from opening up the covariant derivative being zero:
[TABLE]
which implies using Lemma 7.4:
[TABLE]
where we used from the definition of niceness and the fact that parallel transport preserves the norm of and . This ODE implies that, to avoid blow-up, we should pick . Under this condition, we further get
[TABLE]
which completes the proof. ∎
In the next section, we show the stability of the infinity norm and the manifold norm of along the curve for to time , where is defined for a fixed time .
6 Stability of Hamiltonian curves
In this section, we show that the niceness property holds for Hamiltonian curves with high probability, and is stable in a family of Hamiltonian curves.
6.1 Stability of the niceness property
Here we show that the niceness property of Hamiltonian curves is stable.
Lemma 6.1 (Stability of norms).
For a family of Hamiltonian curves , given that is -nice, then is also -nice for all . In other words, given that for all we have and , then for all and under the condition
[TABLE]
we have:
[TABLE]
Proof.
Suppose we denote the time until which we run the Hamiltonian curve by , i.e. . Suppose the argument is not true, and consider the set to be the times for which . Since is continuous, the set is open. Hence, if we consider the infimum of times for which , then the infimum is attained, i.e. , while for every time . Exactly the same way we can define the first time for which defining the function we have while for .
First assume the case where . Now again from the continuity of and the fact that is a compact set, its supremum is attained in some time . This means
[TABLE]
for all , while . But now using this infinity norm bound for times (for the fixed time ), we can obtain a Frobenius norm bound for from Lemma 5.1 as
[TABLE]
Now we can apply Lemma 23 in [21] because the condition is satisfied, so we get
[TABLE]
for every , where we are using the fact that . But note that for we can write
[TABLE]
where the first line follows from opening the definition of covariant derivative. Finally, this ODE implies that for all times (with the correct choice of constants), which from continuity holds also for time . But this contradicts , which completes the proof for the case . Note that the role of this condition in the above proof is to ensure that the -norm condition does not fail until time .
Next, we consider the latter case . Similar to the above argument, until time we have the Frobenius bound on from Lemma 5.1, and again from Lemma 23 in [21] as , we have
[TABLE]
for . Now we write an ODE to control the norm of where is defined in the same way as the previous case, and get a contradiction:
[TABLE]
which implies
[TABLE]
Therefore, at time the change in from its initial value is at most , which means the value of should have remained below . The contradiction completes the proof for the second case. ∎
Next, we show a helper lemma regarding the derivative of in direction :
Lemma 6.2.
On a -nice Hamiltonian curve with , we have:
[TABLE]
Proof.
Note that from Lemma 1.7 we have . Hence, from Lemma 5.1, we can apply Lemma 23 in [21] to obtain
[TABLE]
But now from Lemma D.26, setting and :
[TABLE]
From Lemma 1.7, we have and note that from our assumption on the parameterization, , which combined with Equation (69) finishes the proof. ∎
6.2 High probability bound on norms along the Hamiltonian curve
First, we show a norm bound for the norm along the Hamiltonian curve, given a bound at initial time.
Recall the ODE related to the RHMC for curve is
[TABLE]
Opening this up
[TABLE]
First, we show a non-random bound on the norm given a bound at time zero.
Lemma 6.3 (Boundedness of manifold norm along the Hamiltonian curve).
Suppose . Then for time we have
[TABLE]
Proof.
Note that
[TABLE]
hence, taking covariant derivative
[TABLE]
where we used Lemma D.23 to bound . This implies
[TABLE]
Solving this ODE,
[TABLE]
∎
Lemma 6.4 (Stability bound on the infinity norm along the curve).
For a Hamiltonian curve with , suppose for a fixed time we know . Then for all times we have
[TABLE]
Proof.
Consider the Hamiltonian ODE below:
[TABLE]
which implies
[TABLE]
Hence, using Lemma 7.4
[TABLE]
But by Lemma 6.3, an upper bound on the -norm of at time zero implies a bound along the whole curve. Combining with Lemma D.23:
[TABLE]
This ODE implies that if at a given point the infinity norm of is bounded by , then for times within we have an bound on the infinity norm, which completes the proof. ∎
Lemma 6.5 (Stability bound on the -norm along the curve).
For a Hamiltonian curve with , suppose for a fixed time we know . Then for all times we have
[TABLE]
Proof.
This follows directly from Lemma 6.3. ∎
Lemma 6.6.
Suppose we pick a random from and then run a Hamiltonian curve starting from with initial vector picked according to . Then, for any time , with probability at least we have
[TABLE]
Proof.
From the property of the Hamiltonian curve, we know the joint density of is . Focusing on the probability of , we see that for each , is a Gaussian distributed variable with variance
[TABLE]
where the inequality follows from Lemma 7.2. Hence, from the Gaussian tail bound, for a fixed time :
[TABLE]
where we note that is just the maximum of Gaussian random variables and we applied a union bound over the entries of . Moreover, note that is a sub-Gaussian random variable with mean and sub-Gaussian parameter . Hence
[TABLE]
Next, consider a cover of equally distant times of the Hamiltonian curve from to . Apply the above argument for all the times in this cover with a union bound on top. This implies with probability at least , we have for all and , where we used the fact that . Now combining this with Lemmas 6.4 and 6.5 completes the proof. ∎
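The Gaussian tail bound plus union bound used above can be checked numerically in a simplified setting. The sketch below assumes m independent centered Gaussians with a common standard deviation sigma (in the proof the entries are not independent, but the union bound does not need independence) and uses the threshold sigma * sqrt(2 log(2m / delta)), which follows from the tail bound P(|g| > t) <= 2 exp(-t^2 / (2 sigma^2)). The names `m`, `sigma`, and `delta` and their values are hypothetical.

```python
import math
import random


def union_bound_threshold(m, sigma, delta):
    """Threshold t with m * 2 * exp(-t^2 / (2 sigma^2)) = delta."""
    return sigma * math.sqrt(2.0 * math.log(2.0 * m / delta))


def empirical_failure_rate(m, sigma, delta, num_trials=2000, seed=0):
    """Fraction of trials in which max_i |g_i| exceeds the threshold."""
    rng = random.Random(seed)
    threshold = union_bound_threshold(m, sigma, delta)
    failures = 0
    for _ in range(num_trials):
        max_abs = max(abs(rng.gauss(0.0, sigma)) for _ in range(m))
        if max_abs > threshold:
            failures += 1
    return failures / num_trials


# Hypothetical parameters chosen only for illustration.
m, sigma, delta = 200, 1.0, 0.05
threshold = union_bound_threshold(m, sigma, delta)
failure_rate = empirical_failure_rate(m, sigma, delta)
print(threshold, failure_rate)
```

The observed failure rate is typically well below `delta`, since both the tail bound and the union bound are loose; covering several time points along the curve, as in the proof, only changes the number of events in the union.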
Next, we state a lemma showing the existence of nice sets, used in the proof of Theorem 1.1.
Lemma 6.7 (Existence of nice set).
There is a high probability region such that (where, recall, is the probability distribution of density inside the polytope) and for every , there is a high probability region in the tangent space of , namely such that for all , the Hamiltonian curve starting from with initial vector is -nice, namely for all :
[TABLE]
Proof.
For every point , define to be the set of vectors in its tangent space such that the resulting curve is -nice up to time . Define region to be the set of points on such that , where denotes the density of in the tangent space of (the constant is motivated by the definition of nice sets). Now if it were the case that , then under the joint distribution on , there would be a region with probability at least such that the Hamiltonian curve starting from with initial vector is not -nice. But this contradicts Lemma 6.6. ∎
7 Isoperimetry
In this section, we bound the isoperimetry constant corresponding to our barrier, as stated in Theorem 1.2.
Proof of Theorem 1.2.
From Lemma 7.3 and the definition of :
[TABLE]
This means that if we scale the ellipsoid by then it includes the symmetrized polytope around , whose unit ball is exactly , i.e.
[TABLE]
On the other hand, from Lemma 7.4 we have
[TABLE]
which implies that the unit ball of the norm, or the Dikin ellipsoid, is contained in the symmetrized polytope around , i.e.
[TABLE]
Combining the relations (72) and (73) implies that the symmetric self-concordance parameter defined in [17] is at most , which in turn implies that the distribution has isoperimetry with constant at least with respect to metric as desired.
Furthermore, using the Brascamp-Lieb inequality, we know has isoperimetry at least on a manifold whose metric is the Hessian of [1]. Combining these two facts completes the proof. ∎
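The containment of the Dikin ellipsoid in the symmetrized polytope can be sanity-checked in a simpler setting. The sketch below uses only the standard logarithmic barrier, not the paper's hybrid barrier, for a polytope written as a_i^T x <= b_i (a convention assumed here). For that barrier the local norm of v is the Euclidean norm of the vector with entries a_i^T v / s_i, where s_i is the slack of constraint i, so unit local norm forces |a_i^T v| <= s_i for every i. The polytope, the point `x`, and all names are hypothetical.

```python
import math
import random

# Hypothetical polytope {y : a_i^T y <= b_i} in the plane and an interior point x.
A_rows = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0), (1.0, 1.0)]
b = [1.0, 1.0, 1.0, 1.0, 1.5]
x = (0.3, -0.2)


def slacks(point):
    """Slack b_i - a_i^T point of every constraint (positive in the interior)."""
    return [bi - (a[0] * point[0] + a[1] * point[1]) for a, bi in zip(A_rows, b)]


def log_barrier_norm(point, v):
    """Local norm of v at point for the metric sum_i a_i a_i^T / s_i^2."""
    s = slacks(point)
    return math.sqrt(sum(((a[0] * v[0] + a[1] * v[1]) / si) ** 2
                         for a, si in zip(A_rows, s)))


def dikin_boundary_is_feasible(point, num_dirs=1000, seed=0):
    """Check that point + v and point - v are feasible whenever v has unit local norm."""
    rng = random.Random(seed)
    for _ in range(num_dirs):
        theta = rng.uniform(0.0, 2.0 * math.pi)
        d = (math.cos(theta), math.sin(theta))
        scale = 1.0 / log_barrier_norm(point, d)
        v = (d[0] * scale, d[1] * scale)
        for sign in (1.0, -1.0):
            y = (point[0] + sign * v[0], point[1] + sign * v[1])
            if min(slacks(y)) < -1e-12:
                return False
    return True


contained = dikin_boundary_is_feasible(x)
print(contained)
```

Checking both `x + v` and `x - v` corresponds to containment in the symmetrized polytope around `x` rather than in the polytope alone.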
We denote the th row of the matrix by . Note that a bound on the quantity for our metric enables us to control the infinity norm of via the following simple Cauchy-Schwarz argument on the th entry of :
[TABLE]
However, while we have the following relation
[TABLE]
only considering the subpart of our metric , the quantity might be orders of magnitude larger than its counterpart in Equation (74). This is because, recall, as we state in 2.1
[TABLE]
but we do not have such spectral bounds between matrices and . In fact, the authors of [19] show and are up to log factors spectrally the same, as long as is polylogarithmically large, but here we are not able to work with such large values since our infinity norm estimates break for . Nonetheless, we show that adding the log barrier and appropriately rescaling the metric indeed enables us to bound . To prove a bound on , we start by comparing the matrix , which is proportional to the Hessian of the hybrid barrier before scaling by , with the matrix , which then enables us to analyze the quantity via the closed form Equation (74). In the next lemma, we compare these two matrices.
Lemma 7.1 (Löwner comparison with different weighted matrices).
For the PSD matrix we have
[TABLE]
Proof.
Suppose for a given coefficient we wish to have
[TABLE]
The first thing we notice is that if , then the inequality is already satisfied. Hence, w.l.o.g we assume
[TABLE]
In this regime of , to pick a which satisfies Equation (75), we need to have
[TABLE]
But using Equation (76), it is sufficient to have
[TABLE]
so we need to pick as large as
[TABLE]
which completes the proof. ∎
Lemma 7.2 (Taming the hybrid metric).
For the metric of our hybrid barrier before scaling up by , i.e. for defined as
[TABLE]
we have for every :
[TABLE]
In particular, for the metric of the hybrid barrier we have
[TABLE]
Proof.
Note that using Lemma 2.1, we have
[TABLE]
Hence, using Lemma 7.1:
[TABLE]
On the other hand,
[TABLE]
Balancing Equations (78) and (79) implies
[TABLE]
Finally, noting the fact that
[TABLE]
the proof is complete. ∎
Finally, using our estimate on in Lemma 7.2, we bound the norm of an arbitrary vector :
Lemma 7.3 (Bounding the ellipsoid norm by the infinity norm).
We can bound the metric norm by the infinity norm as
[TABLE]
Proof.
Using Lemma 2.1, we have
[TABLE]
and
[TABLE]
Noting the definition of in Equation (77) completes the proof. ∎
Lemma 7.4 (Bounding infinity norm by the ellipsoidal norm).
Given an arbitrary vector , we have
[TABLE]
Proof.
For all we have using Lemma 7.2:
[TABLE]
The second inequality follows from the fact that from Lemma D.1. ∎
Lemma 7.5 (Infinity norm of random vectors).
For the metric of our hybrid barrier, given random vector , with high probability we have
[TABLE]
Proof.
Note that is just a scaled version of :
[TABLE]
Now computing the variance of the th entry of , we observe using Lemma 7.2
[TABLE]
The bound on directly follows from the fact that using Lemma D.1. ∎
Appendix A Riemannian Geometry
A.1 Basic Manifold Definitions
In this section, we go through some basic definitions in differential geometry that are essential to our proofs. A manifold is defined abstractly as a topological space which locally resembles .
Definition 12.
A manifold is a topological space such that for each point , there exists an open set around such that is a homeomorphism to an open set of .
Tangent Space.
For any point , one can define the notion of tangent space for , , as the equivalence class of the set of curves starting from (), where we define two such curves and to be equivalent if for any function on the manifold:
[TABLE]
One can define a linear structure on , hence it is a vector space. Now given a positive definite quadratic form on the vector space , one can equip the manifold with metric . While the definition of a general manifold is abstract, putting a metric on it allows us to measure lengths, areas, volumes, etc. on the manifold, and do calculus similar to Euclidean space. Next, we define some basic notions regarding manifolds.
Differential.
For a map between two manifolds, the differential at some point is a linear map from to with the property that for any curve on with , we have
[TABLE]
As a special case, for a function over the manifold, the differential at some point is a linear functional over , i.e. an element of . Writing (81) for a curve with , and testing property (81), we see
[TABLE]
We can write .
Vector field.
A vector field is a smooth choice of a vector in the tangent space for all .
Metric and inner product.
A metric is a tensor on the manifold which is simply a smooth choice of a symmetric bilinear map over . Alternatively, the metric or dot product can be seen as a bilinear map over the space of vector fields with the tensorization property, i.e. for vector fields and scalar functions over :
[TABLE]
A.2 Manifold Derivatives, Geodesics, Parallel Transport
A.2.1 Covariant derivative
Given two vector fields and , the covariant derivative, also called the Levi-Civita connection, is a bilinear operator with the following properties:
[TABLE]
where is the action of vector field on scalar function . Importantly, the property that differentiates the covariant derivative from other kinds of derivatives over a manifold is that the covariant derivative of the metric is zero, i.e., for any vector field . In other words, we have the following intuitive rule:
[TABLE]
Moreover, the covariant derivative has the property of being torsion free, meaning that for vector fields :
[TABLE]
where is the Lie bracket of defined as the unique vector field that satisfies
[TABLE]
for every smooth function .
In a local chart with variable , if one represents , where are the basis vector fields, and , the covariant derivative is given by
[TABLE]
The Christoffel symbols are the representations of the Levi-Civita derivatives of the basis :
[TABLE]
and are given by the following formula:
[TABLE]
Above, refers to the entry of the inverse of the metric. In the following lemma, we calculate the Christoffel symbols on a Hessian manifold, i.e. a manifold whose metric is the Hessian of a convex function.
Lemma A.1.
On a Hessian manifold with metric we have
[TABLE]
Proof.
Since the manifold is Hessian, we have
[TABLE]
where is just the notation that we use for Hessian manifolds.
∎
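For reference, the computation behind Lemma A.1 can be written out as a short worked equation. This is a sketch using the standard coordinate formula for the Christoffel symbols; the symbol \phi for the convex potential with g = \nabla^2 \phi is notation assumed here and may differ from the paper's.

```latex
% Standard coordinate formula for the Levi-Civita connection:
\Gamma^{k}_{ij}
  = \frac{1}{2} \sum_{l} g^{kl}
    \left( \partial_i g_{jl} + \partial_j g_{il} - \partial_l g_{ij} \right).
% On a Hessian manifold, g_{ij} = \partial_i \partial_j \phi, so every first
% derivative of the metric is the fully symmetric third derivative of \phi:
\partial_i g_{jl} = \partial_j g_{il} = \partial_l g_{ij}
  = \partial_i \partial_j \partial_l \phi .
% Two of the three terms cancel, leaving
\Gamma^{k}_{ij}
  = \frac{1}{2} \sum_{l} g^{kl} \, \partial_i \partial_j \partial_l \phi .
```

The cancellation relies only on the symmetry of third partial derivatives, which is why only first derivatives of the metric appear in the Christoffel symbols of a Hessian manifold.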
A.2.2 Parallel Transport
The notion of parallel transport of a vector along a curve can be generalized from Euclidean space to a manifold. On a manifold, parallel transport is a vector field restricted to such that . By this definition, for two parallel transport vector fields we have that their dot product is preserved, i.e., .
A.2.3 Geodesic
A geodesic is a curve on that is a "locally shortest path", i.e., the tangent to the curve is parallel transported along the curve: ( denotes the time derivative of the curve .) Writing this in a chart, one can see it is a second-order nonlinear ODE which locally has a unique solution given initial location and speed.
[TABLE]
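As a concrete instance of the geodesic ODE, the sketch below integrates it on a one-dimensional Hessian manifold. It assumes the interval (0, 1) with the metric g(x) = 1/x^2 + 1/(1-x)^2, the Hessian of the log barrier -log x - log(1-x); in one dimension the geodesic equation reads x'' + Gamma(x) (x')^2 = 0 with Gamma = g' / (2g). The example, the Runge-Kutta integrator, and all names are illustrative choices and are not taken from the paper.

```python
def metric(x):
    """Hessian of the log barrier -log(x) - log(1 - x) on (0, 1)."""
    return 1.0 / x ** 2 + 1.0 / (1.0 - x) ** 2


def metric_derivative(x):
    return -2.0 / x ** 3 + 2.0 / (1.0 - x) ** 3


def christoffel(x):
    """One-dimensional Christoffel symbol Gamma = g' / (2 g)."""
    return metric_derivative(x) / (2.0 * metric(x))


def geodesic_rhs(x, v):
    """First-order form of x'' + Gamma(x) x'^2 = 0."""
    return v, -christoffel(x) * v * v


def rk4_step(x, v, h):
    k1x, k1v = geodesic_rhs(x, v)
    k2x, k2v = geodesic_rhs(x + 0.5 * h * k1x, v + 0.5 * h * k1v)
    k3x, k3v = geodesic_rhs(x + 0.5 * h * k2x, v + 0.5 * h * k2v)
    k4x, k4v = geodesic_rhs(x + h * k3x, v + h * k3v)
    x_new = x + h * (k1x + 2.0 * k2x + 2.0 * k3x + k4x) / 6.0
    v_new = v + h * (k1v + 2.0 * k2v + 2.0 * k3v + k4v) / 6.0
    return x_new, v_new


def integrate_geodesic(x0, v0, total_time, num_steps):
    h = total_time / num_steps
    x, v = x0, v0
    for _ in range(num_steps):
        x, v = rk4_step(x, v, h)
    return x, v


def squared_speed(x, v):
    """Squared length of the velocity in the metric; constant along a geodesic."""
    return metric(x) * v * v


x0, v0 = 0.5, 1.0
x1, v1 = integrate_geodesic(x0, v0, total_time=0.2, num_steps=2000)
print(squared_speed(x0, v0), squared_speed(x1, v1))
```

The two printed values should agree up to integration error, reflecting that the velocity of a geodesic is parallel transported along it and therefore keeps its norm in the metric.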
A.2.4 Riemann Tensor
The Riemann tensor is a particular tensor on the manifold which arises from the covariant derivative. In particular, it is a linear mapping from defined as
[TABLE]
The Riemann tensor can be calculated in a chart given the following formula:
[TABLE]
In the following Lemma, we calculate the Riemann tensor on a Hessian manifold:
Lemma A.2.
The Riemann tensor is given by
[TABLE]
Proof.
We consider the terms in Equation (85) one by one. For the first term
[TABLE]
Similarly
[TABLE]
Hence
[TABLE]
For the third and fourth terms
[TABLE]
Combining Equations (86) and (88) and plugging into (85) completes the proof. ∎
A.2.5 Ricci tensor
The Ricci tensor is just the trace of the Riemann tensor with respect to the second and third components or first and fourth components, i.e. the trace of the operator :
[TABLE]
Equivalently, if is an orthogonal basis in the tangent space, we have
[TABLE]
Lemma A.3 (Form of the Ricci tensor on Hessian manifolds).
On a Hessian manifold, the Ricci tensor is given by
[TABLE]
Proof.
Using the form of the Riemann tensor in (85) and the definition of the Ricci tensor in (89)
[TABLE]
Therefore, for arbitrary vector and
[TABLE]
∎
A.2.6 Exponential Map
The exponential at point is a map from to , defined as the point obtained on a geodesic starting from with initial speed , after time . We use to denote the point after going on a geodesic starting from with initial velocity , after time .
Lemma A.4 (Commuting derivatives).
Given a family of curves for and , we have
[TABLE]
Proof.
Let and be the standard vector fields in the two dimensional space . Then, we know
[TABLE]
where is the Lie bracket. ∎
A.3 Hessian manifolds
In this work we are working with a specific class of manifolds whose metric is imposed by the Hessian of our hybrid barrier. A nice property of Hessian manifolds is that the terms in the Riemann tensor which depend on the second derivative of the metric cancel out, and we end up with just the first derivative and the metric itself. Specifically, for a Hessian manifold recall from Lemmas A.1, A.2, and A.5 we have the following equations for the Christoffel symbols, the Riemann tensor, and the Ricci tensor:
[TABLE]
As we mentioned, the change of the determinant of the Jacobian matrices regarding the Hamiltonian family between and is related to the rate of change of the Ricci tensor on the manifold. In Lemma A.5 below, we concretely calculate the Ricci tensor for a Hessian manifold in the Euclidean chart, based on the metric and its derivatives.
Lemma A.5 (Form of Ricci tensor on Hessian manifolds).
On a Hessian manifold, the Ricci tensor is given by
[TABLE]
We use the formula of the Ricci tensor on the manifold in Section 5.2 and bound its derivative to bound the rate of change of the pushforward density of RHMC going from to in Section 5.2.2. Note that we only need to have a multiplicative control over the change of density of a sampled Gaussian vector on the destination point on the manifold, as we move from to .
Appendix B Hamiltonian Curves and Fields on Manifold
Here we recall the formulation of the Hamiltonian curve based on covariant differentiation. We start from the definition of the Hamiltonian ODE for the potential :
[TABLE]
Differentiating the first equation with respect to and then using the second equation, we get
[TABLE]
which implies
[TABLE]
But the left-hand side of Equation (91) matches the definition of the Christoffel symbols in Lemma A.1. To see this, note that
[TABLE]
where is the th entry of . Moreover
[TABLE]
Hence, from the definition of the Christoffel symbols and its expansion in Equation (A.2.1), we see
[TABLE]
where is covariant differentiation and we view as a vector in the tangent space of . We define the right-hand side of the above equation as the bias of Hamiltonian Monte Carlo:
[TABLE]
Proof of Lemma 1.6.
We start from the ODE of HMC:
[TABLE]
Taking the covariant derivative in direction :
[TABLE]
Now we apply the definition of the Riemann tensor. Namely, for arbitrary vector fields , we have
[TABLE]
Setting and , we first observe that because they are just the application of the differential of to the standard vectors and in . Applying this to the above, we get
[TABLE]
But note that because and are the image of the differential of applied to and , we have
[TABLE]
Applying Equation (93) to Equation (92):
[TABLE]
Noting the definition of the operator completes the proof. ∎
Appendix C Third order strong self-concordance of the metric
The goal of this section is to prove the following lemma.
Lemma C.1 (Infinity norm self-concordance for the Lewis-p-weights barrier).
The Lewis-p-weights barrier, defined in (3), is third-order strongly self-concordant with respect to the local norm , i.e., at any point on the Hessian manifold with metric given by the Hessian of the Lewis-p-weights barrier , we have
[TABLE]
We first handle the derivatives in directions and of the term in Lemma 3.3. We state the final result regarding the term in Lemma 3.8, which we prove below.
Proof of Lemma 3.8.
Throughout the proof below, the terms refer to the subterms obtained by differentiating the term by , which are stated in Lemma 3.4. Note that the term itself is a subterm of the derivative of in direction , which is stated in Lemma 3.3.
terms
The first subterm of the term that we consider is the term as defined in Equation 3.4.
term
[TABLE]
For the first part (1), using Lemma D.7:
[TABLE]
For the second part (2), note that
[TABLE]
where we denote the large block in the middle by for simplicity. Combining Lemmas D.6 and D.7:
[TABLE]
which implies
[TABLE]
Overall, we conclude
[TABLE]
For (3):
[TABLE]
(4) and (5) are similar. Term (7) is also similar to Equation after applying Lemma D.14. Next, we move on to term.
term
[TABLE]
Note that if we differentiate any of the or , then handling those terms is similar to Equation .
term [1] is similar to and.
term [2] is similar to Equation and after using Lemma D.14.
term [3] the first part is similar to Equation. For the second part
[TABLE]
which, similarly to , can be upper bounded by
[TABLE]
as desired.
term [4]: the first part is similar to Equation combined with the trick in (95). For the second part:
[TABLE]
term [5]: the first part is similar to Equation and the second part is similar to (96).
term [6] part 1 is similar to [3] part 2, and part 2 is similar to term part 2.
term [7] is similar to.
term [8]: the first part is similar to using the trick in (95). For the second part of term [8]:
[TABLE]
term [9] is similar to what we did for [9]. term [10]: the first part is similar to using the trick in (95). For the second part:
[TABLE]
term
[TABLE]
term [1] is similar to .
term [2] is similar to .
term [3] is handled by Lemma D.14.
term [4] first part is similar to part 1. term [4] part 2 is similar to . term [4] part 3 is similar to part 2. For term [4] parts 4 and 5:
[TABLE]
term [5] is similar to .
term [6]:
[TABLE]
term [7]: similar to [6].
term [8]:
[TABLE]
term
[TABLE]
These terms are similar to .
term
[TABLE]
These terms are similar to .
term
[TABLE]
where for simplicity, we have used the notation indicating all possible symmetric combinations of that term with respect to , , and .
term [1]: considering the quadratic form on this term, note that on the left we get . Now we can just reduce this term to to conclude
[TABLE]
term [2]: similar to .
term [3:1]: Noting the fact that
[TABLE]
and using Lemma D.7 this term is similar to (24).
term [3:2]: note that this term is equal to
[TABLE]
which is similar to .
term [3:3], [3:4], [3:5], [3:6]: similar to .
term [4:1] is similar to .
term [4:2], [4:4], [4:5] similar to .
term [4:3] similar to
term [5] is similar to [4].
term [6] is also similar to and .
term
[TABLE]
This term is similar to as detailed in Lemma 3.4.
term
[TABLE]
We have already handled this term: regarding the differentiation of any term with respect to , we can instead first take that derivative with respect to and then take the derivative of with respect to , which produces the .
Now, based on the form of the metric written in Lemma 3.1, we first focus on the last term . Note that above, in handling all the derivatives in directions and of the term, we have bounded all the third-order derivative terms of that have at least one derivative acting on the terms in . Hence, for this term, it remains to take derivatives only with respect to and the 's, which we do next. Again, the sums mean that we consider all the terms corresponding to all the permutations of for the current term.
[TABLE]
is handled in a similar way as .
To handle the rest of the derivatives more conveniently at this point, we consider the second form of the metric in Equation 3.1. First we aim to handle all the possible derivatives in three directions which differentiate the terms at least once. Taking a single derivative in direction of the term results in the terms and in Lemma 3.3.
But using Lemmas D.11 and D.9 and a technique similar to the one used above, these terms are bounded by plus and minus constant multiples of the matrix .
Next, we move on to the other terms in the formulation of in 3.1, namely , , , and . Third-order self-concordance of is a direct consequence of Lemma D.12. Terms and are handled by Lemma D.11, and is handled by Lemma D.10. ∎
Appendix D Derivative Stability Lemmas
D.1 Infinity norm comparisons
Here we establish control over the infinity-to-infinity norm, i.e. , of the matrix , which is a crucial property that we use throughout the proof to derive our derivative estimates with respect to the norm.
Lemma D.1.
For , given any vector and , we have
[TABLE]
Proof.
Set . Then
[TABLE]
Now suppose , which implies that for the maximizing index we have
[TABLE]
But note that
[TABLE]
hence
[TABLE]
On the other hand
[TABLE]
The contradiction finishes the proof. ∎
D.2 Löwner Inequalities
In this section, we derive important estimates on the derivatives of the fundamental matrix quantities that arise, such as that we defined, and use them in our proof of strong self-concordance.
Lemma D.2.
We have
[TABLE]
Proof.
For the matrix we have
[TABLE]
and similarly
[TABLE]
∎
Lemma D.3.
We have
[TABLE]
Proof.
For the first inequality, note that the sum of the entries of the th row of the matrix is equal to . Hence, the matrix is a Laplacian, so it is positive semi-definite. The second inequality follows from the fact that is PSD. For the third inequality, we use the fact that :
[TABLE]
∎
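The Laplacian argument in the proof above can be checked numerically. The sketch below is an illustration rather than the paper's construction: it assumes the matrix in question is $\Sigma - P^{(2)}$, where $P$ is the orthogonal projection onto the column space of a (rescaled) constraint matrix, $\Sigma$ is the diagonal matrix of its leverage scores, and $P^{(2)}$ is the entrywise square of $P$. Since $P$ is symmetric and idempotent, $\sum_j P_{ij}^2 = P_{ii}$, so each row of $\Sigma - P^{(2)}$ sums to zero and its off-diagonal entries are non-positive.

```python
import numpy as np

def projection_matrix(A):
    """Orthogonal projection onto the column space of A (A assumed to have full column rank)."""
    Q, _ = np.linalg.qr(A)
    return Q @ Q.T

def leverage_laplacian(A):
    """Return Sigma - P^{(2)}: leverage scores on the diagonal minus the entrywise square of P."""
    P = projection_matrix(A)
    return np.diag(np.diag(P)) - P**2

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    L = leverage_laplacian(rng.standard_normal((10, 3)))
    print("largest |row sum|:", np.abs(L.sum(axis=1)).max())
    print("smallest eigenvalue:", np.linalg.eigvalsh(L).min())
```

Zero row sums together with non-positive off-diagonal entries make the matrix diagonally dominant, which is the reason it is positive semi-definite.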
Lemma D.4.
For the derivatives of and at some point we have
[TABLE]
Proof.
Directly from Lemmas D.2 and D.13. ∎
Lemma D.5.
[TABLE]
Proof.
We use the terms of the derivative of in direction (according to Lemma D.14) and differentiate them one by one with respect to :
[TABLE]
Now from Lemmas D.1, D.21, and D.13 we have
[TABLE]
[TABLE]
The third and fourth terms are similar to the first and second terms, respectively. For the fifth term:
[TABLE]
The derivatives of the other terms are handled in a similar way. ∎
Lemma D.6.
For a symmetric matrix with , we have
[TABLE]
Proof.
For arbitrary vectors , using the inequality with and :
[TABLE]
∎
Lemma D.7.
For diagonal matrices (not necessarily positive) we have
[TABLE]
Proof.
Consider the Cholesky decomposition of :
[TABLE]
Then for the first inequality, note that we can write as
[TABLE]
Hence, for arbitrary vector :
[TABLE]
For the second inequality, note that
[TABLE]
which implies
[TABLE]
Therefore
[TABLE]
Now again using Equation (97):
[TABLE]
∎
Lemma D.8.
Given a matrix and arbitrary diagonal matrices and and arbitrary vector :
[TABLE]
Proof.
Simply by Cauchy–Schwarz:
[TABLE]
∎
Lemma D.9.
For matrices and we have
[TABLE]
Proof.
Note that
[TABLE]
But using Lemma D.7:
[TABLE]
On the other hand, note that
[TABLE]
simply by checking the operator norm of the LHS. Hence
[TABLE]
On the other hand,
[TABLE]
Therefore, by the Schur product theorem
[TABLE]
Moreover,
[TABLE]
and note that
[TABLE]
Hence, by Lemma D.6
[TABLE]
Finally, note that from Lemma D.15:
[TABLE]
All the inequalities that we wrote also hold in the other direction with a negative sign. Combining all the inequalities concludes the proof for . As is also a linear combination of and , using the exact same bounds we can obtain the conclusion for as well. ∎
Lemma D.10.
We have
[TABLE]
Proof.
We have
[TABLE]
Note that from Lemma D.6, a generic term in the above is of the form
[TABLE]
for diagonal matrices and , such that
[TABLE]
Hence, combining Lemmas D.6 and D.7, we get
[TABLE]
Similarly, we can show
[TABLE]
∎
Lemma D.11.
We have
[TABLE]
Proof.
Directly from Lemmas D.12 and D.10. ∎
Lemma D.12.
We have
[TABLE]
Proof.
For the first term of the first derivative in (100), further taking the derivative with respect to :
[TABLE]
where we used Lemmas D.21 and D.17. For the second term in (100):
[TABLE]
For the third term:
[TABLE]
where for this term we also used Lemma D.15.
Finally, the last term is handled in the same way as in the proof of Lemma D.15 for handling . ∎
Lemma D.13.
We have
[TABLE]
In particular, for random we have with high probability
[TABLE]
Moreover
[TABLE]
Proof.
Note that $\mathbf{W}'_{x,v} = -2\,\mathrm{Diag}\big(\mathbf{\Lambda}_{x} r_{x,v}\big)$. Using Lemma D.1, we have . Hence, for every :
[TABLE]
which completes the proof. For random , just note that
[TABLE]
For the -norm, we also use Lemma 7.4 to upper bound the infinity norm by the -norm. ∎
Lemma D.14.
For the derivative of in direction we have
[TABLE]
Proof.
We can write
[TABLE]
But note that from Lemma D.1 we have and from Lemma D.21 we have
[TABLE]
which completes the proof. ∎
Lemma D.15.
We have
[TABLE]
Proof.
We consider :
[TABLE]
Now from Lemmas D.18 and D.1:
[TABLE]
∎
Lemma D.16.
We have
[TABLE]
Proof.
We use the notation below to consider all the permutations among , , and .
[TABLE]
But note that in general for diagonal matrices we have from Lemmas D.18, D.19, and D.20:
[TABLE]
as the proof of Lemma D.19 can be generalized to arbitrary diagonal matrices and in place of and . The proof is complete. ∎
Lemma D.17.
We have
[TABLE]
Proof.
Directly from Lemma D.16, noting the fact that both and are linear combinations of and . ∎
Lemma D.18.
We have
[TABLE]
Proof.
[TABLE]
∎
Lemma D.19.
We have
[TABLE]
Proof.
Observe that the 2-norm of the th row of the matrix is at most . This is because
[TABLE]
Now note that
[TABLE]
∎
Lemma D.20.
We have
[TABLE]
Proof.
Note that by Cauchy–Schwarz
[TABLE]
∎
Lemma D.21.
We have
[TABLE]
Proof.
Note that
[TABLE]
Now from Lemma D.18, we know
[TABLE]
Now, similarly to Lemma D.1, we can show
[TABLE]
On the other hand, note that
[TABLE]
so similarly we can argue
[TABLE]
Finally, as both and are combinations of and matrices, this completes the proof. ∎
Lemma D.22.
We have
[TABLE]
D.3 Norm of the bias
Lemma D.23.
We have
[TABLE]
Proof.
For the first part
[TABLE]
from Lemma 5.8. For the second part, writing as an expectation
[TABLE]
we have for independent :
[TABLE]
where we used Lemma D.27. This completes the proof. ∎
D.4 Comparison between leverage scores
Lemma D.24.
Let
[TABLE]
Then
[TABLE]
which implies
[TABLE]
Proof.
Simply note that , which implies
[TABLE]
∎
D.5 Norm comparison between covariant and normal derivatives
Lemma D.25.
Given a family of Hamiltonian curves in the interval where is nice, with , we have
[TABLE]
Proof.
From Lemma 5.1 we have along the curve, so by Lemma 23 in [21] (note that the condition is satisfied) we get
[TABLE]
But now from Lemma D.26
[TABLE]
As usual, our parameterization in has unit norm, so , and from the niceness of the curve , which completes the proof. ∎
Lemma D.26.
For a vector field and arbitrary vector at a point , denoting by , we have
[TABLE]
Proof.
We have
[TABLE]
so
[TABLE]
∎
D.6 Log barrier infinity self-concordance
Proof of Lemma 3.9.
The log barrier metric is
[TABLE]
Its directional derivative is given by
[TABLE]
which can be bounded as
[TABLE]
Similarly, the second and third directional derivatives of are given by
[TABLE]
which can be bounded as
[TABLE]
This completes the proof. ∎
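The first-order bound in this proof can be sanity-checked numerically. The following sketch is an illustration under stated assumptions, not the paper's code: it takes the polytope as $\{x : Ax < b\}$ with slacks $s = b - Ax$ and $A_x = S^{-1}A$, so that $g(x) = A_x^\top A_x$ and $Dg(x)[v] = 2 A_x^\top \mathrm{Diag}(A_x v) A_x$ (with the opposite sign convention for the slacks the derivative changes sign, but the two-sided bound is unaffected). It then checks that $-2\|A_x v\|_\infty\, g \preceq Dg(x)[v] \preceq 2\|A_x v\|_\infty\, g$. All function names are hypothetical.

```python
import numpy as np

def log_barrier_metric(A, b, x):
    """g(x) = A_x^T A_x with A_x = S^{-1} A and slacks s = b - A x (assumed convention)."""
    s = b - A @ x
    Ax = A / s[:, None]
    return Ax.T @ Ax

def metric_directional_derivative(A, b, x, v):
    """Closed form Dg(x)[v] = 2 A_x^T Diag(A_x v) A_x."""
    s = b - A @ x
    Ax = A / s[:, None]
    return 2.0 * Ax.T @ ((Ax @ v)[:, None] * Ax)

def infinity_norm_gap(A, b, x, v):
    """Smallest eigenvalue of c*g - Dg[v] and c*g + Dg[v] with c = 2*||A_x v||_inf.

    A nonnegative value (up to round-off) means the two-sided bound holds at (x, v).
    """
    s = b - A @ x
    c = 2.0 * np.max(np.abs((A @ v) / s))
    g = log_barrier_metric(A, b, x)
    dg = metric_directional_derivative(A, b, x, v)
    return min(np.linalg.eigvalsh(c * g - dg).min(), np.linalg.eigvalsh(c * g + dg).min())

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    A, b, x = rng.standard_normal((12, 4)), np.ones(12), np.zeros(4)
    for _ in range(3):
        print(infinity_norm_gap(A, b, x, rng.standard_normal(4)))
```

The bound follows because $\mathrm{Diag}(A_x v)$ lies between $\pm\|A_x v\|_\infty I$, and conjugating by $A_x$ preserves the Löwner order.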
D.7 Other helper Lemmas
Lemma D.27.
For vector , we have with high probability
[TABLE]
Proof.
Directly from Gaussian moment bounds. ∎
Lemma D.28.
For the -Lewis weights barrier , we have
[TABLE]
Proof.
The proof is given in [19]. ∎
Lemma D.29.
For any positive integer , vector , and matrix we have
[TABLE]
Proof.
Directly from the fact that if , then for any matrix we have . ∎
Lemma D.30.
For operator , we have .
Proof.
We have
[TABLE]
∎
Lemma D.31.
For vector field on manifold , we have
[TABLE]
Proof.
We have
[TABLE]
where in the last line we used Lemma D.27. ∎
Lemma D.32.
For arbitrary vector field on we have
[TABLE]
Proof.
We can write
[TABLE]
Lemma D.33.
For vector field we have
[TABLE]
[TABLE]
But note that
[TABLE]
Hence
[TABLE]
where we used Lemma D.27 and Lemma 7.4. ∎
Appendix E Remaining Proofs
E.1 Proof of Theorem 2.7
Consider a subset with . Then, to show a lower bound for -conductance, we need to lower bound
[TABLE]
where is the probability that we are in the set and escape it in the next step of the Markov chain, and is the probability measure corresponding to . Recall that is the Markov kernel, specifying the distribution of the next step given that we are at point . Now assume that the conductance bound does not hold, i.e., there exists such with
[TABLE]
Note that because the chain is reversible, we have
[TABLE]
and because , we have
[TABLE]
Next, define the set to be the points from which our chance of escaping is at least . Now if , then given that we are in , we have at least chance of escaping, which contradicts (103). This means
[TABLE]
On the other hand, note that for point with for , we have
[TABLE]
which means cannot be in , hence it should be in . Therefore, defining the set as the set of points outside which are close to a point in , we have
[TABLE]
On the other hand, from isoperimetry (because ) and the fact that we have
[TABLE]
Therefore, from the assumption :
[TABLE]
which implies from Equations (105) and (106):
[TABLE]
which proves that the conductance is lower bounded by .
E.2 Properties of Lewis weights
In this section, we recall some properties of Lewis weights which we use in the proof.
Lemma E.1 (Fixed point property of Lewis weights).
The Lewis weights of the matrix are the unique vector in with $W = \mathrm{Diag}(w)$ such that
[TABLE]
where denotes the leverage scores of the matrix.
Proof.
Recall the definition of Lewis weights as the optimum of the objective in Equation (16). Taking the derivative with respect to , we get
[TABLE]
where is the vector of leverage scores defined as
[TABLE]
∎
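To illustrate the fixed point property, the sketch below computes $\ell_p$ Lewis weights of a generic matrix $A$ and checks that $w = \sigma(W^{1/2-1/p}A)$, where $\sigma$ denotes leverage scores. This is not the procedure used in the paper: it swaps in the simple fixed-point iteration $w_i \leftarrow \big(a_i^\top (A^\top W^{1-2/p} A)^{-1} a_i\big)^{p/2}$ of Cohen and Peng, which is known to contract only for $p < 4$, so the example uses $p = 3$. The function names, the iteration count, and the starting point are illustrative assumptions.

```python
import numpy as np

def leverage_scores(B):
    """Leverage scores of B: squared row norms of an orthonormal basis of its column space."""
    Q, _ = np.linalg.qr(B)
    return np.sum(Q**2, axis=1)

def lewis_weights(A, p, iters=200):
    """Fixed-point iteration for l_p Lewis weights (assumes p < 4 and A of full column rank)."""
    m, n = A.shape
    w = np.full(m, n / m)  # any positive starting point works for a contraction
    for _ in range(iters):
        M = A.T @ (w[:, None] ** (1.0 - 2.0 / p) * A)          # A^T W^{1-2/p} A
        tau = np.einsum("ij,ij->i", A @ np.linalg.inv(M), A)   # a_i^T M^{-1} a_i
        w = tau ** (p / 2.0)
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    A, p = rng.standard_normal((20, 4)), 3.0
    w = lewis_weights(A, p)
    residual = w - leverage_scores(w[:, None] ** (0.5 - 1.0 / p) * A)
    print("fixed-point residual:", np.abs(residual).max())
    print("sum of weights:", w.sum())
```

At a fixed point the weights are themselves leverage scores of a full-column-rank matrix, so they sum to the number of columns of $A$.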
Proof of Lemma 3.1.
The first form of the Lewis weight metric follows directly from Equation 5.5 in Lemma 31 of [19]. To see why the second form in Equation (21) holds, note that
[TABLE]
Hence
[TABLE]
which implies
[TABLE]
Plugging Equation (107) into the first form in Equation (20) completes the proof. ∎
Proof of Lemma 3.1.
The first formulation follows from [18]. To show the second formulation, recall the definition of :
[TABLE]
Plugging the above into the first formulation results in the second formulation. ∎
Proof of Lemma 2.1.
Directly from Lemma 31 in [19]. ∎
Lemma E.2 (Gradient of the Lewis weights barrier).
The gradient of the Lewis weights barrier is given by
[TABLE]
Proof.
Taking directional derivative in direction , using the chain rule
[TABLE]
But because is the maximizer of $\big(-\log\det(\mathrm{A}_{x}^{\top}W^{1-2/p}\mathrm{A}_{x})+(1-2/p)\mathbbm{1}^{\top}w\big)$, the second term is zero and the proof is complete. ∎
E.2.1 Proof of Lemma 3.3
To differentiate in direction , we differentiate, one by one, each of the matrices in the product in the formula for . Starting from , we use the first formulation in Equation (20) and get the term. Next, differentiating and in , we get
[TABLE]
which is the term. Furthermore, differentiating with respect to , we get and terms. Finally note that the derivative of is:
[TABLE]
Therefore, differentiating the part in we get the , , and terms.
E.3 Derivative of
Lemma E.3.
The derivative of the term defined in Lemma 3.3, ignoring the constants, is equal to
[TABLE]
Appendix F Self-concordance Parameter of
Here we provide a bound for the self-concordance parameter of .
Lemma F.1 (Self-concordance parameter of ).
For our hybrid barrier , the self-concordance parameter , defined as
[TABLE]
is bounded by .
Proof.
Note that for the Lewis weights and log barrier parts of the barrier we can bound the barrier parameter separately as
[TABLE]
Now for the log barrier part, we have
[TABLE]
and for the Lewis weight barrier part, from Lemmas E.2 and 2.1:
[TABLE]
Combining Equations (108) and (109) completes the proof. ∎
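For the log barrier part, the bound can be checked numerically. The sketch below is an illustration under assumptions that are not taken from the paper: it uses the usual definition of the self-concordance parameter as the supremum over $x$ of $\nabla\phi(x)^\top (\nabla^2\phi(x))^{-1}\nabla\phi(x)$, the unscaled log barrier $\phi(x) = -\sum_{i=1}^m \log(b_i - a_i^\top x)$, and the convention $Ax < b$. In that setting $\nabla\phi = A_x^\top \mathbb{1}$ and $\nabla^2\phi = A_x^\top A_x$, so the quantity equals $\mathbb{1}^\top P \mathbb{1} \le m$ for the projection $P$ onto the column space of $A_x$. The function name is hypothetical.

```python
import numpy as np

def log_barrier_nu(A, b, x):
    """Evaluate grad(phi)^T (hess phi)^{-1} grad(phi) for phi(x) = -sum_i log(b_i - a_i^T x)."""
    s = b - A @ x
    assert np.all(s > 0), "x must be strictly inside the polytope"
    Ax = A / s[:, None]                 # rescaled constraint matrix A_x = S^{-1} A
    grad = Ax.T @ np.ones(len(b))       # gradient of the log barrier
    return grad @ np.linalg.solve(Ax.T @ Ax, grad)

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    A, b = rng.standard_normal((12, 4)), np.ones(12)
    for x in (np.zeros(4), 0.01 * np.ones(4)):
        print(log_barrier_nu(A, b, x), "<=", len(b))
```

Any scaling of the log barrier inside the hybrid barrier would rescale this quantity accordingly; the sketch only covers the unscaled case.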
F.1 Iteration complexity of Gaussian Cooling
Proof of Corollary 1.1.1.
First, note that from Lemma F.1, is self-concordant with self-concordance parameter . The Gaussian cooling schedule introduced by the authors in [21] can be used to relax the requirement of a warm start for our sampling algorithm, and hence to obtain an efficient volume algorithm. The idea is that sampling from Gibbs distributions with smaller variance or larger is easier, so one can start by sampling at a large temperature and gradually decrease it. The Gaussian cooling of [21] evolves in phases, where in the th phase it generates approximate samples from the density proportional to inside the polytope, where
[TABLE]
and the update rule for is
[TABLE]
starting from until goes above . Note that the temperature parameter is given by . Now at each phase, going from temperature to , we have approximate samples from , which can be used as warm starts for sampling from , especially as . Hence, our main Theorem 1.1 implies that the mixing time of sampling at each phase is of order
[TABLE]
Now in the first case when , we have . On the other hand, due to the update rule of in this case, it takes phases to double , and in each phase we take samples. Hence, the total number of RHMC steps to double in this case is bounded by
[TABLE]
In the other case when , we have . Then, the total RHMC steps to double in this case can be upper bounded after substituting as
[TABLE]
This means we can calculate the integral of for any using steps of RHMC up to . Moreover, if we just want to sample from in the polytope, we do not need to take samples at phase but only need one sample, so the factor in the complexity is omitted and we end up with the complexity for sampling without a warm start. ∎
- [1] Dominique Bakry, Ivan Gentil, Michel Ledoux, et al. Analysis and geometry of Markov diffusion operators, volume 103. Springer, 2014.
- [2] W Ballmann. Riemannian geometry and geometric analysis by J. Jost; Riemannian geometry by P. Petersen; Riemannian geometry by T. Sakai. Bulletin of the American Mathematical Society, 37(4):459–466, 2000.
- [3] Jeff Cheeger, David G Ebin, and David Gregory Ebin. Comparison theorems in Riemannian geometry, volume 9. North-Holland Amsterdam, 1975.
- [4] Sinho Chewi. Log-concave sampling. Book draft available at https://chewisinho.github.io, 2022.
- [5] Sinho Chewi, Murat A Erdogdu, Mufan Li, Ruoqi Shen, and Shunshi Zhang. Analysis of Langevin Monte Carlo from Poincaré to Log-Sobolev. In Conference on Learning Theory (COLT), pages 1–2. PMLR, 2022.
- [6] Ben Cousins and Santosh Vempala. Gaussian cooling and $O^*(n^3)$ algorithms for volume and Gaussian volume. SIAM Journal on Computing, 47(3):1237–1273, 2018.
- [7] Arnak Dalalyan. Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. In Conference on Learning Theory (COLT), pages 678–689. PMLR, 2017.
- [8] Manfredo P Do Carmo. Differential geometry of curves and surfaces: revised and updated second edition. Courier Dover Publications, 2016.
