Sampling with Barriers: Faster Mixing via Lewis Weights

Khashayar Gatmiry; Jonathan Kelner; Santosh S. Vempala

arXiv:2303.00480·cs.DS·April 20, 2023


Open Access

TL;DR

This paper improves the mixing rate bounds for Riemannian Hamiltonian Monte Carlo sampling of polytopes by introducing a hybrid barrier, leveraging new geometric analysis and extending self-concordance concepts.

Contribution

It introduces a hybrid Lewis weights and log barrier for RHMC, achieving faster mixing bounds and developing new geometric analysis tools for Markov chains on manifolds.

Findings

01 Mixing rate improved to $\tilde{O}(m^{1/3}n^{4/3})$

02 Developed a framework for analyzing Hamiltonian curves on Riemannian manifolds

03 Extended self-concordance to the infinity norm for sharper bounds


Tables

Table 1: The complexity of uniformly sampling a polytope from a warm start. All algorithms have a logarithmic dependence on the warm start parameter and each uses $\widetilde{O}(n)$ bits of randomness. The entries marked # are for general convex bodies presented by oracles, while the rest are for polytopes. The additive terms are pre-processing costs for rounding the polytope.

Year	Algorithm	Steps
1997 [13]	Ball walk^#	$n^{3}$ (+ $n^{5}$ )
2003 [24]	Hit-and-run^#	$n^{3}$ (+ $n^{4}$ )
2009 [14]	Dikin walk	$m n$
2017 [20]	Geodesic walk	$m n^{3 / 4}$
2018 [21]	RHMC with log barrier	$m n^{2 / 3}$
2020 [16]	Weighted Dikin walk	$n^{2}$
2021 [12]	Ball walk^#	$n^{2}$ (+ $n^{3}$ )
This paper	RHMC with Hybrid barrier	$m^{1 / 3} n^{4 / 3}$

Equations

\frac{dx}{dt}

\frac{dv}{dt}

H(x,v) = f(x) + \frac{1}{2} v^{\top} g^{-1}(x) v + \frac{1}{2} \log\det g(x),

\phi_{p}(x) \triangleq \log\det(A_{x}^{\top} W_{x}^{1-2/p} A_{x}),

\phi(x) \triangleq -(\frac{m}{n})^{\frac{2}{p+2}} (\log\det A_{x}^{\top} W_{x}^{1-2/p} A_{x} + \frac{n}{m} \sum_{i} \log(s_{i})),

\min\{\alpha^{-1} n^{2/3} + \alpha^{-1/3} n^{5/9} m^{1/9} + n^{1/3} m^{1/6}, n^{4/3} m^{1/3}\}.

\tilde{O}(m^{1/3} n^{4/3}).

O(m^{1/3} n^{4/3} \log(M/\epsilon) \log(M))

\tilde{O}(m^{1/3} n^{4/3} \log(1/\epsilon)),

\max\{\frac{1}{n} (\frac{n}{m})^{\frac{1}{p+2}}, \alpha\}.

-\|v\|_{g} g \preccurlyeq \mathrm{D}g(v) \preccurlyeq \|v\|_{g} g,

-\|v\|_{g} g \preccurlyeq \mathrm{D}g(v) \preccurlyeq \|v\|_{g} g

-\|v\|_{g} \|z\|_{g} g \preccurlyeq \mathrm{D}^{2}g(v,z) \preccurlyeq \|v\|_{g} \|z\|_{g} g

-\|v\|_{g} \|z\|_{g} \|u\|_{g} g \preccurlyeq \mathrm{D}^{3}g(v,z,u) \preccurlyeq \|v\|_{g} \|z\|_{g} \|u\|_{g} g.

-\|v\|_{x,\infty} g \preccurlyeq \mathrm{D}g(v) \preccurlyeq \|v\|_{x,\infty} g,

-\|v\|_{x,\infty} \|z\|_{x,\infty} g \preccurlyeq \mathrm{D}^{2}g(v,z) \preccurlyeq \|v\|_{x,\infty} \|z\|_{x,\infty} g,

-\|v\|_{x,\infty} \|z\|_{x,\infty} \|u\|_{x,\infty} g \preccurlyeq \mathrm{D}^{3}g(v,z,u) \preccurlyeq \|v\|_{x,\infty} \|z\|_{x,\infty} \|u\|_{x,\infty} g.

\mathrm{TV}(\mathcal{T}_{x_{0}}, \mathcal{T}_{x_{1}}) \leq 0.01,

\nabla_{\gamma'(t)} \gamma'(t) = \mu(\gamma(t)).

\mu(x) \triangleq g^{-1} \mathrm{D}f(x) - \frac{1}{2} g(x)^{-1} \mathrm{tr}[g(x)^{-1} \mathrm{D}g(x)].

\nabla_{\gamma'(t)} \gamma'(t) = 0.

J(t) = \partial_{s}\gamma_{0}(t) = \partial_{s}\gamma_{s}(t)\Big|_{s=0},

D_{t}^{2} J(t) = R(J, \gamma'(t)) \gamma'(t).

J'(0) = D_{s}\frac{d}{dt}\gamma_{s}(t)\Big|_{s=0,t=0} = D_{s}v_{\gamma_{s}}\Big|_{s=0}.

\mathrm{Ricci}(\gamma'(0), \gamma'(0)) = \sum_{i=1}^{n} \langle e_{i}, R(e_{i}, \gamma'(0)) \gamma'(0) \rangle.

\forall u \in T_{x}(\mathcal{M}),\ M_{x}(u) \triangleq \nabla_{u}\mu(x),

\Phi(t) \triangleq R(., \gamma'(t)) \gamma'(t) + M_{\gamma'(t)}.

\tilde{J}''(t) = \Phi(t) \tilde{J}(t),

\|\Phi(t)\|_{F} \leq R_{1}.

|\mathrm{D}(\mathrm{tr}(\Phi(t)))(z)| \leq R_{2} \|z\|_{g}.

\|\Phi(t) \zeta(t)\|_{g} \leq R_{3}.


Taxonomy

Topics: Markov Chains and Monte Carlo Methods · Topological and Geometric Data Analysis · Statistical Methods and Inference

Full text

Sampling with Barriers: Faster Mixing via Lewis Weights

Khashayar Gatmiry (MIT; part of this work was done while visiting Georgia Tech and supported by NSF award CCF-2007443), Jonathan Kelner (MIT), Santosh S. Vempala (Georgia Tech; supported in part by NSF awards CCF-2007443 and CCF-2106444).

Abstract

We analyze Riemannian Hamiltonian Monte Carlo (RHMC) for sampling a polytope defined by $m$ inequalities in $\mathbb{R}^{n}$ endowed with the metric defined by the Hessian of a convex barrier function. The advantage of RHMC over Euclidean methods such as the ball walk, hit-and-run and the Dikin walk is in its ability to take longer steps. However, in all previous work, the mixing rate has a linear dependence on the number of inequalities. We introduce a hybrid of the Lewis weights barrier and the standard logarithmic barrier and prove that the mixing rate for the corresponding RHMC is bounded by $\tilde{O}(m^{1/3}n^{4/3})$, improving on the previous best bound of $\tilde{O}(mn^{2/3})$ (based on the log barrier). This continues the general parallels between optimization and sampling, with the latter typically leading to new tools and more refined analysis. To prove our main results, we have to overcome several challenges relating to the smoothness of Hamiltonian curves and the self-concordance properties of the barrier. In the process, we give a general framework for the analysis of Markov chains on Riemannian manifolds, derive new smoothness bounds on Hamiltonian curves, a central topic of comparison geometry, and extend self-concordance to the infinity norm, which gives sharper bounds; these properties appear to be of independent interest.

1 Introduction
1.1 Background and Related Work
1.2 Background on Riemannian Hamiltonian Monte Carlo
1.3 Results
1.4 Technical overview
2 Preliminaries
2.1 John Ellipsoid and Lewis weights
2.2 Markov chains
3 Hybrid barrier metric and second-order self-concordance
4 Bounding conductance and mixing time
5 On the Geometry and Stability of Hessian Manifolds
5.1 Bounding $R_{1}$
5.2 Bounding $R_{2}$
5.2.1 Bounding the change in Operator $M_{x}$
5.2.2 Bounding the change in the Ricci Tensor
5.3 Bounding $R_{3}$
6 Stability of Hamiltonian curves
6.1 Stability of the niceness property
6.2 High probability bound on norms along the Hamiltonian curve
7 Isoperimetry
A Riemannian Geometry
A.1 Basic Manifold Definitions
A.2 Manifold Derivatives, Geodesics, Parallel Transport
A.2.1 Covariant derivative
A.2.2 Parallel Transport
A.2.3 Geodesic
A.2.4 Riemann Tensor
A.2.5 Ricci tensor
A.2.6 Exponential Map
A.3 Hessian manifolds
B Hamiltonian Curves and Fields on Manifold
C Third order strong self-concordance of the metric
D Derivative Stability Lemmas
D.1 Infinity norm comparisons
D.2 Lowner Inequalities
D.3 Norm of the bias
D.4 Comparison between leverage scores
D.5 Norm comparison between covariant and normal derivatives
D.6 Log barrier infinity self-concordance
D.7 Other helper Lemmas
E Remaining Proofs
E.1 Proof of Theorem 2.7
E.2 Properties of Lewis weights
E.2.1 Proof of Lemma 3.3
E.3 Derivative of
F Self-concordance Parameter of $\phi$
F.1 Iteration complexity of Gaussian Cooling

1 Introduction

Generating nearly uniform random samples from a high-dimensional polytope is a fundamental algorithmic problem with a rich history and powerful applications, notably including the only known fully polynomial-time approximation schemes for computing a polytope’s volume. All efficient algorithms known for this problem work by designing a Markov chain whose stationary distribution is uniform over the polytope and showing that it mixes in a small number of steps.

In this paper, our main result is that we can construct such a Markov chain with an improved bound on its mixing time. For a polytope given by $m$ linear inequalities in $\mathbb{R}^{n}$, we describe a chain that mixes in $\tilde{O}\left(m^{1/3}n^{4/3}\right)$ steps, improving on the best previous bound of $\tilde{O}\left(mn^{2/3}\right)$. This allows us to approximate the volume within relative error $\epsilon$ using $\tilde{O}\left(m^{1/3}n^{4/3}/\epsilon^{2}\right)$ steps, which is a similar improvement over the best existing bound of $\tilde{O}\left(mn^{2/3}/\epsilon^{2}\right)$.
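As a quick check of when the new bound helps, the ratio of the two bounds can be worked out directly. Logarithmic factors are ignored here, and the choice $m=n^{2}$ is only an illustrative assumption:

```latex
\frac{m^{1/3}n^{4/3}}{m\,n^{2/3}} = \left(\frac{n}{m}\right)^{2/3},
\qquad\text{e.g. } m=n^{2}:\quad m^{1/3}n^{4/3}=n^{2}
\ \text{ versus }\ m\,n^{2/3}=n^{8/3}.
```

So the gain grows with the number of constraints per dimension and disappears as $m$ approaches $n$.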

1.1 Background and Related Work

In their seminal work [10], Dyer, Frieze and Kannan gave the first polynomial-time algorithm for this problem, as well as for the more general problem of sampling from a convex body specified by a membership oracle. The Markov chain in their algorithm was a grid walk, which takes steps along the edges of the graph obtained by intersecting the convex body with a discrete grid supported on $\delta\mathbb{Z}^{n}$ for some $\delta=1/\mathrm{poly}(n)$. This graph is heavily dependent on the coordinate system: its diameter is proportional to the diameter of the convex body, and its conductance can be arbitrarily small if the convex body is scaled so that it is very long in some directions but short in others. However, they showed that, if one changes to a basis in which the convex body is appropriately “well-rounded,” the grid walk mixes in polynomial time and that one can use a random sample from the grid to obtain one from the convex body.

The polynomial for the mixing time in [10] was quite large, and a sequence of later papers improved this by modifying the Markov chains and refining the analysis. Because one often wants to draw many samples from the body, these papers typically provide two bounds on the number of steps required: a bound when starting from an arbitrary point and including the cost of any preprocessing; and a bound when given a warm start, where the preprocessing has already been performed and the starting point is drawn from a distribution that is not too far from uniform.

In [13], Kannan, Lovász, and Simonovits showed that a ball walk whose steps are chosen uniformly from a Euclidean ball around the current point mixes in $\tilde{O}(n^{3})$ steps from a warm start and $\tilde{O}(n^{5})$ steps from an arbitrary starting point and including preprocessing. Later, Lovász and Vempala [24] studied the “hit-and-run” walk, which chooses a line in a random direction from the current point and then picks the next point randomly from the intersection of this line with the body, and they showed it also mixes in $\tilde{O}(n^{3})$ steps from a warm start but needs only $\tilde{O}(n^{4})$ steps for the first sample and preprocessing. These algorithms work on general convex bodies presented by oracles, but like the grid walk, they are strongly coordinate dependent, and they thus require strong additional assumptions about the coordinate system. In particular, analyses of these algorithms typically assume that the body is close to isotropic, i.e., that the covariance matrix of a random sample from the body is approximately the identity, and applying these algorithms to more general bodies requires costly preprocessing.
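To make the hit-and-run step concrete, here is a minimal sketch for the special case of an explicit bounded polytope $\{y : Ay\geq b\}$, where the chord can be computed in closed form instead of through a membership oracle. The function name `hit_and_run_step` and the cube used in the usage lines are illustrative assumptions, not something taken from [24].

```python
import numpy as np

def hit_and_run_step(A, b, x, rng):
    """One hit-and-run step inside the bounded polytope {y : A y >= b}.

    Assumes x is strictly interior, so every slack is positive.
    """
    u = rng.standard_normal(x.shape[0])
    u /= np.linalg.norm(u)            # uniformly random direction
    slack = A @ x - b                 # current slacks, all > 0
    rate = A @ u                      # change of each slack per unit step along u
    # Constraint i stays satisfied while slack_i + t * rate_i >= 0.
    with np.errstate(divide="ignore"):
        bounds = -slack / rate
    t_min = np.max(bounds[rate > 0], initial=-np.inf)   # lower end of the chord
    t_max = np.min(bounds[rate < 0], initial=np.inf)    # upper end of the chord
    t = rng.uniform(t_min, t_max)     # uniform point on the chord
    return x + t * u

# Usage: the unit cube [0, 1]^n written as A y >= b.
n = 5
A = np.vstack([np.eye(n), -np.eye(n)])
b = np.concatenate([np.zeros(n), -np.ones(n)])
rng = np.random.default_rng(0)
x = np.full(n, 0.5)
for _ in range(100):
    x = hit_and_run_step(A, b, x, rng)
```

The sketch samples the chord uniformly, which corresponds to a uniform target; it does nothing to address the isotropy assumptions discussed above.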

The dependence on the coordinate system in the aforementioned Markov chains comes from the dependence of the transition probabilities on the extrinsic geometry of the ambient Euclidean space. The impact of this extends beyond the overhead from the isotropy requirements. The geometry of the ambient space does not incorporate any information about how close a point is to the boundary, which typically leads to difficulties making progress with steps near the boundary. For example, if one is running a ball walk with step radius $\delta$ in an $n$-dimensional cube, and the current point is some distance $d\ll\delta$ from one of the corners, a random point from the radius-$\delta$ ball will lie outside the cube with probability exponentially close to 1, so naively trying random points until obtaining one in the cube would take a large number of tries. Moreover, even if one could sample a random point in the intersection of the ball with the cube, restricting the step to points inside the cube would distort the stationary distribution, and it would no longer be uniform. Remedying such difficulties typically involves (depending on the paper) some combination of taking smaller steps, enlarging the convex body (and failing if the walk ends up at a point outside the original body), and employing rejection sampling or a Metropolis filter to correct the stationary probabilities, all of which increase the required number of steps.
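The corner example can be checked numerically. At the corner itself a proposal stays in the cube only if every coordinate of the displacement is nonnegative, which happens with probability $2^{-n}$ when $\delta\leq 1$. The sketch below estimates the acceptance fraction by Monte Carlo; the function name and the parameter values are illustrative assumptions.

```python
import numpy as np

def ball_walk_acceptance(n, delta, d, trials, rng):
    """Fraction of uniform proposals from a radius-delta ball that stay in [0, 1]^n.

    The current point has every coordinate equal to d, i.e. it sits at
    distance d from each of the n facets meeting at the corner at the origin.
    """
    x = np.full(n, d)
    g = rng.standard_normal((trials, n))
    g /= np.linalg.norm(g, axis=1, keepdims=True)             # uniform directions
    r = delta * rng.uniform(size=(trials, 1)) ** (1.0 / n)    # radii of uniform points in the ball
    y = x + r * g
    inside = np.all((y >= 0.0) & (y <= 1.0), axis=1)
    return inside.mean()

rng = np.random.default_rng(0)
for n in (2, 5, 10):
    frac = ball_walk_acceptance(n, delta=0.1, d=1e-4, trials=20000, rng=rng)
    print(n, frac, 2.0 ** (-n))   # estimate next to the corner value 2^-n
```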

For polytopes specified by an explicit collection of linear constraints, one can use the barrier functions employed by interior point methods to design random walks whose steps depend only on the intrinsic geometry of the polytope and are independent of the basis chosen for the ambient space. The idea behind these random walks is to use the Hessian of the barrier function to define a local norm/Riemannian metric on the interior of the polytope and specify the steps in terms of the resulting geometry. This mitigates some of the problems described above and has led to Markov chains whose mixing times grow with the number of constraints but depend more mildly on the dimension.

In the first such work, Kannan and Narayanan [14] introduced the Dikin walk and gave a mixing time bound of $O(mn)$ from a warm start for a polytope with $m$ facets in $\mathbb{R}^{n}$ . This walk is similar to the ball walk, but it chooses its steps from Dikin ellipsoids, which are balls with respect to the Hessian of the standard logarithmic barrier function on the polytope. In [16], Laddha, Lee, and Vempala studied the analogous walk with respect to any self-concordant barrier and showed that it mixes in $\tilde{O}(n\bar{\nu})$ steps, where $\bar{\nu}$ is a parameter they called the barrier parameter. By bounding this parameter for a different barrier function (a variant of a barrier due to Lee and Sidford [18]), they obtained an improved mixing rate bound of $\tilde{O}(n^{2})$ .
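As an illustration of the Dikin ellipsoid, the following sketch builds the Hessian of the logarithmic barrier for $\{y : Ay\geq b\}$ and draws a proposal uniformly from the ellipsoid $\{y : (y-x)^{\top}H(x)(y-x)\leq r^{2}\}$. For $r\leq 1$ this ellipsoid lies inside the polytope, since $|a_{i}^{\top}(y-x)|/s_{i}\leq\|y-x\|_{H(x)}$ for every constraint $i$. The function names are assumptions, and the accept/reject rule of the actual walk in [14] is deliberately omitted.

```python
import numpy as np

def log_barrier_hessian(A, b, x):
    """Hessian of -sum_i log(a_i^T x - b_i), i.e. A^T S^{-2} A."""
    s = A @ x - b
    As = A / s[:, None]
    return As.T @ As

def dikin_proposal(A, b, x, radius, rng):
    """Uniform point from the Dikin ellipsoid of the given radius around x."""
    n = x.shape[0]
    H = log_barrier_hessian(A, b, x)
    L = np.linalg.cholesky(H)                     # H = L L^T
    z = rng.standard_normal(n)
    z *= radius * rng.uniform() ** (1.0 / n) / np.linalg.norm(z)   # uniform in Euclidean ball
    step = np.linalg.solve(L.T, z)                # step^T H step = ||z||^2 <= radius^2
    return x + step

# Usage on the unit cube: proposals with radius <= 1 never leave the cube.
n = 4
A = np.vstack([np.eye(n), -np.eye(n)])
b = np.concatenate([np.zeros(n), -np.ones(n)])
rng = np.random.default_rng(0)
x = np.array([0.1, 0.5, 0.9, 0.3])
y = dikin_proposal(A, b, x, radius=1.0, rng=rng)
```

Note how the ellipsoid automatically shrinks in directions where $x$ is close to a facet, which is the boundary awareness that the Euclidean ball lacks.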

In 2017, Lee and Vempala [20] reduced the mixing rate to $\tilde{O}\left(mn^{3/4}\right)$ using a process they called the geodesic walk. As in the Dikin walk, the steps are constructed using the Hessian of a barrier function. However, instead of using this to define a Euclidean ellipsoid, they use it to define a Riemannian metric, and they then solve a differential equation in each step to follow geodesics on the resulting manifold. These geodesics tend to curve away from the polytope’s boundary, which lets them take longer steps in each iteration.

In 2018, Lee and Vempala [21] improved this to $\tilde{O}\left(mn^{2/3}\right)$ using Riemannian Hamiltonian Monte Carlo (RHMC) [11], which is the class of processes we’ll use in this paper. While there is a large literature on using RHMC and related methods to sample smooth densities [7, 9, 5, 29, 22, 4], there are relatively few provable results about applying it in constrained non-smooth settings like polytope sampling. Roughly speaking, this improvement over the geodesic walk came from RHMC’s ability to avoid the use of a Metropolis filter, which the geodesic walk requires in order to obtain the correct stationary distribution (even when the target distribution is uniform). RHMC chooses its trajectories according to a different differential equation that, remarkably, yields a reversible random walk with the desired stationary distribution, thus eliminating the need for a Metropolis filter and allowing greater progress in each step.

Advances in self-concordant barriers in the past decade as well as the improvement in the analysis of the Dikin walk suggest that a smaller dependence on $m$ , the number of inequalities, which can be much higher than the dimension, should be possible. Nevertheless, improving on the bound of $mn^{2/3}$ has been a major open problem for the past 5 years. Moreover, the new techniques developed as a result of progress on non-Euclidean algorithms suggest that this is a fertile area for further TCS research.

1.2 Background on Riemannian Hamiltonian Monte Carlo

The motivation for RHMC comes from the Hamiltonian formulation of classical Newtonian mechanics. Hamiltonian mechanics parameterizes a physical system in terms of a position vector $x$ and a corresponding momentum vector $v$ (which is also referred to as “velocity” in some prior work on sampling polytopes with RHMC). The physics of the system are encoded in its Hamiltonian $H(x,v)$ , which is simply the energy of the system written as a function of $x$ and $v$ , and its time evolution is determined by Hamilton’s equations:

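$$\frac{dx}{dt}=\frac{\partial H(x,v)}{\partial v},\qquad \frac{dv}{dt}=-\frac{\partial H(x,v)}{\partial x}.$$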

With the appropriate choice of $H$ , these reproduce Newton’s laws of motion, but they also generalize quite broadly, including to Riemannian manifolds.

In RHMC, one defines a Markov chain by choosing a Hamiltonian that appropriately encodes the target distribution. At each step, the Markov chain chooses a random momentum vector and then finds the next point by numerically solving a differential equation to follow the trajectory given by Hamilton’s equations.

One can show that the value of the Hamiltonian (i.e., the energy) and the volume element in the space of pairs $(x,v)$ are conserved along the trajectory, which can be used to show that the trajectories are preserved by time reversal (i.e., running time backwards). One can then use this to show that, if one uses the Hamiltonian defined below, the marginal distribution of $x$ will converge to the desired target distribution without requiring a Metropolis filter. (See [11] for the derivation for general RHMC and [21] for the specific class of Hamiltonians given below.)
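Both conservation claims follow in one line from Hamilton's equations $\frac{dx}{dt}=\frac{\partial H}{\partial v}$, $\frac{dv}{dt}=-\frac{\partial H}{\partial x}$. The following is the standard textbook computation, included here as a reminder rather than as the argument of [11] or [21]:

```latex
% Energy is conserved along a trajectory:
\frac{d}{dt}H(x(t),v(t))
  = \frac{\partial H}{\partial x}\cdot\frac{dx}{dt}
  + \frac{\partial H}{\partial v}\cdot\frac{dv}{dt}
  = \frac{\partial H}{\partial x}\cdot\frac{\partial H}{\partial v}
  - \frac{\partial H}{\partial v}\cdot\frac{\partial H}{\partial x}
  = 0 .

% The flow is divergence-free in (x,v), so phase-space volume is conserved:
\sum_{i=1}^{n}\left(
  \frac{\partial}{\partial x_{i}}\frac{\partial H}{\partial v_{i}}
  - \frac{\partial}{\partial v_{i}}\frac{\partial H}{\partial x_{i}}
\right) = 0 .
```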

More precisely, let the Hamiltonian at a point $x\in\mathbb{R}^{n}$ for a vector $v\in\mathbb{R}^{n}$ be defined as

$$H(x,v)=f(x)+\frac{1}{2}v^{\top}g^{-1}(x)v+\frac{1}{2}\log\det g(x),$$

where $g(x)$ is a positive definite matrix defining a Riemannian metric at each point $x$ as $\|u\|_{g}\triangleq\|u\|_{g(x)}\triangleq\sqrt{u^{\top}g(x)u}$, and the target density to be sampled is proportional to $e^{-f}$ restricted to the support of $g$. One step of RHMC consists of the following: first pick the momentum $v$ from the Gaussian $\mathcal{N}(0,g(x))$, which is the conditional law of $v$ given $x$ under the density proportional to $e^{-H(x,v)}$; equivalently, the initial velocity $g(x)^{-1}v$ is distributed as $\mathcal{N}(0,g(x)^{-1})$. Then for time $\delta$ follow the Hamiltonian curve jointly on $(x,v)$:

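$$\frac{dx}{dt}=\frac{\partial H(x,v)}{\partial v},\qquad \frac{dv}{dt}=-\frac{\partial H(x,v)}{\partial x}. \tag{2}$$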

The final $x$ at time $\delta$ is the sampled point from the Markov kernel. A natural choice for the metric $g$ turns out to be the Hessian of a self-concordant barrier function inside the polytope $\mathcal{P}$. The standard logarithmic barrier, $\phi_{\ell}(x)=-\sum_{i=1}^{m}\log(a_{i}^{\top}x-b_{i})$, was used in [21] to prove that the resulting RHMC mixes in $\tilde{O}(mn^{2/3})$ steps. Improving on this bound is our motivating open problem.
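The following toy sketch runs this step in one dimension, on the interval $(0,1)$ with the log barrier metric $g(x)=x^{-2}+(1-x)^{-2}$ and $f=0$. In this case $\frac{dx}{dt}=v/g$ and $\frac{dv}{dt}=\frac{1}{2}v^{2}g'/g^{2}-\frac{1}{2}g'/g$. The classical RK4 integrator, the step count `n_sub`, and all function names are assumptions made for illustration; they are not the numerical scheme analyzed in [21].

```python
import numpy as np

def g(x):
    """Hessian of the log barrier -log(x) - log(1 - x) on (0, 1)."""
    return 1.0 / x**2 + 1.0 / (1.0 - x)**2

def dg(x):
    """Derivative of g."""
    return -2.0 / x**3 + 2.0 / (1.0 - x)**3

def hamiltonian(x, v):
    # f = 0 (uniform target): H = v^2 / (2 g) + (1/2) log g
    return 0.5 * v**2 / g(x) + 0.5 * np.log(g(x))

def vector_field(state):
    x, v = state
    dx = v / g(x)                                            # dH/dv
    dv = 0.5 * v**2 * dg(x) / g(x)**2 - 0.5 * dg(x) / g(x)   # -dH/dx
    return np.array([dx, dv])

def rhmc_step(x, delta, xi, n_sub=1000):
    """One RHMC step from x; xi is a standard normal draw supplied by the caller.

    Returns (new_x, initial_momentum, final_momentum).
    """
    v0 = np.sqrt(g(x)) * xi          # momentum ~ N(0, g(x))
    state = np.array([x, v0])
    h = delta / n_sub
    for _ in range(n_sub):           # classical fourth-order Runge-Kutta
        k1 = vector_field(state)
        k2 = vector_field(state + 0.5 * h * k1)
        k3 = vector_field(state + 0.5 * h * k2)
        k4 = vector_field(state + h * k3)
        state = state + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return state[0], v0, state[1]

# Usage: a short chain on (0, 1).
rng = np.random.default_rng(0)
x = 0.3
for _ in range(5):
    x, _, _ = rhmc_step(x, delta=1.0, xi=rng.standard_normal())
```

Because the energy is conserved and $\frac{1}{2}\log g(x)$ blows up at the endpoints, the exact trajectory can never reach the boundary, which is why no rejection step appears in the sketch.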

Using the log barrier implies that the mixing rate has a linear dependence on $m$ , the number of inequalities. So we have to look for a “better” barrier, and what exactly this entails will become clear presently. As we will see below, the barrier parameter of the self-concordant function, which is $m$ for the logarithmic barrier, plays an important role in the mixing time of this Markov chain. Given that there are efficiently-computable barriers for which this parameter is $O(n)$ [18], one might hope to obtain faster mixing by simply replacing the logarithmic barrier with one of these. However, it turns out that just bounding the barrier parameter is insufficient, and we need to choose a barrier that also possesses certain stronger smoothness and stability properties. One of our primary technical challenges will be to define a notion that is stringent enough to guarantee the stronger properties required while still admitting a construction that improves upon the logarithmic barrier.

1.3 Results

In this paper, we use a hybrid barrier based on the $p$-Lewis weights barrier $\phi_{p}$ defined as

$$\phi_{p}(x)\triangleq\log\det\left(\mathrm{A}_{x}^{\top}\mathbf{W}_{x}^{1-2/p}\mathrm{A}_{x}\right),$$

where $\mathbf{W}_{x}$ is a diagonal matrix whose diagonal entries are the $p$ -Lewis weights of the rescaled matrix $\mathrm{A}_{x}=S_{x}^{-1}\mathrm{A}$ and $S_{x}$ is the diagonal matrix whose entries are the slacks at point $x$ , i.e., $(S_{x})_{ii}=a_{i}^{\top}x-b_{i}$ .
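For readers who want to experiment with $\phi_{p}$, here is a minimal numerical sketch. It assumes the standard characterization of the $p$-Lewis weights of a matrix $M$ with rows $a_{i}$ as the solution of $w_{i}^{2/p}=a_{i}^{\top}(M^{\top}W^{1-2/p}M)^{-1}a_{i}$, and computes them with a plain fixed-point iteration, which is known to converge for $p<4$. The function names, the iteration count, and the dense linear algebra are illustrative assumptions rather than the paper's procedure.

```python
import numpy as np

def lewis_weights(M, p, iters=100):
    """Fixed-point iteration for the p-Lewis weights of the rows of M (0 < p < 4)."""
    m, n = M.shape
    w = np.full(m, n / m)
    for _ in range(iters):
        G = M.T @ (w[:, None] ** (1.0 - 2.0 / p) * M)           # M^T W^{1-2/p} M
        tau = np.einsum("ij,ij->i", M @ np.linalg.inv(G), M)    # a_i^T G^{-1} a_i
        w = tau ** (p / 2.0)
    return w

def lewis_barrier(A, b, x, p):
    """phi_p(x) = log det(A_x^T W_x^{1-2/p} A_x) with A_x = S_x^{-1} A."""
    s = A @ x - b
    Ax = A / s[:, None]
    w = lewis_weights(Ax, p)
    G = Ax.T @ (w[:, None] ** (1.0 - 2.0 / p) * Ax)
    _, logdet = np.linalg.slogdet(G)
    return logdet

# Usage on the unit cube: the barrier value grows as x approaches a facet.
n = 3
A = np.vstack([np.eye(n), -np.eye(n)])
b = np.concatenate([np.zeros(n), -np.ones(n)])
center = lewis_barrier(A, b, np.full(n, 0.5), p=3.0)
near_facet = lewis_barrier(A, b, np.array([0.01, 0.5, 0.5]), p=3.0)
```

At a fixed point the weights are the leverage scores of $W^{1/2-1/p}M$, so they sum to $n$ for a full-rank $M$; this gives a cheap sanity check on the iteration.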

We define a hybrid barrier $\phi$ for a polytope as follows.

Definition 1 (Hybrid barrier).

We define the hybrid barrier $\phi$ inside a polytope $Ax\geq b$ as

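$$\phi(x)\triangleq-\left(\frac{m}{n}\right)^{\frac{2}{p+2}}\left(\log\det\mathrm{A}_{x}^{\top}\mathbf{W}_{x}^{1-2/p}\mathrm{A}_{x}+\frac{n}{m}\sum_{i}\log(s_{i})\right), \tag{4}$$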

where $s_{i}=a_{i}^{\top}x-b_{i}$ are the slacks at point $x$ . We denote the normalizing factor of $\phi$ by $\alpha_{0}\triangleq(\frac{m}{n})^{\frac{2/p}{1+2/p}}$ .
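The two ways of writing the normalizing exponent agree, as the short computation below shows. The limiting value at $p=4$ is given only for orientation:

```latex
\frac{2/p}{1+2/p} = \frac{2/p}{(p+2)/p} = \frac{2}{p+2},
\qquad\text{so}\qquad
\alpha_{0} = \left(\frac{m}{n}\right)^{\frac{2}{p+2}},
\qquad
\left.\frac{2}{p+2}\right|_{p=4} = \frac{1}{3}.
```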

For background on Lewis weights see Section 2. Our main theorem is a bound on the mixing rate of RHMC with this hybrid barrier.

Theorem 1.1 (Mixing).

Given a polytope $\mathcal{P}$, let $\pi$ be the distribution with density proportional to $e^{-\alpha\phi(x)}$ over the open set inside $\mathcal{P}$. Then, RHMC with stationary distribution $\pi$ on the manifold of the open set inside $\mathcal{P}$ equipped with metric $g$ defined by the Hessian of the hybrid barrier $\phi$ with $p=4-(1/\log(m))$ has mixing rate bounded by

$$\min\left\{\alpha^{-1}n^{2/3}+\alpha^{-1/3}n^{5/9}m^{1/9}+n^{1/3}m^{1/6},\ n^{4/3}m^{1/3}\right\}.$$

In particular, for the uniform distribution over $\mathcal{P}$ (with $\alpha=0$ ), the mixing rate is

$$\tilde{O}\left(m^{1/3}n^{4/3}\right).$$

More specifically, the Markov chain starting at $\pi_{0}$ reaches $\pi_{t}$ with TV-distance at most $\epsilon$ to the target after

$$O\left(m^{1/3}n^{4/3}\log(M/\epsilon)\log(M)\right)$$

steps, where $M\triangleq\sup_{x\in\mathcal{P}}\frac{d\pi_{0}(x)}{d\pi(x)}$ and $\tilde{O}$ hides $\mathrm{polylog}(m)$ factors.

Note that without a warm start, the $\log(M)$ dependence in Theorem 1.1 could contribute another factor of $n$ to the mixing time. However, applying the Gaussian Cooling framework [6] extended to manifolds [21] lets us sample from $e^{-\alpha\phi}$ for any $\alpha$ without a warm start penalty, and also allows us to compute the volume of the polytope without a significant overhead.

Corollary 1.1.1 (Any start; Volume).

For the manifold Gaussian Cooling scheme in [21] with the hybrid barrier (4) applied to sample from the density $e^{-\alpha\phi(x)}$ inside a given polytope starting from $\arg\min\phi(x)$ , the total number of RHMC steps for any $\alpha\geq 0$ is bounded by

$$\tilde{O}\left(m^{1/3}n^{4/3}\log(1/\epsilon)\right).$$

Moreover, to compute the integral of $e^{-\alpha\phi}$ in the polytope and in particular the volume of the polytope up to multiplicative error $1\pm\epsilon^{\prime}$ , the total number of RHMC steps is bounded by $\tilde{O}(m^{1/3}n^{4/3}/\epsilon^{\prime 2})$ .

This improves on the previous best bound of $mn^{2/3}$ due to [21] based on the standard logarithmic barrier. The proof of Theorem 1.1 requires the development of several technical ingredients. We summarize a few that are likely to be of independent interest.

The first is a new isoperimetric inequality for this hybrid barrier (see Section 2.2 for the definition of isoperimetry).

Theorem 1.2 (Isoperimetry of Hybrid Barrier).

Let $g$ be a metric corresponding to the Hessian of the hybrid barrier, with support given by a polytope defined by $m$ inequalities in $\mathbb{R}^{n}$.

Then for $\alpha\geq 0$ , the distribution with density proportional to $e^{-\alpha\phi}$ has isoperimetric constant at least

$$\max\left\{\frac{1}{n}\left(\frac{n}{m}\right)^{\frac{1}{p+2}},\ \alpha\right\}.$$

As part of the proof, we develop stronger self-concordance properties of the Lewis weight barrier. The usual self-concordance [25] for a barrier $\phi$ implies a control on the third-order derivative of $\phi$ by its second derivative, which can be seen as a property of the metric $g=\nabla^{2}\phi$,

$$-\|v\|_{g}\,g\preccurlyeq\mathrm{D}g(v)\preccurlyeq\|v\|_{g}\,g,$$

where $\mathrm{D}g(v)$ is the directional derivative of $g$ along direction $v$ . We will need to extend this self-concordance to third-order derivatives of $g$ . These types of estimates for the derivatives of the metric are known as Calabi estimates in the Differential Geometry literature [27, 30].

Lemma 1.3 (Manifold self-concordance of Hybrid barrier).

The hybrid barrier is third-order self-concordant with respect to the manifold’s metric $g$ , namely

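$$-\|v\|_{g}\,g\preccurlyeq\mathrm{D}g(v)\preccurlyeq\|v\|_{g}\,g$$

$$-\|v\|_{g}\|z\|_{g}\,g\preccurlyeq\mathrm{D}^{2}g(v,z)\preccurlyeq\|v\|_{g}\|z\|_{g}\,g$$

$$-\|v\|_{g}\|z\|_{g}\|u\|_{g}\,g\preccurlyeq\mathrm{D}^{3}g(v,z,u)\preccurlyeq\|v\|_{g}\|z\|_{g}\|u\|_{g}\,g.$$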

Here $\preccurlyeq$ is the Löwner ordering between matrices ignoring logarithmic factors. The Calabi-type estimates in Lemma 1.3 turn out to be insufficient to improve the mixing rate. Hence, as one of our main contributions, we develop a new type of self-concordance, where instead of the local norm $\|.\|_{g}$, we measure the spectral change of the metric in a different local norm $\|.\|_{x,\infty}$. An intuitive description of $\|.\|_{x,\infty}$ is via its unit ball; namely, $\|.\|_{x,\infty}$ is the unique norm whose unit ball is the symmetrized polytope $\mathcal{P}\cap(2x-\mathcal{P})$ around $x$, as illustrated in Figure 1(a). ($2x-\mathcal{P}$ is the reflection of $\mathcal{P}$ around $x$.)

Lemma 1.4 (Infinity norm Third-order Self-concordance of Hybrid barrier).

The hybrid barrier, defined in (4), is third-order self-concordant with respect to the local infinity norm $\|.\|_{x,\infty}$ . Namely,

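$$-\|v\|_{x,\infty}\,g\preccurlyeq\mathrm{D}g(v)\preccurlyeq\|v\|_{x,\infty}\,g,$$

$$-\|v\|_{x,\infty}\|z\|_{x,\infty}\,g\preccurlyeq\mathrm{D}^{2}g(v,z)\preccurlyeq\|v\|_{x,\infty}\|z\|_{x,\infty}\,g,$$

$$-\|v\|_{x,\infty}\|z\|_{x,\infty}\|u\|_{x,\infty}\,g\preccurlyeq\mathrm{D}^{3}g(v,z,u)\preccurlyeq\|v\|_{x,\infty}\|z\|_{x,\infty}\|u\|_{x,\infty}\,g.$$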

In fact, for each facet $i$, consider the ratio of the change in the distance to facet $i$ after taking a step $v$ to the current distance to facet $i$; the norm $\|.\|_{x,\infty}$ is the maximum of this ratio over all facets. These estimates will allow us to prove important smoothness properties of certain quantities on the manifold that we are interested in. In the following, we sometimes refer to our notion of strong third-order self-concordance as infinity norm self-concordance, as it involves the local norm $\|.\|_{x,\infty}$.
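The local infinity norm is easy to compute from this description: $\|v\|_{x,\infty}=\max_{i}|a_{i}^{\top}v|/s_{i}$ with slacks $s_{i}=a_{i}^{\top}x-b_{i}$. The sketch below compares it with the local norm $\|v\|_{g}$ for a Gaussian $v\sim\mathcal{N}(0,g^{-1})$. As a simplifying assumption it uses the log barrier metric $g=\mathrm{A}^{\top}S_{x}^{-2}\mathrm{A}$ in place of the hybrid barrier metric, and the function names and the random test polytope are illustrative.

```python
import numpy as np

def local_inf_norm(A, b, x, v):
    """||v||_{x,inf} = max_i |a_i^T v| / s_i for the polytope {y : A y >= b}."""
    return np.max(np.abs(A @ v) / (A @ x - b))

def log_barrier_norm(A, b, x, v):
    """||v||_g for the log barrier metric g = A^T S^{-2} A (an assumed stand-in)."""
    return np.linalg.norm((A @ v) / (A @ x - b))

def sample_tangent_gaussian(A, b, x, rng):
    """Draw v ~ N(0, g^{-1}) for the log barrier metric g."""
    s = A @ x - b
    As = A / s[:, None]
    L = np.linalg.cholesky(As.T @ As)                     # g = L L^T
    return np.linalg.solve(L.T, rng.standard_normal(x.shape[0]))

# Usage: a polytope with many random constraints around the origin.
rng = np.random.default_rng(0)
n, m = 50, 500
A = rng.standard_normal((m, n))
b = -np.ones(m)                      # x = 0 is strictly feasible: A @ 0 - b = 1 > 0
x = np.zeros(n)
v = sample_tangent_gaussian(A, b, x, rng)
print(local_inf_norm(A, b, x, v), log_barrier_norm(A, b, x, v))
```

For the log barrier metric one always has $\|v\|_{x,\infty}\leq\|v\|_{g}$, and $\|v\|_{x,\infty}\leq 1$ holds exactly when both $x+v$ and $x-v$ are feasible, which is the symmetrized-polytope description of the unit ball.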

1.4 Technical overview

Mixing and Conductance.

Our general approach to bounding the mixing rate is based on bounding the conductance [23]. The standard approach to bounding the conductance of geometric walks of this type is to show an isoperimetric inequality for the underlying metric space and then prove that steps of the random walk behave well with respect to the underlying metric. Formally, we show two properties for the manifold $\mathcal{M}$ obtained by equipping the interior of the polytope $\mathcal{P}$ with the metric $g=\nabla^{2}\phi$ :

• Isoperimetry. The target density $e^{-\alpha\phi(x)}$ has a good isoperimetry constant on $\mathcal{M}$.

• One-step Coupling. The one-step distributions of the Markov chain given two close-by points $x_{0},x_{1}$ on the manifold are close in TV-distance. Namely, for some parameter $\delta>0$, after excluding a tiny set $\mathcal{S}^{c}\subseteq\mathcal{M}$, given any two points $x_{0},x_{1}\in\mathcal{S}$ with $d(x_{0},x_{1})\leq\delta$ we show

$$\mathrm{TV}(\mathcal{T}_{x_{0}},\mathcal{T}_{x_{1}})\leq 0.01,$$

where $\mathcal{T}_{x}$ denotes the Markov kernel starting from $x$ .

Isoperimetry.

The log barrier metric gives an isoperimetric coefficient of $1/\sqrt{m}$ , which leads to a factor of $m$ in the conductance. In principle, this can be improved to $\tilde{O}(n)$ by using a barrier with barrier parameter $\nu=\tilde{O}(n)$ , as the general bound on the isoperimetry is $1/\sqrt{\nu}$ for any strongly self-concordant barrier with barrier parameter $\nu$ [17]. While the universal and entropic barriers have $\nu=O(n)$ , they are expensive to compute. The LS barrier [18] has $\nu=\tilde{O}(n)$ while being efficient to compute. However, as we will see in more detail, as far as we know, the metric and its derivatives are not “smooth” enough in most of the directions in the tangent space, which means we would have to take rather small steps while running RHMC.

We will prove that the hybrid barrier has significantly better isoperimetry (Thm. 1.2) than the log barrier while maintaining sufficient smoothness.

Smoothness of Hamiltonian Curves and Comparison Geometry.

The starting point of our analysis is the fact that one can look at the ordinary differential equation of RHMC in Equation (2) as a second-order ODE on the manifold $\mathcal{M}$ of the open set inside the polytope with metric $g$ . We will introduce this alternative form shortly. Looking at the Markov Kernel $\mathcal{T}_{x_{0}}$ of RHMC for a fixed point $x_{0}$ , the randomness to define this kernel comes from the initial velocity $v_{0}$ , which can be viewed as a vector on the tangent space of $x_{0}$ on the manifold $\mathcal{M}$ distributed as a standard Gaussian with respect to the local metric, namely $\mathcal{N}(0,g(x)^{-1})$ in the Euclidean chart. In order to show the One-step Coupling (Lemma 6) for the Markov kernel of RHMC, we bound the difference between the densities $\mathcal{T}_{x_{0}}(y)$ and $\mathcal{T}_{x_{1}}(y)$ at a given point $y$ on the manifold. These densities are the pushforwards of the Gaussian density in the tangent space of $x_{0}$ and $x_{1}$ respectively, onto the manifold through the Hamiltonian map $Ham^{\delta}(x_{0},v_{x_{0}})$ for some fixed time $\delta$ , which maps the initial velocity $v_{x_{0}}$ to the solution of the ODE $y=x(\delta)$ at time $\delta$ . The key to bound the change of density is to understand how the Hamiltonian curves vary as we change the initial point from $x_{0}$ to $x_{1}$ for a fixed destination $y$ , given the particular geometry imposed by our hybrid barrier inside a polytope. In fact, understanding the extremal scenarios of the behavior of geometric quantities on a certain class of manifolds is the topic of Comparison Geometry [3] [26] [2]. In particular, to argue that the Hamiltonian curve changes sufficiently slowly, we need the metric $g$ of the manifold and its derivatives to be “stable”. 
The simplest form of stability of the metric is the so-called self-concordance property, namely, $g$ is self-concordant if the derivative of $g(x)$ in a unit direction in the tangent space is controlled by $g$ itself. This type of self-concordance for the first derivative of the metric is already known for the $p$ -Lewis weights barrier [19]. However, this notion of stability is too weak for our use since a typical Gaussian vector $v$ in the tangent space of $x$ has norm of order $\|v\|_{g}\sim\sqrt{n}$ . Nonetheless, one can hope to obtain estimates for $Dg(v)$ with respect to a different norm whose value is typically much smaller than the $\|.\|_{g}$ norm. We show that self-concordance of the metric of the $p$ -Lewis weights barrier for $p<4$ with respect to the infinity norm of a re-parameterized version of $v$ is effective for characterizing the stability of Hamiltonian curves. This local infinity norm, which we denote by $\|.\|_{x,\infty}$ , can be regarded as the maximum ratio of the length of $v$ projected onto the normal of a facet divided by the distance of $x$ from that facet; its unit ball is the symmetrized polytope $\mathcal{P}\cap 2x-\mathcal{P}$ around $x$ . Importantly, one can see that for a typical Gaussian vector $v\sim\mathcal{N}(0,g^{-1})$ , $\|v\|_{x,\infty}$ is of order $\tilde{O}(1)$ instead of $\sqrt{n}$ . In fact, the $\|.\|_{x,\infty}$ norm of the tangent vector to the RHMC curve remains small for all times with high probability. This is favorable as we need a bound on the rate of change of the density only for typical values of $v$ and can ignore sets with small probability in bounding the conductance. An important part of our contribution is to derive self-concordance estimates for the derivatives of the metric of the $p$ -Lewis weights for $p<4$ up to third order, with respect to this $\|.\|_{x,\infty}$ local norm. 
We introduce our approach up to second order self-concordance in Section 3 and defer the third-order self-concordance to Appendix C. Although the number of terms created by differentiating the Lewis weights metric up to third order grows quite large, many subtensors are common, which enables us to treat them in a similar fashion. To avoid repetition, we gather the common Löwner inequalities that we use for various matrices in Appendix D, which we reuse to prove the self-concordance of the $p$ -Lewis weights barrier. The infinity norm third-order self-concordance of the hybrid barrier follows from combining the infinity norm third-order self-concordance of the $p$ -Lewis weights barrier and the log barrier (see Section 3).
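As a rough numerical illustration of the gap between the two norms (not part of the analysis), the following sketch samples $v\sim\mathcal{N}(0,g^{-1})$ and compares $\|v\|_{g}$ with $\|v\|_{x,\infty}$. It makes several assumptions that are ours and not the paper's: the metric is the plain log barrier Hessian rather than the hybrid barrier, the polytope is written as $\{y: Ay<b\}$ with random data, and the names `A`, `b`, `A_x`, `S` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 400, 20                      # hypothetical sizes: m constraints, n dimensions

# Hypothetical polytope {y : A y < b} that contains x = 0 (all slacks positive).
A = rng.standard_normal((m, n))
b = rng.uniform(1.0, 2.0, size=m)
x = np.zeros(n)

slack = b - A @ x                   # slack of x with respect to each facet
A_x = A / slack[:, None]            # rows of A rescaled by the slacks
g = A_x.T @ A_x                     # Hessian of the (unscaled) log barrier at x

# Sample v ~ N(0, g^{-1}): if g = L L^T then v = L^{-T} z with z ~ N(0, I).
L = np.linalg.cholesky(g)
Z = rng.standard_normal((n, 2000))
V = np.linalg.solve(L.T, Z)         # each column is one sample of v

S = A_x @ V                         # reparameterized vectors s_{x,v} = A_x v
norm_g = np.linalg.norm(S, axis=0)  # ||v||_g = ||s_{x,v}||_2 for this metric
norm_inf = np.abs(S).max(axis=0)    # ||v||_{x,inf} = ||s_{x,v}||_inf

mean_sq_g = float(np.mean(norm_g ** 2))   # concentrates around n
mean_inf = float(np.mean(norm_inf))       # stays much smaller than sqrt(n)
print(mean_sq_g, np.sqrt(n), mean_inf)
```

In this toy setting the squared $g$-norm concentrates around $n$, while the local infinity norm is always bounded by the $g$-norm and is typically far smaller.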

The $p<4$ threshold is essential to obtain our estimates. In particular, we can still control the derivative of the metric $Dg(v)$ with respect to $\|v\|_{g}$ for the LS barrier, which is a $p$ -Lewis weights barrier for polylogarithmically large $p$ , but $\|v\|_{g}$ is an overestimate of the $\|.\|_{x,\infty}$ norm with high probability for a Gaussian vector in the tangent space of $x$ . On the other hand, for small $p$ the ellipsoid of the $p$ -Lewis weights does not approximate the symmetrized polytope as well as it does for larger $p$ ; in particular, a large portion of the ellipsoid lies outside the symmetrized polytope. This means that we need to scale down the unit norm ellipsoid so that it fits inside the polytope, which in turn means we have to scale it up by a larger constant to make it contain the symmetrized polytope. As a result, the barrier parameter is large (see [16] for the definition of the barrier parameter), which results in a poor isoperimetric constant.

We would like to have an ellipsoid at each point $x$ inside the polytope that approximates the symmetrized polytope around $x$ more accurately and is also stable as $x$ moves in random directions. For this, we go back to an idea of Vaidya from optimization and use a hybrid barrier by “regularizing” the $p$ -Lewis weights barrier for $p<4$ with the standard log barrier. We can give a better bound on the barrier parameter of this hybrid barrier compared to the log barrier, which implies that the corresponding metric has better isoperimetry. Moreover, the regularization does not harm the stability of the metric, as the log barrier already enjoys stability with respect to the local infinity norm $\|.\|_{x,\infty}$ . In particular, we show that our hybrid barrier has stable higher-order derivatives in arbitrary directions based on the local norm $\|.\|_{x,\infty}$ . The particular choice of our barrier is essential to simultaneously prove third order infinity-norm self-concordance and good isoperimetry.

Hamiltonian curves and variations.

To see the high-level idea of how we show the one-step coupling of the Markov kernel, consider the shortest path between two points $x_{0}$ and $x_{1}$ , which is a geodesic on the manifold. Geodesics are a generalization of straight lines in Euclidean space to arbitrary manifolds and naturally define the curve with the smallest possible length between two points on the manifold. Let the curve $\gamma_{s}$ , parameterized by $s\in[0,s^{\prime}]$ , be a length-minimizing geodesic connecting $x_{0}=\gamma_{0}$ to $x_{1}=\gamma_{s^{\prime}}$ with distance $d(x_{0},x_{1})$ . Suppose that running the Hamiltonian ODE with initial location $x_{0}\in\mathcal{P}$ and initial velocity $v_{x_{0}}$ up to time $\delta$ takes us to a point $y$ on the manifold. As we start moving toward $x_{1}$ along the geodesic $\gamma_{s}$ , we consider the variation of the initial Hamiltonian curve, namely a family of Hamiltonian curves parameterized by $s$ , where the $s$ -curve starts from the point $\gamma_{s}$ , perhaps with a different initial velocity $v_{\gamma_{s}}$ , but ends at the same destination $y$ at time $\delta$ . The geodesic $\gamma_{s}$ from $x_{0}$ to $x_{1}$ and the corresponding Hamiltonian curves are illustrated in Figure 2.

Looking at the value of the density $\mathcal{T}_{\gamma_{s}}(y)$ at point $y$ after taking one step of the Markov chain starting from $\gamma_{s}$ , we observe that it depends on two major components: (1) the Gaussian density of the initial velocity $v_{\gamma_{s}}$ , which is proportional to $\exp{\{-\frac{\|v_{\gamma_{s}}\|_{g}^{2}}{2}\}}$ , and (2) the determinant of the Jacobian (the differential) of the map from the initial velocity $v_{\gamma_{s}}$ to the destination point $y$ , denoted by $J^{v_{\gamma_{s}}}_{y}$ . Therefore, to study how quickly the density $\mathcal{T}_{\gamma_{s}}(y)$ changes from $x_{0}$ to $x_{1}$ , we need to study the rate of change of the initial velocities $v_{\gamma_{s}}$ and the Jacobians $J^{v_{\gamma_{s}}}_{y}$ ; the latter will depend on the rate of change of the Ricci tensor on the manifold. To study the variation of the Hamiltonian curve, we start by defining these manifold concepts.

As we mentioned earlier, one can identify the location variable $x$ in the Hamiltonian ODE (2) as a point on the manifold $\mathcal{M}$ with metric $g$ , and the velocity variable $v$ as a vector in the tangent space of $x$ , $T_{x}(\mathcal{M})$ . Then, one can write the Hamiltonian ODE in Equation (2) as a second-order ODE on the manifold $\mathcal{M}$ using the covariant derivative of $\mathcal{M}$ , illustrated in Lemma 1.5. For background on Riemannian geometry and covariant differentiation, we refer the reader to Appendix A.

Lemma 1.5.

The Hamiltonian ODE in Equation 2 can be written using the covariant derivative of the manifold in a simplified form:

[TABLE]

Above, $\nabla$ is the covariant derivative and $\mu(x)$ is the bias (drift) vector field of the Hamiltonian curve, defined as

[TABLE]

In the above notation, $\texttt{tr}[g(x)^{-1}\mathrm{D}g(x)]$ is a vector whose $i$ th entry is $\texttt{tr}[g(x)^{-1}\mathrm{D}_{i}g(x)]$ . See Appendix B for a proof of Lemma 1.5. The above ODE (7) for Hamiltonian curves is similar to the second order ODE for geodesics; for the latter the bias vector $\mu$ is zero, i.e., the geodesic Equation is given by [8]

$$\nabla_{\gamma^{\prime}(t)}\gamma^{\prime}(t)=0.\qquad(9)$$

In physics, the Hamiltonian ODE in Equation (7) is important as it models the motion of a particle on a manifold moving under a force field determined by $\mu$ . Next, we define the notion of a family of Hamiltonian curves.

Definition 2 (Family of Hamiltonian curves).

We say $\big{(}\gamma_{s}(t)\big{)}$ is a family of Hamiltonian curves ending at some fixed $y$ whose starting point varies from $x_{0}=\gamma_{0}(0)$ to $x_{1}=\gamma_{s_{1}}(0)$ if for every fixed time $0\leq s\leq s_{1}$ , $\gamma_{s}(t)$ is a Hamiltonian curve in $t$ , and $\gamma_{s}(0)$ as a function of $s$ is a geodesic on $\mathcal{M}$ from $x_{0}$ to $x_{1}$ . Unless specified otherwise, whenever we talk about the curve $\gamma_{s}(t)$ we mean the curve $\gamma_{s}(t)$ as a function of $t$ for a fixed $s$ . We write $\gamma^{\prime}_{s}(t)=\partial_{t}\gamma_{s}(t)$ to refer to the derivative of the curve with respect to $t$ .

Before studying the variations of Hamiltonian curves, to give some high-level intuition, we start with variations of geodesics. More precisely, suppose $\gamma_{s}(t)$ is a variation of geodesics, i.e. $\gamma_{s}(t)$ is a geodesic in $t$ for every fixed $s\in[0,s^{\prime}]$ (recall that the curve $\gamma_{s}(0)$ in the parameter $s$ is also a geodesic from $x_{0}$ to $x_{1}$ ). For brevity, we sometimes refer to the curve $\gamma_{0}(t)$ by $\gamma(t)$ . To see how fast the geodesics $\gamma_{s}(t)$ change as a function of $s$ at $s=0$ , for a fixed $t$ we take the derivative of $\gamma_{s}(t)$ with respect to $s$ at $s=0$ ; this gives us a vector field $J(t)$ along $\gamma_{0}(t)$ :

$$J(t)\triangleq\partial_{s}\gamma_{s}(t)\Big{|}_{s=0}.$$

This vector field, called a Jacobi field, is a fundamental object in studying the variations of geodesics. Importantly, one can write a second-order ODE to describe how $J(t)$ evolves along the geodesic given the initial conditions $J(0),J^{\prime}(0)$ :

[TABLE]

where the second derivative $J^{\prime\prime}(t)$ is taken using the covariant derivative on the manifold with respect to $\gamma^{\prime}_{0}(t)$ , i.e., $D_{t}\triangleq\nabla_{\gamma^{\prime}_{0}(t)}$ , and $R$ is the Riemann tensor. We will shortly provide some intuition on the role of the Riemann tensor in the behavior of geodesics. An important point to observe here is that the covariant derivative of $J$ at $t=0$ is equal to the covariant derivative of the initial velocity of the geodesic, namely $\frac{d}{dt}\gamma_{s}(t)$ , with respect to $s$ (see Lemma A.4 for a proof):

[TABLE]

So the initial values that uniquely specify the Jacobi field $J$ are $J(0)$ , which specifies how fast we change the starting point of the geodesic, and $D_{s}v_{\gamma_{s}}$ , which is how fast we change the initial velocity of the geodesic. This means that one can study the Jacobi field ODE to obtain estimates on how fast the initial velocity should change along the geodesic from $x_{0}$ to $x_{1}$ , for this family of curves with the same destination $y$ . Now consider a direction $e$ perpendicular to the velocity $\gamma^{\prime}(t)=\gamma^{\prime}_{0}(t)$ of the geodesic at time $t$ , i.e., $\langle\gamma^{\prime}(t),e\rangle_{g}=0$ . Taking the inner product of the vector $R(e,\gamma^{\prime}(t))\gamma^{\prime}(t)$ on the right hand side of the Jacobi field ODE in (10) with $e$ itself, the quantity $\langle e,R(e,\gamma^{\prime}(t))\gamma^{\prime}(t)\rangle$ intuitively measures how much the Jacobi field is growing or shrinking in direction $e$ , that is, whether the geodesics $\gamma_{s}(t)$ parameterized by $s$ are diverging or converging in direction $e$ at $s=0$ . This quantity is known as the sectional curvature of the plane spanned by $e$ and $\gamma^{\prime}(t)$ . Now consider a unit orthonormal parallelepiped at time $t=0$ , given by a set of orthonormal vectors $\{e_{i}\}_{i=1}^{n}$ in the tangent space of $\gamma(0)$ , where $e_{1}=\gamma^{\prime}(0)$ , and look at the evolution of its volume along the geodesic when each $e_{i}$ evolves according to the Jacobi equation; in each direction $e_{i}$ , the parallelepiped is either expanding or shrinking, depending on whether the geodesics are diverging or converging in that direction, which in turn depends on the sign of the sectional curvature $\langle e_{i},R(e_{i},\gamma^{\prime}(0))\gamma^{\prime}(0)\rangle$ . 
Indeed, one can characterize the rate of change of this parallelepiped along the geodesic by summing the sectional curvatures for all $\{e_{i}\}_{i=2}^{n}$ ; this is the Ricci curvature of the manifold at $\gamma(0)$ in the direction $\gamma^{\prime}(0)$ :

[TABLE]

Note that the Ricci curvature is nothing but the trace of the Riemann tensor $R(.,\gamma^{\prime}(0))\gamma^{\prime}(0)$ . On the other hand, the determinant of the Jacobian $J^{v_{\gamma_{s}}}_{y}$ of the Hamiltonian map, a quantity of our interest to bound the change of density from $x_{0}$ to $x_{1}$ , can be characterized by the ratio of the volume of this parallelepiped at the beginning and the ending time $t$ . Indeed, we see later on that the log determinant of $J^{v_{\gamma_{s}}}_{y}$ can be written as a time-weighted integral of the Ricci curvature along the geodesic.
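To make the role of curvature in the preceding discussion concrete, here is a small sketch for the simplest case of constant sectional curvature $K$, where the magnitude $j(t)$ of a normal Jacobi field along a unit-speed geodesic satisfies the scalar ODE $j^{\prime\prime}+Kj=0$. This is a standard textbook model and not the manifold studied in this paper; we assume the common sign convention in which the unit sphere has curvature $+1$, and the function name `jacobi_magnitude` is hypothetical.

```python
import numpy as np

def jacobi_magnitude(K, t_end, steps=2000):
    """Integrate j'' = -K j with j(0) = 0, j'(0) = 1 using classical RK4."""
    h = t_end / steps
    y = np.array([0.0, 1.0])                      # state (j, j')

    def f(state):
        # First-order form of the scalar Jacobi equation.
        return np.array([state[1], -K * state[0]])

    for _ in range(steps):
        k1 = f(y)
        k2 = f(y + 0.5 * h * k1)
        k3 = f(y + 0.5 * h * k2)
        k4 = f(y + h * k3)
        y = y + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return y[0]

# K > 0 (sphere): nearby geodesics reconverge, j(t) = sin(t) for K = 1.
# K = 0 (flat):   they separate linearly, j(t) = t.
# K < 0 (hyperbolic): they diverge exponentially, j(t) = sinh(t) for K = -1.
for K in (1.0, 0.0, -1.0):
    print(K, jacobi_magnitude(K, 1.0))
```

In this constant-curvature model, summing the sectional curvature over the $n-1$ directions orthogonal to a unit velocity gives Ricci curvature $(n-1)K$ in that direction.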

One can extend these arguments to variations of Hamiltonian curves instead of geodesics. As a result, instead of the Riemann tensor in the Jacobi fields Equation (10), we end up with a slightly different operator $\Phi(t)$ which can be decomposed into a “geometric part,” the Riemann tensor, and a “bias part,” $M_{x}$ , which comes from the derivative of the Hamiltonian bias $\mu(x)$ , defined in Equation (8). We define this fundamental operator rigorously.

Definition 3 (Operators $\Phi$ and $M_{x}$ ).

At any point $x\in\mathcal{M}$ , we define the operator $M_{x}$ as

[TABLE]

where $\nabla$ is the covariant derivative on the manifold and $\mu$ is the Hamiltonian bias. Given the Hamiltonian curve $\gamma(t)$ , we define the operator $\Phi(t)$ on the tangent space $T_{\gamma(t)}(\mathcal{M})$ as

[TABLE]

where $R$ is the Riemann tensor.

Similar to Jacobi fields, for a given family of Hamiltonian curves $(\gamma_{s}(t))$ , one can write a second order ODE for the variational vector field $\tilde{J}(t)=\frac{d}{ds}\gamma_{s}(t)$ along the Hamiltonian curve, which depends on operator $\Phi$ (for the proof see Appendix B):

Lemma 1.6 (ODE for Hamiltonian fields).

Given a family of Hamiltonian curves $\big{(}\gamma_{s}(t)\big{)}$ , the vector field $\tilde{J}(t)\triangleq\partial_{s}\gamma_{s}(t)\Big{|}_{s=0}$ is characterized by the following second order ODE:

[TABLE]

where $\Phi(t)$ is defined in Definition 3. We refer to $\tilde{J}$ as a Hamiltonian field.

The difference between the ODE of Hamiltonian fields (12) and that of Jacobi fields (10) comes from the fact that the primary Hamiltonian Equation (7) includes an additional bias vector $\mu$ compared to the geodesic Equation (9).

Now similar to the case of variations of geodesics, for variation of Hamiltonian curves, the log determinant of the Jacobian of the Hamiltonian map $J_{y}^{v_{\gamma_{s}}}$ can be characterized by a weighted integral of the trace of $\Phi(t)$ instead of the Ricci tensor. Therefore, to study the rate of change of $\det(J_{y}^{v_{\gamma_{s}}})$ as we move from $x_{0}$ to $x_{1}$ , we need to study the rate of change of $\texttt{tr}(\Phi(t))$ along the variation of Hamiltonian curves $(\gamma_{s}(t))$ , which in turn depends on the rate of change of the Ricci tensor and the trace of operator $M_{x}$ , the two parts of the operator $\Phi(t)$ . These ideas are formalized as the $(R_{1},R_{2},R_{3})$ -normality of the Hamiltonian curve in the definition below.

Definition 4.

We say a Hamiltonian curve $\gamma(t)$ is $(R_{1},R_{2},R_{3})$ -normal up to time $\delta$ if for all $0\leq t\leq\delta$ it satisfies the following:

•

Bound on the Frobenius norm of $\Phi$ (with respect to the metric $g$ ):

[TABLE]

•

For all times $0\leq t\leq\delta$ and unit direction $z$ in the tangent space of $\gamma(t)$ :

[TABLE]

•

For $\zeta(t)$ defined as the parallel transport of $\gamma^{\prime}(0)$ along the curve:

[TABLE]

Parallel transport of a vector on the manifold is a generalization of shifting vectors in Euclidean space, using the covariant derivative of the manifold (see Appendix A for the rigorous definition). In order to show the $(R_{1},R_{2},R_{3})$ -normal property for the family of Hamiltonian curves, we need to define a more fundamental regularity condition for the Hamiltonian curves, which states that both the $\|.\|_{g}$ and $\|.\|_{x,\infty}$ norms of the tangent vector remain small along the Hamiltonian curve.

Definition 5 (Nice Hamiltonian curve).

We say a Hamiltonian curve $\gamma(t)$ is $(c,\delta)$ -nice if for $0\leq t\leq\delta$ :

[TABLE]

In order to show the closeness of the one-step distributions from $x_{0}$ and $x_{1}$ , we need the $(R_{1},R_{2},R_{3})$ -normality of Definition 4 for the family of Hamiltonian curves $(\gamma_{s}(t))$ for all $0\leq s\leq\delta$ . Therefore, we need to show that the $(c,\delta)$ -niceness property is stable for our hybrid barrier. We show this in Lemma 1.7, proved in Section 6. Our $(c,\delta)$ -niceness framework is simpler and more general, and it avoids the technical machinery of auxiliary functions on curves used in [21], which requires bounding additional parameters.

Lemma 1.7 (Stability of norms).

In the same setting as Theorem 1.8, given a family of Hamiltonian curves $\gamma_{s}(t)$ for which $\gamma_{0}(t)$ is $(c,\delta)$ -nice for

[TABLE]

then $(\gamma_{s}(t))$ is a $(O(c),\delta)$ -nice family of Hamiltonian curves in the interval $s\in(0,\delta)$ .

A major part of our contribution is that we relate this abstract notion of $(R_{1},R_{2},R_{3})$ -normality to (a generalized notion of) metric self-concordance or Calabi-type estimates, which (1) crucially uses a different notion of norm to bound the derivatives of the metric and (2) needs to be satisfied for higher derivatives of the metric up to third order. Our framework can potentially be reused on other manifolds and distributions.

Theorem 1.8 (Smoothness).

Given a Hessian manifold defined by the metric $g=\nabla^{2}\phi$ for our hybrid barrier (see Definition 1) for $p<4$ , define a Hamiltonian curve $\gamma(t)$ by the ODE in Equation (7) with target log density $f=\alpha\phi$ . If $\gamma$ is $(c,\delta)$ -nice (see Definition 5), then it is also $(R_{1},R_{2},R_{3})$ -normal with parameters

[TABLE]

Proof.

The result follows from the key Lemmas 5.1, 5.5, and 5.15. ∎

To understand the effect of self-concordance on the density of the push-forward measure, note that the more slowly the metric changes, the more slowly the geodesics converge to or diverge from one another, so we have smaller scalar and Ricci curvatures. As an example, one can see that the Ricci curvature $\texttt{Ricci}(\gamma^{\prime}(t),\gamma^{\prime}(t))$ can be written formally using the metric and its first derivative on Hessian manifolds (see Equation (90)). As a result, the rate of change of the Ricci tensor, which corresponds to the $R_{2}$ parameter in Definition 4, depends on the derivatives of the metric $g$ up to second order, and in particular can be bounded efficiently given that the metric satisfies some form of second-order self-concordance. In this regard, a natural question is the following: in which norm should we measure the self-concordance of the metric?

A key point to notice here is that in measuring the change of $\texttt{Ricci}(\gamma^{\prime}(t),\gamma^{\prime}(t))$ , the Ricci tensor itself involves the change of the metric in a random direction, as we can show that $\gamma^{\prime}(t)$ , the tangent of the Hamiltonian curve, is distributed as a Gaussian. Now if one uses the conventional framework of self-concordance in optimization, which measures the derivative of the metric in direction $v$ with respect to its local norm $\|v\|_{g}$ , then the typical value of the quantity $\|\gamma^{\prime}(t)\|_{g}$ is of order $\sqrt{n}$ . This is a major reason we choose to measure self-concordance in the $\|.\|_{x,\infty}$ norm, which is $\tilde{O}(1)$ for a typical Gaussian vector $\mathcal{N}(0,g^{-1})$ . Importantly, we use our third-order infinity norm self-concordance in Lemma 1.4 in a black-box manner to show the $(R_{1},R_{2},R_{3})$ -normality of the Hamiltonian curve. On the other hand, even though the log barrier satisfies this type of self-concordance with respect to $\|.\|_{x,\infty}$ , it does not approximate the local geometry of the polytope well, which results in poor isoperimetry and slow mixing. For this reason, we develop infinity-norm self-concordance for the $p$ -Lewis weights barrier, whose local ellipsoids are better approximations of the symmetrized polytope. Our approach to developing the infinity norm self-concordance estimates crucially depends on $p<4$ . Therefore, to further enhance the isoperimetry of the metric, we regularize the Lewis weights barrier with the log barrier, which results in our final hybrid barrier in Equation (4).

Structure of the paper.

The rest of the paper is organized as follows. In Section 2, we discuss the basic tools and notation that we use throughout the paper. In Section 3, we give our proof of second-order infinity norm self-concordance estimates for the Lewis weights barrier (we defer the proof of strong third-order self-concordance to Appendix C). In Section 4, we bound the mixing time by combining multiple components, namely the stability of the Hamiltonian curves, the isoperimetry of the stationary distribution with respect to the chosen metric, and the smoothness of the manifold with our hybrid barrier. We relate the change of density of the Markov kernel between two points to the smoothness of the manifold. In Section 5, we show how we use infinity-norm third-order self-concordance to control the smoothness of the metric. Namely, we bound the norm of an important operator $\Phi$ related to the Riemann tensor and the Hamiltonian potential, which appears in the ODE of variations of Hamiltonian curves (parameter $R_{1}$ ). To bound the determinant of the Jacobian of the RHMC map, which is a component in the pushforward density of the Gaussian distribution in the tangent space onto the manifold, we bound the rate of change of the trace of $\Phi$ , which includes the Ricci tensor (parameter $R_{2}$ ) and another component originating from the Hamiltonian bias $\mu$ . Finally, we bound the norm of $\Phi$ applied to the initial velocity of the Hamiltonian curve parallel transported along the curve. In Section 6, we prove the stability of the smoothness properties of the Hamiltonian curves as we start varying the initial location and velocity of the curve. In Section 7, we prove an isoperimetric inequality on the Riemannian manifold $\mathcal{M}$ equipped with metric $g$ , the Hessian of our hybrid barrier. In Appendix A, we give some background on differential geometry. 
In Appendix B, we describe how to derive the second order Hamiltonian ODE based on the covariant derivative on the manifold. In Appendix C, we show the infinity-norm third-order self-concordance of the metric for our hybrid barrier (4). Appendix D is devoted to obtaining spectral bounds for the derivatives of our metric, which involves the Lewis weights and their derivatives, and which we use in our self-concordance arguments. Finally, in Appendix E we include missing proofs.

2 Preliminaries

To work with the metric $g$ imposed by our hybrid barrier $\phi$ , it is convenient to rescale the rows of the LP matrix $\mathrm{A}$ by the slack variables, namely we define

[TABLE]

In our equations, the Hadamard product of matrices takes higher precedence, namely $AB\odot C$ is equivalent to $A(B\odot C)$ . We refer to the $p$ -Lewis weights vector of $A_{x}$ by $w_{x}$ and its diagonal matrix version by $\mathbf{W}_{x}\triangleq\texttt{Diag}\big{(}{w_{x}}\big{)}$ . To work with a vector $v$ in the tangent space of $x$ , there is an important reparameterization of $v$ defined as

[TABLE]

Define the log barrier by $\phi_{\ell}$ :

[TABLE]

We denote the Hessian of the log barrier by $g_{2}=\nabla^{2}\phi_{\ell}(x)$ . We view $g_{2}$ as a metric inside the polytope, such that for $v\in\mathbb{R}^{n}$ it defines a local norm $\|v\|_{g_{2}}^{2}=v^{\top}g_{2}v$ . It is easy to check that the squared norm of a vector $v$ with respect to $g_{2}$ , i.e. $v^{\top}g_{2}v$ , is given by the squared $\ell_{2}$ norm of the reparameterized vector $s_{x,v}$ defined in Equation (13).

[TABLE]
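The identity above can be checked numerically. The sketch below is only an illustration under our own assumptions: the polytope is written as $\{y: Ay<b\}$ with random data, the log barrier is taken unscaled as $-\sum_{i}\log(b_{i}-a_{i}^{\top}x)$, and the helper names (`log_barrier`, `hessian_fd`) are hypothetical. It compares a finite-difference Hessian with $A_{x}^{\top}A_{x}$ and then compares $v^{\top}g_{2}v$ with $\|s_{x,v}\|_{2}^{2}$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 30, 4
A = rng.standard_normal((m, n))
b = rng.uniform(1.0, 2.0, size=m)        # x near 0 is strictly feasible for A y < b

def log_barrier(x):
    """Unscaled log barrier of the assumed polytope {y : A y < b}."""
    return -np.sum(np.log(b - A @ x))

def hessian_fd(f, x, h=1e-4):
    """Central finite-difference approximation of the Hessian of f at x."""
    d = x.size
    H = np.zeros((d, d))
    E = np.eye(d)
    for i in range(d):
        for j in range(d):
            H[i, j] = (f(x + h * E[i] + h * E[j]) - f(x + h * E[i] - h * E[j])
                       - f(x - h * E[i] + h * E[j]) + f(x - h * E[i] - h * E[j])) / (4 * h * h)
    return H

x = np.full(n, 0.01)
A_x = A / (b - A @ x)[:, None]           # rows of A rescaled by the slacks
g = A_x.T @ A_x                          # closed-form Hessian of the log barrier

v = rng.standard_normal(n)
s_xv = A_x @ v                           # reparameterized vector s_{x,v}
print(v @ g @ v, np.sum(s_xv ** 2))      # the two values agree
```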

For a given point $x$ inside the polytope $\mathcal{P}$ , we define the symmetrized polytope around $x$ as follows: we reflect $\mathcal{P}$ around $x$ and intersect the result with $\mathcal{P}$ , namely $\mathcal{P}\cap 2x-\mathcal{P}$ , as illustrated in Figure 1(a). The approximation of the symmetrized body by the ellipsoids corresponding to the Hessian of the barrier function plays a key role in bounding the isoperimetry constant, as we describe in Section 7.
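The symmetrized polytope and the local infinity norm are two views of the same object: a point $y$ lies in $\mathcal{P}\cap 2x-\mathcal{P}$ exactly when the slack-normalized displacement of $y-x$ is below one in every facet direction. The following minimal sketch checks this on the unit square; the representation $\{y: Ay<b\}$ and all function names are our own assumptions for illustration.

```python
import numpy as np

def local_inf_norm(A, b, x, v):
    """max_i |a_i^T v| / (b_i - a_i^T x) for the assumed polytope {y : A y < b}."""
    slack = b - A @ x
    return float(np.max(np.abs(A @ v) / slack))

def in_polytope(A, b, y):
    return bool(np.all(A @ y < b))

def in_symmetrized(A, b, x, y):
    """y lies in P and its reflection 2x - y through x also lies in P."""
    return in_polytope(A, b, y) and in_polytope(A, b, 2 * x - y)

# Unit square (0,1)^2 written as A y < b, with center point x.
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.array([1.0, 0.0, 1.0, 0.0])
x = np.array([0.25, 0.5])

# Compare membership with the norm test on random points around the square.
rng = np.random.default_rng(2)
ys = rng.uniform(-0.5, 1.5, size=(1000, 2))
agree = all(
    in_symmetrized(A, b, x, y) == (local_inf_norm(A, b, x, y - x) < 1.0)
    for y in ys
)
print(agree)
```

Here the symmetrized body around $x=(0.25,0.5)$ is the box $(0,0.5)\times(0,1)$, which is smaller than the square because $x$ is close to the left facet.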

2.1 John Ellipsoid and Lewis weights

Proving good isoperimetry for a specific barrier can be reduced to how well the ellipsoids corresponding to the Hessian of the barrier at each point $x$ inside the polytope approximate the symmetrized polytope around $x$ . A natural way to approximate a symmetric polytope is via its John Ellipsoid, i.e. the ellipsoid of maximum volume contained in the polytope. Parametrizing the John ellipsoid as $A_{x}^{\top}WA_{x}$ for a positive diagonal matrix $W$ , i.e., a weighted sum of the outer product of the rows of $A_{x}$ , the weights are characterized by the following optimization problem:

[TABLE]

where $\mathrm{W}=\texttt{Diag}\big{(}{w}\big{)}$ is the diagonal matrix corresponding to the vector $w$ . The John ellipsoid approximates the symmetrized polytope in the sense that (1) it is contained in the symmetrized polytope and (2) scaling it up by $\sqrt{n}$ makes it contain the symmetrized polytope.

On the other hand, in order to prove smoothness of the HMC curves, we need to pick a barrier whose Hessian does not change too fast as a function of $x$ . Unfortunately, the John ellipsoid is not stable. In particular, the weights $\mathrm{W}$ which maximize (14) are not even continuous with respect to $x$ . An alternative is to use the $p$ -Lewis weights to define the ellipsoid, obtained as the solution to a relaxation of the program in (14):

[TABLE]

where $\mathrm{W}=\texttt{Diag}\big{(}{w}\big{)}$ . Moreover, the optimal value of the program in (15) is called the $p$ -Lewis weights barrier at $x$ , as defined next.

Definition 6 (Lewis weights barrier).

The $p$ -Lewis weights barrier can be defined as the solution of the following optimization problem:

[TABLE]

Let $g_{1}=\nabla^{2}\phi_{p}$ be the metric defined by the Hessian of the $p$ -Lewis weights barrier. It is known (Lemma 31 in [19]) that the ellipsoid corresponding to $g_{1}$ is roughly the same as the one defined by the Lewis weights, i.e. $\mathrm{A}_{x}^{\top}\mathbf{W}_{x}\mathrm{A}_{x}$ .

Lemma 2.1 (Lewis weights metric).

For the Lewis weight barrier $\phi_{p}$ we can bound the local norm of its Hessian as

[TABLE]

where for a vector $s_{x,v}\in\mathbb{R}^{m}$ ,

[TABLE]

Equivalently

[TABLE]

Next, we define another important local norm at a point $x$ inside the polytope:

[TABLE]

This norm plays a key role in our definition of strong self-concordance in Equation (5). For any point $x$ inside the polytope, we define $\mathbf{P}_{x}$ to be the projection matrix of $\mathrm{A}_{x}$ reweighted by $\mathbf{W}_{x}^{1-2/p}$ .

Definition 7 (Projection matrix).

We define the projection matrix $\mathbf{P}_{x}$ , implicitly depending on $x$ , as

[TABLE]

where $\mathbf{W}_{x}$ is the $p$ -Lewis weights calculated at $x$ . Moreover, we denote the Hadamard square $\mathbf{P}_{x}^{\odot 2}$ of the projection matrix by $P^{(2)}$ :

[TABLE]
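For intuition on how Lewis weights relate to the projection matrix, the sketch below computes $p$-Lewis weights of a generic matrix with the simple fixed-point iteration $w_{i}\leftarrow\big(a_{i}^{\top}(A^{\top}W^{1-2/p}A)^{-1}a_{i}\big)^{p/2}$ known from the Lewis weights literature, and then checks that the weights coincide with the diagonal of the projection onto the column space of $W^{1/2-1/p}A$. This is our illustration and not the procedure used in this paper; the matrix `A` stands in for $\mathrm{A}_{x}$, the iteration count is a guess, and the function names are hypothetical.

```python
import numpy as np

def leverage_scores(M):
    """Diagonal of the orthogonal projection onto the column space of M."""
    Q, _ = np.linalg.qr(M)               # reduced QR: Q has orthonormal columns
    return np.sum(Q * Q, axis=1)

def lewis_weights(A, p, iters=100):
    """Fixed-point iteration w_i <- (a_i^T (A^T W^{1-2/p} A)^{-1} a_i)^{p/2}."""
    m, n = A.shape
    w = np.full(m, n / m)                # start from uniform weights summing to n
    for _ in range(iters):
        G = A.T @ (w[:, None] ** (1 - 2 / p) * A)
        tau = np.einsum("ij,ij->i", A @ np.linalg.inv(G), A)   # a_i^T G^{-1} a_i
        w = tau ** (p / 2)
    return w

rng = np.random.default_rng(3)
m, n, p = 50, 5, 3.0                     # p < 4, where this iteration contracts
A = rng.standard_normal((m, n))
w = lewis_weights(A, p)

# At the fixed point, w equals the leverage scores of W^{1/2 - 1/p} A,
# i.e. the diagonal of the reweighted projection matrix, and sums to n.
sigma = leverage_scores(w[:, None] ** (0.5 - 1 / p) * A)
print(w.sum(), np.max(np.abs(sigma - w)))
```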

To show the estimates in Lemma 1.4 for the $p$ -Lewis weights barrier $\phi_{p}$ , we need to calculate the derivatives of the Lewis weights. The following lemma presents the form of the Jacobian of the Lewis weights as a function of $x$ , by taking its directional derivative in direction $v$ .

Lemma 2.2 (Derivative of the Lewis weights).

For arbitrary direction $v\in\mathbb{R}^{n}$ , the directional derivative $\mathrm{D}\mathbf{W}_{x}(v)$ can be calculated as

[TABLE]

where we define

[TABLE]

Due to the importance and repetition of the vector $\Lambda\mathbf{G}_{x}^{-1}\mathbf{W}_{x}s_{v}$ in our calculations later on, we give it a separate notation

[TABLE]

Then, the derivative of $\mathbf{W}_{x}$ can be written as

[TABLE]

In the above Lemma, note that $\mathbf{\Lambda}_{x}$ , $\mathbf{G}_{x}$ , $r_{x,v}$ , and ${\mathrm{R}_{x,v}}$ are all functions of the location variable $x$ , but we drop $x$ for clarity in our calculations. Furthermore, when $v$ is clear from the context, we denote $\mathrm{D}\mathbf{W}_{x}(v)$ in short by $\mathbf{W^{\prime}}_{x,v}$ . Next, we calculate the derivative of the projection matrix $\mathbf{P}_{x}$ onto the column space of $\mathbf{W}_{x}^{1/2-1/p}\mathrm{A}_{x}$ which is appropriately reweighted by the Lewis weights, as defined in Definition 7.

Lemma 2.3 (Derivative of the projection matrix).

The derivative of the projection matrix $\mathbf{P}_{x}=\mathbf{P}(\mathbf{W}_{x}^{1/2-1/p}\mathrm{A}_{x})$ in direction $v$ is given by

[TABLE]

where ${\mathrm{R}_{x,v}}$ is defined in Equation (19). When $v$ is clear from the context, we refer to $\mathbf{P}_{x}{\mathrm{R}_{x,v}}\mathbf{P}_{x}$ by $\mathbf{\tilde{P}}_{x,v}$ for brevity. Moreover, controlling the spectral norm of the diagonal matrix ${\mathrm{R}_{x,v}}=\texttt{Diag}\big{(}{\mathbf{G}_{x}^{-1}\mathbf{W}_{x}s_{x,v}}\big{)}$ by the infinity norm of $s_{x,v}$ is one of the key ideas that allows us to break the mixing time.

To reduce notation, in the proof we also make the dependence of $A_{x}$ on $x$ implicit and drop the index $x$ .

We denote the target probability distribution inside the polytope by $\pi(.)$ . We use $g$ for the Hessian of our hybrid barrier $\phi$ . We refer to the Hessian of the $p$ -Lewis weights barrier before rescaling by $g_{1}$ , and the Hessian of the $n/m$ -scaled log barrier by $g_{2}$ , i.e.

[TABLE]

Throughout the proof, we use the notation $\lesssim$ to denote an inequality that ignores logarithmic factors. We use $D$ for the Euclidean derivative and $\nabla$ and $D_{t}$ for covariant differentiation with respect to the metric structure on the manifold. Moreover, we use $\preccurlyeq$ to denote Löwner inequalities up to universal constants.

2.2 Markov chains

For a Markov chain with state space $\mathcal{M}$ , stationary distribution $Q$ and next step distribution $p_{u}(\cdot)$ for any $u\in\mathcal{M}$ , the conductance of the Markov chain is defined as

[TABLE]

The conductance of an ergodic Markov chain allows us to bound its mixing time, i.e., the rate of convergence to its stationary distribution, e.g., via the following theorem of Lovász and Simonovits.

Theorem 2.4.

Let $Q_{t}$ be the distribution of the current point after $t$ steps of a Markov chain with stationary distribution $q$ and conductance at least $\Phi$ , starting from initial distribution $Q_{0}$ . For any $\varepsilon>0$ ,

[TABLE]
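To make the notion of conductance concrete in the simplest setting, the following sketch computes it by brute force for a small finite chain, using the usual finite-state form of the definition: the minimum over sets $S$ with $0<Q(S)\leq 1/2$ of the stationary flow from $S$ to its complement divided by $Q(S)$. The lazy random walk on a cycle is our toy example, not a chain from this paper, and the function name `conductance` is hypothetical.

```python
import itertools
import numpy as np

def conductance(P, pi):
    """Brute-force conductance of a finite chain with transition matrix P and stationary pi."""
    N = len(pi)
    best = np.inf
    for k in range(1, N):
        for subset in itertools.combinations(range(N), k):
            S = list(subset)
            mass = pi[S].sum()
            if mass > 0.5 + 1e-12:       # only sets with stationary mass at most 1/2
                continue
            comp = [j for j in range(N) if j not in subset]
            # Probability of being in S and stepping out of S in one move.
            flow = (pi[S][:, None] * P[np.ix_(S, comp)]).sum()
            best = min(best, flow / mass)
    return best

# Lazy random walk on a cycle with N states: stay w.p. 1/2, move to each neighbor w.p. 1/4.
N = 8
P = 0.5 * np.eye(N)
for i in range(N):
    P[i, (i + 1) % N] += 0.25
    P[i, (i - 1) % N] += 0.25
pi = np.full(N, 1.0 / N)                 # uniform stationary distribution

phi = conductance(P, pi)
print(phi)
```

For this chain the bottleneck is an arc containing half of the states, which has only two boundary edges, so the conductance shrinks as the cycle grows.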

To bound the conductance, we will reduce it to geometric isoperimetry.

Definition 8.

The isoperimetry of a metric space $\mathcal{M}$ with target distribution $\pi$ is

[TABLE]

where $d$ is the shortest path distance in $\mathcal{M}$ .

For a proof of the following lemma, see e.g., [28].

Lemma 2.5.

Given a metric space $\mathcal{M}$ and a time-reversible Markov chain $p$ on $\mathcal{M}$ with stationary distribution $Q$ , fix any $r>0$ and suppose that for any $x,y\in\mathcal{M}$ with $d(x,y)<r$ , we have $d_{TV}(p_{x},p_{y})\leq 0.9$ . Then, the conductance of the Markov chain is $\Omega(r\psi)$ .

We will need a more refined notion of $s$ -conductance, to be able to ignore small subsets when proving isoperimetry.

Definition 9 ( $s$ -conductance).

Consider a Markov chain with a state space $\mathcal{M}$ , a transition distribution $\mathcal{T}_{x}$ and stationary distribution $\pi$ . For any $s\in[0,1/2)$ , the $s$ -conductance of the Markov chain is defined by

[TABLE]

A lower bound on the $s$ -conductance of a Markov chain leads to an upper bound on its mixing rate.

Lemma 2.6.

[23] Let $\pi_{t}$ be the distribution of the points obtained after $t$ steps of a lazy reversible Markov chain with the stationary distribution $\pi$ . For $0<s\leq 1/2$ and $H_{s}=\sup\{|\pi_{0}(A)-\pi(A)|:A\subset\mathcal{M},\,\pi(A)\leq s\}$ , it follows that

[TABLE]

The following theorem (see [15]) illustrates how one-step coupling with the isoperimetry leads to a lower bound on the $s$ -conductance. Its proof is similar to that of Lemma 13 in [21] and can be found in full detail in Appendix E.1.

Theorem 2.7.

For a Riemannian manifold $(\mathcal{M},g)$ , let $\pi$ be the stationary distribution of a reversible Markov chain on $\mathcal{M}$ with a transition distribution $\mathcal{T}_{x}$ . Let ${\mathcal{M}}^{\prime}\subset{\mathcal{M}}$ be a subset with $\pi({\mathcal{M}}^{\prime})\geq 1-\rho$ for some $\rho<\frac{1}{2}$ . We assume the following one-step coupling: if $d_{g}(x,x^{\prime})\leq\Delta\leq 1$ for $x\in{\mathcal{M}}^{\prime}$ , then $d_{TV}(\mathcal{T}_{x},\mathcal{T}_{x^{\prime}})\leq 0.9$ . Then for any $\rho/(\Delta\psi_{\mathcal{M}})\leq s<\frac{1}{2}$ and given $\psi_{\mathcal{M}}\Delta\leq 1/2$ , the $s$ -conductance is bounded below by

[TABLE]

3 Hybrid barrier metric and second-order self-concordance

The goal of this section is to prove the strong self-concordance properties of our hybrid barrier as defined in Lemma 1.3. We start by developing some basic properties of Lewis weights, the corresponding metric, and their derivatives, which we exploit throughout the proof. For sake of clarity of the calculations, we denote the matrix $\mathbf{P}_{x}{\mathrm{R}_{x,v}}\mathbf{P}_{x}$ regarding vector $v$ , which will appear a number of times by $\mathbf{\tilde{P}}_{x,v}$ . Here we show the infinity norm self-concordance for the first and second order derivative of the metric as a warm up. For the proof of our third order self-concordance, we refer the reader to section C. In this section, for sake of brevity and clarity of the proof, we do not track the constants (which depends on $\frac{1}{4/p-1}$ ) and all of our inequalities $\lesssim,\preccurlyeq$ are up to log factors.

The following lemma is proved in Appendix E.2.

Lemma 3.1 ( $p$ -Lewis-weight metric).

The $p$-Lewis weight metric $g_{1}=\nabla^{2}\log\det{\left(\mathrm{A}_{x}^{T}\mathbf{W}_{x}^{1-2/p}\mathrm{A}_{x}\right)}$ can be written in the following form

[TABLE]

or alternatively

[TABLE]

In the following lemma we state a vital $\|.\|_{\infty\rightarrow\infty}$ norm bound for the matrix $\mathbf{G}_{x}^{-1}\mathbf{W}_{x}$, which enables us to obtain Löwner inequalities by pulling out the $\|.\|_{x,\infty}$ norm of $v$, the direction of the derivative. Note that the condition $p<4$ is vital for this norm bound.

Lemma 3.2 (Operator infinity norm bound).

For $y=\mathbf{G}_{x}^{-1}\mathbf{W}_{x}s$ , given any vector $s$ and $p<4$ , we have

[TABLE]

Proof.

The proof can be found in Appendix D.1. ∎

Next, we state a lemma regarding the expansion of the directional derivative of the Lewis weights metric $g_{1}$ .

Lemma 3.3 (Derivative of the $p$ -Lewis weights metric).

Given arbitrary direction $v$ , we have

[TABLE]

We have numbered the terms above by $(\triangleright\dots)$ to refer to them later on.

In order to show the first-, second-, and third-order self-concordance of our metric, we need to control the terms above as well as their first and second derivatives. We give the proof of first- and second-order self-concordance in this section and defer the proof of third-order self-concordance to Appendix C. We start with a lemma that computes the derivative of the $(\triangleright 4)$ term above. Ultimately, we derive spectral bounds for each of the terms in these derivatives. We do not track constants and factors of $p$ in these calculations (note that with the choice $p=4-1/\log(m)$ these factors are at most polylogarithmic), which simplifies the calculations.

Lemma 3.4.

The derivative of the term $(\triangleright 4)$ in Equation (22) in direction $z$ is given by (up to constants)

[TABLE]

where in the last term $\mathrm{D}({\mathrm{R}_{x,v}})(z)$ we are considering $v$ as a fixed vector (i.e. the derivative in direction $z$ does not hit $v$ ).

Proof.

Follows from ordinary differentiation and applying Lemma 2.3. ∎

In order to get a handle on these matrices via the Löwner ordering, we derive various stability lemmas for the Lewis weights, their related matrices $\mathbf{\Lambda}_{x}$, $\mathbf{G}_{x}$, etc., and their derivatives. For example, we show the following third-order self-concordance type property for the Lewis weights themselves. The following lemma is proved in Appendix D.2 as Lemma D.12.

Lemma 3.5 (Third derivative bound for Lewis weights).

We have

[TABLE]

where recall $\mathbf{W^{\prime}}_{x,v}=\mathrm{D}\mathbf{W}_{x}(v)$ .

Recall that the symbol $\preccurlyeq$ means Löwner order up to a constant factor. For more details and the proofs, we refer the reader to Appendix D. Next, we proceed to show our first- and second-order strong self-concordance for the Lewis weight barrier. Note that strong self-concordance is easily checked for the log barrier, so the major remaining challenge is to prove it for the Lewis weights barrier. The general theme of the proof is that we pull out the infinity norm of the directional derivative vectors $v,w,u$ from the tensors that are generated as a result of differentiation. This requires us to develop estimates on various fundamental matrix quantities that we defined in section 2, namely $\mathbf{G}_{x},\mathbf{\Lambda}_{x},{\mathrm{R}_{x,v}}$ at any point $x$ inside the polytope. Importantly, we develop these estimates with respect to the $\|.\|_{x,\infty}$ norm instead of the usual metric norm $\|.\|_{g}$ , which crucially requires $p<4$ . This constraint on $p$ has its root in controlling the $\|.\|_{\infty\rightarrow\infty}$ norm of the matrix $\mathbf{G}_{x}^{-1}\mathbf{W}_{x}$ in Lemma D.1.

Lemma 3.6 (First order infinity norm self-concordance).

For a direction $v$ we have

[TABLE]

Proof.

Direct consequence of Lemmas D.4 and D.13. ∎

In the rest of this section, we present the proof of the second-order strong self-concordance of our metric.

Lemma 3.7 (Second order infinity norm self-concordance).

The second derivative of the metric $g_{1}$ of our hybrid barrier is bounded as

[TABLE]

Proof.

The goal is to look at the quadratic form of $\mathrm{D}g(z,v)$ on arbitrary vector $\mathrm{q}$ , i.e. $\mathrm{q}^{\top}\mathrm{D}g(z,v)\mathrm{q}$ and control it with $\|s_{z}\|_{\infty}\|s_{v}\|_{\infty}\|\mathrm{q}\|_{g}^{2}$ . First, we consider each of the subterms as a result of differentiating $(\triangleright 4)$ in Lemma 3.3, in direction $z$ . This derivative is expanded in Lemma 3.4. Regarding the term $(\triangleright\triangleright 1)$ of this expansion in Lemma 3.4, we have

[TABLE]

Next, for the $(\triangleright\triangleright 2)$ term in Lemma 3.4:

[TABLE]

The first part is handled similarly to term $(\triangleright\triangleright 1)$ in Equation (23). For the second part:

[TABLE]

For the $(\triangleright\triangleright 3)$ term in Lemma 3.4:

[TABLE]

where we used Lemma D.7 and D.3. Next, for term $(\triangleright\triangleright 4)$ :

[TABLE]

Terms $(\triangleright\triangleright 5)$ and $(\triangleright\triangleright 6)$ are similar. For term $(\triangleright\triangleright 7)$ , for the first term $\mathrm{A}_{x}^{\top}\mathbf{P}_{x}\odot({\mathrm{R}_{x,z}}\mathbf{P}_{x}{\mathrm{R}_{x,v}})\mathbf{P}_{x}\mathbf{G}_{x}^{-1}\mathbf{\Lambda}_{x}\mathrm{A}_{x}$ , note that

[TABLE]

which can be dealt with similarly to the $(\triangleright\triangleright 1)$ term using Lemma D.1. The second term $\mathrm{A}_{x}^{\top}\mathbf{P}_{x}\odot(\mathbf{P}_{x}{\mathrm{R}_{x,z}}{\mathrm{R}_{x,v}}\mathbf{P}_{x})\mathbf{G}_{x}^{-1}\mathbf{\Lambda}_{x}\mathrm{A}_{x}$ in $(\triangleright\triangleright 7)$ is also similar to $(\triangleright\triangleright 1)$. For the last term in $(\triangleright\triangleright 7)$, note that

[TABLE]

which implies

[TABLE]

As a result,

[TABLE]

The bound for term $(\triangleright\triangleright 8)$ in Lemma 3.4 follows similarly, using Lemma D.14:

[TABLE]

Next, we move on to bound the directional derivative of term $(\triangleright 5)$ in Lemma 3.3, in direction $z$ . This derivative is calculated in Lemma E.3 in the Appendix. For subterm $(\triangleright\triangleright 1)$ of $(\triangleright 5)$ defined in Lemma E.3, using Lemma D.13:

[TABLE]

For subterm $(\triangleright\triangleright 2)$ of $(\triangleright 5)$ defined in Lemma E.3, using Lemmas D.15 and D.3, we have:

[TABLE]

For subterm $(\triangleright\triangleright 3)$ of $(\triangleright 5)$ , using Lemmas D.3, D.4, and D.13:

[TABLE]

Subterm $(\triangleright\triangleright 4)$ of $(\triangleright 5)$ is similar to $(\triangleright\triangleright 3)$ and subterm $(\triangleright\triangleright 5)$ is similar to subterm $(\triangleright\triangleright 1)$ .

Now considering the second formulation of the metric presented in Lemma 3.1, in Equation (21), above we handled the case where one of the directional derivatives, with respect to either $v$ or $z$ , hits the $P^{(2)}$ part in the last term of the metric in Equation (21). Hence, regarding this last term, the remaining terms in its derivative are the ones for which the derivative with respect to both of $v$ and $z$ hit either the $\mathrm{A}_{x}$ matrix or the $\mathbf{G}_{x}^{-1}$ matrix, i.e.

[TABLE]

All of the terms in (25) can be bounded by $O(\|s_{\ell}\|_{w}^{2}\|s_{z}\|_{\infty}\|s_{v}\|_{\infty})$ . For terms in the first line of Equation (25) we use Lemmas D.3 and D.9. For the second line we use Lemmas D.3 and D.4, and D.1. The $O(\|\mathrm{q}\|_{g_{1}}^{2}\|s_{z}\|_{\infty}\|s_{v}\|_{\infty})$ bound on the rest of the terms in Equation (25) follows from Lemmas D.3 and D.1 as well. Hence, overall we have shown for the last term in Equation (21):

[TABLE]

On the other hand, the derivatives of the initial terms $\mathrm{A}_{x}^{\top}\mathbf{W}_{x}\mathrm{A}_{x}$, $\mathrm{A}_{x}^{\top}\Lambda\mathrm{A}_{x}$, $\mathrm{A}_{x}^{\top}\mathbf{G}_{x}\mathrm{A}_{x}$ in Equation (21) are handled similarly using Lemmas D.12, D.9, D.3, and D.1. This completes the proof of the second-order strong self-concordance for $g_{1}$.

∎

Next, we move on to third-order self-concordance. Here the number of terms grows quite large, but fortunately they can all be bounded using a similar approach. Hence, to convey the essential ideas and derivations, we omit the proofs for the similar terms and only illustrate with the directional derivative of the $(\triangleright 4)$ term in Lemma 3.3, which is the most complicated to handle. We state our final result for the directional derivatives of $(\triangleright 4)$ in Lemma 3.8 below (for the proof, see Appendix C).

Lemma 3.8 (Second derivative of $(\triangleright 4)$ ).

Let $\mathbf{B}_{x,v}$ be the symmetrized version of the $(\triangleright 4)$ term in Lemma 3.3:

[TABLE]

where recall $\mathbf{\tilde{P}}_{x,v}=\mathbf{P}_{x}{\mathrm{R}_{x,v}}\mathbf{P}_{x}$. Then the second derivative of $\mathbf{B}_{x,v}$ in directions $z$ and $u$ can be spectrally controlled by the metric as follows:

[TABLE]

Finally, it is not hard to see that the log barrier also satisfies the infinity norm strong self-concordance. For completeness, we state this in the following Lemma, proved in Appendix D.6.

Lemma 3.9 (Infinity self-concordance of the log barrier).

The metric $g_{2}=\nabla^{2}\phi_{\ell}$ regarding the log barrier $\phi_{\ell}(x)=-\sum_{i=1}^{m}\log(a_{i}^{T}x-b_{i})$ in the polytope satisfies infinity norm third order strong self-concordance:

[TABLE]

Combining Lemma 3.9 with the infinity norm self-concordance of the $p$ Lewis weights metric proves the infinity self-concordance of the metric regarding our hybrid barrier.
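To make the "pull out the infinity norm" idea concrete for the log barrier, the sketch below numerically checks the first-order bound $|q^{\top}\mathrm{D}g_{2}(v)q|\leq 2\|s_{v}\|_{\infty}\,q^{\top}g_{2}q$. It assumes $(s_{v})_{i}=a_{i}^{\top}v/(a_{i}^{\top}x-b_{i})$ and uses $g_{2}=\sum_{i}a_{i}a_{i}^{\top}/(a_{i}^{\top}x-b_{i})^{2}$, so that $q^{\top}\mathrm{D}g_{2}(v)q=-2\sum_{i}(s_{v})_{i}(s_{q})_{i}^{2}$. The function name `log_barrier_forms` and the example polytope are hypothetical.

```python
def log_barrier_forms(rows, offsets, x, v, q):
    """Return (q^T g q, q^T Dg(v) q, ||s_v||_inf) for phi(x) = -sum_i log(a_i^T x - b_i)."""
    quad, dquad, s_inf = 0.0, 0.0, 0.0
    for a, b in zip(rows, offsets):
        slack = sum(ai * xi for ai, xi in zip(a, x)) - b
        s_v = sum(ai * vi for ai, vi in zip(a, v)) / slack  # rescaled direction v
        s_q = sum(ai * qi for ai, qi in zip(a, q)) / slack  # rescaled test vector q
        quad += s_q * s_q
        dquad += -2.0 * s_v * s_q * s_q
        s_inf = max(s_inf, abs(s_v))
    return quad, dquad, s_inf
```

Each summand of the derivative is the corresponding summand of the metric scaled by $-2(s_{v})_{i}$, which is why only the largest entry of $s_{v}$ matters.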

Proof of Lemmas 1.4 and 1.3.

The proof of Lemma 1.4 is a direct consequence of Lemmas 3.6, 3.7, C.1, and 3.9. The proof of Lemma 1.3 follows from Lemma 1.4 and the fact that the $\|.\|_{x,\infty}$ norm can be upper bounded by the $\|.\|_{g}$ norm according to Lemma 7.4. ∎

4 Bounding conductance and mixing time

The goal of this section is to show how we combine the different pieces to prove Theorem 1.1. To this end, we prove a general-purpose mixing time bound on a manifold in Theorem 4.1. The key to proving Theorem 4.1 is Lemma 4.6, whose proof we defer to later. We start by defining the important concept of a "nice set," which links the initial velocity $v_{x_{0}}$ to $(R_{1},R_{2},R_{3})$ normality.

Definition 10 (Nice set).

Given $x_{0}\in\mathcal{M}$ , we say a set $Q_{x_{0}}\subseteq T_{x_{0}}(\mathcal{M})$ is $(R_{1},R_{2},R_{3},\delta)$ -nice if for $v_{x_{0}}\sim\mathcal{N}(0,g(x_{0})^{-1})$ , we have

1. $\mathbb{P}(v_{x_{0}}\notin Q_{x_{0}})\leq 0.001$.

2. For every $x_{1}$ with $d(x_{1},x_{0})\leq\delta$ , the Hamiltonian family of curves between $x_{0}$ and $x_{1}$ ending at $Ham^{\delta}(x_{0},v_{0})$ is $(R_{1},R_{2},R_{3})$ -normal.

Theorem 4.1.

Suppose we want to sample from some distribution $\pi$ on the manifold $\mathcal{M}$ , starting from distribution $\pi_{0}$ with $M=\sup_{x\in\mathcal{M}}\frac{d\pi_{0}(x)}{d\pi(x)}$ . Suppose there exists a set $S\subseteq\mathcal{M}$ with $\pi(S)\geq 1-O(\epsilon/M)$ , such that for every $x_{0}\in S$ there exists an $(R_{1},R_{2},R_{3},\delta)$ -nice set $Q_{x_{0}}\subseteq T_{x_{0}}(\mathcal{M})$ . Moreover, let $\psi$ be the isoperimetric constant of the pair $(\mathcal{M},g)$ . Then, for any $\delta$ satisfying $\delta^{2}R_{1}\leq 1$ , $\delta^{2}R_{3}\leq 1$ , $\delta^{3}R_{2}\leq 1$ , the mixing time to reach a distribution within TV distance $\epsilon$ of $\pi$ is bounded by

[TABLE]

Proof.

Now with this choice of $\delta$ , Lemma 4.6, which given a nice set for $x_{0}$ shows a bound on the closeness of the one step distributions, implies for every $x_{0}\in S$ and every $x_{1}$ within distance $d(x_{0},x_{1})\leq\delta$ :

[TABLE]

Using Theorem 2.7, for $\rho=\mathbb{P}(S^{c})=O(\epsilon/M)$ we get a lower bound on the $s$ -conductance for $s=O(\epsilon/M)$ :

[TABLE]

Now using Lemma 2.6 with the same choice of $s$ ,

[TABLE]

where we used the fact that $H_{s}\leq Ms=O(\epsilon)$ (recall the definition of $M$) and the fact that we pick $t$ of the order $\log(M)(\psi\delta)^{-2}$, since $H_{s}/s\leq M$. The proof is complete.

∎
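The way the step count in the proof of Theorem 4.1 arises can be sketched numerically. The code below assumes the standard Lovász-Simonovits form of the bound from [23], $d_{TV}(\pi_{t},\pi)\leq H_{s}+\frac{H_{s}}{s}(1-\Phi_{s}^{2}/2)^{t}$, which may differ in constants from the display in Lemma 2.6; the function names and the choice $s=\epsilon/(2M)$ are illustrative assumptions.

```python
import math


def lovasz_simonovits_bound(h_s, s, phi_s, t):
    """Assumed standard form: H_s + (H_s / s) * (1 - phi_s^2 / 2)^t."""
    return h_s + (h_s / s) * (1.0 - phi_s ** 2 / 2.0) ** t


def steps_for_tv(eps, warmness, phi_s):
    """A step count t making the assumed bound at most eps, with s = eps / (2 M)."""
    s = eps / (2.0 * warmness)
    h_s = warmness * s  # uses H_s <= M * s for an M-warm start
    # Need M * (1 - phi_s^2 / 2)^t <= eps / 2.
    t = math.ceil(math.log(2.0 * warmness / eps) / -math.log(1.0 - phi_s ** 2 / 2.0))
    return t, lovasz_simonovits_bound(h_s, s, phi_s, t)
```

Since $-\log(1-u)\geq u$, the returned $t$ is at most $2\log(2M/\epsilon)/\Phi_{s}^{2}+1$, so an $s$-conductance of order $\psi\delta$ gives a step count of order $(\psi\delta)^{-2}$ up to the logarithmic factor.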

What remains is to prove Lemma 4.6 on the closeness of the one-step distributions of the Markov chain, which is the main content of this section. This lemma is one of the main building blocks, in addition to the isoperimetry of the target measure, for bounding the conductance of the chain, and hence is vital in proving Theorem 4.1.

To prove Lemma 4.6, we start with some definitions. The overall plan is that we approximate the density of a Hamiltonian step as written in Equation (26) as in Equation (27) and bound its change going from $x_{0}$ to $x_{1}$ for most of the vectors $v_{x_{0}}$ within a nice set in the tangent space of $x_{0}$ .

Definition 11.

Consider a family of Hamiltonian curves $\gamma_{s}(t)$ for time interval $s,t\in[0,\delta]$ all ending at $y$ , where $\gamma(0)=x$ , and $\gamma^{\prime}(0)=v_{x}$ . Define the local push-forward density of $v_{x}\sim\mathcal{N}(0,g^{-1})$ onto $y$ by

[TABLE]

where $J^{v_{x}}_{y}$ is the inverse Jacobian of the Hamiltonian map after time $\delta$, denoted by $Ham^{\delta}$, which sends $v_{x}$ to $y$. We consider the Jacobian as an operator between the tangent spaces. The push-forward density at $y$ with respect to the manifold measure is given by

[TABLE]

Note that $dg(y)$ refers to the manifold measure. Define the approximate local push-forward density of $v_{x}$ as

[TABLE]

Lemma 4.2 (Lemma 22 in [21]).

For an $R_{1}$ -normal Hamiltonian curve, for $0\leq\delta^{2}\leq\frac{1}{R_{1}}$ we have

[TABLE]

Lemma 4.3 (Lemma 32 in [21]).

In the setting of Lemma 4.4, for an $(R_{1},R_{3})$ normal $\gamma_{0}$ , denoting $\frac{d}{ds}\gamma_{s}(0)$ by $z$ , we have

[TABLE]

Lemma 4.4 (Change of the pushforward density).

Consider the family of smooth Hamiltonian curves $\gamma_{s}(t)$ up to time $\delta$ from $x_{0}$ to $x_{1}$ pointing towards $y$, namely $\gamma_{0}(0)=x_{0}$, $\gamma_{0}(\delta)=y$, and $\gamma_{s}^{\prime}(0)=v_{x}$ for a point $x=\gamma_{s}(0)$ along the geodesic between $x_{0}$ and $x_{1}$, whose tangent to the geodesic is $z\triangleq\frac{d}{ds}\gamma_{s}(0)$. Then, given that $\gamma_{s}(t)$ is $(R_{1},R_{2},R_{3})$ normal for $0\leq s,t\leq\delta$ and $\delta^{2}\leq\frac{1}{R_{1}}$, we have

[TABLE]

Proof.

Simply differentiating Equation (27):

[TABLE]

where we used Lemma 4.3. Furthermore, using Lemma 5.5 and noting our assumption $\|z\|=\|\frac{d}{ds}\gamma_{s}(0)\|=1$ :

[TABLE]

∎

Lemma 4.5 (Change in probability of events under approximate density).

Let $Q_{x_{0}}\subseteq T_{x_{0}}(\mathcal{M})$ be an $(R_{1},R_{2},R_{3},\delta)$-nice set in the tangent space of $x_{0}$ and let $x$ be an arbitrary point on the geodesic between $x_{0}$ and $x_{1}$. For a vector $v_{x}$ in the tangent space of $x$ with $Ham^{\delta}(x,v_{x})=y$, we can consider the family of Hamiltonian curves $\gamma_{s}(t)$ between $x_{0}=\gamma_{0}(0)$ and $x_{1}=\gamma_{\delta}(0)$ with $\gamma_{s}(\delta)=y$ for all $0\leq s\leq\delta$. Now let $p_{n}$ be the finite measure obtained by restricting the normal distribution in the tangent space of $x$ to vectors $v_{x}$ for which the corresponding $v_{x_{0}}=\gamma^{\prime}_{0}(0)\in Q_{x_{0}}$. For a point $y\in\mathcal{M}$, let $\tilde{P}_{x}^{n}(y)$ be the approximate pushforward density of $p_{n}$ onto $\mathcal{M}$, defined as

[TABLE]

where $\tilde{P}_{x}^{v_{x}}(y)$ is defined in (27). We define $\tilde{P}^{n}(.)$ to be the corresponding finite measure. Now given a fixed event $Y\subset\mathcal{M}$ with probability $\tilde{P}^{n}(Y)\geq n^{-10}$ , we have

[TABLE]

and for all $Y$ :

[TABLE]

Note that $\tilde{P}_{x}^{n}$ depends on $x=\gamma_{s}(0)$ , and we are fixing the set $Q_{x_{0}}$ in the tangent space of $x_{0}$ .

Proof.

Let $\tilde{P}_{1}^{n}$ be the density of further restricting $\tilde{P}^{n}$ to $v_{x}$ ’s for which $\langle v_{x},z\rangle\lesssim 1$ where recall $z\triangleq\frac{d}{ds}\gamma_{s}(0)$ , and $\tilde{P}_{2}^{n}$ be such that $\tilde{P}^{n}(y)=\tilde{P}_{1}^{n}(y)+\tilde{P}_{2}^{n}(y)$ . Note that

[TABLE]

But note that for the first term

[TABLE]

To see why the second line holds, note that the Hamiltonian curve from $x$ to $y$ is $(R_{1},R_{2},R_{3})$ normal for time $t\in(0,\delta)$ by our assumption, so the second line follows from Lemma 4.4. The third line follows simply from the restriction $\langle v_{x},z\rangle\leq 1$.

Similarly for the second term

[TABLE]

where we used $|\langle v_{x},z\rangle|\leq\|v_{x}\|_{g}\|z\|_{g}$ . Combining these and putting back in (31) implies

[TABLE]

To show case (30), using the fact that the densities of $\tilde{P}^{n}$ and $P^{n}$ are within a constant factor of one another (28):

[TABLE]

which follows from assumption on $Y$ while

[TABLE]

which follows because $1\lesssim\langle v_{x},z\rangle$ is a low-probability event by the Gaussian tail bound. This completes the proof. ∎
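The tail estimate invoked at the end of this proof can be illustrated as follows. Assuming the inner product is the one induced by the metric, for $v_{x}\sim\mathcal{N}(0,g^{-1})$ and $\|z\|_{g}=1$ the scalar $\langle v_{x},z\rangle=v_{x}^{\top}gz$ has variance $z^{\top}gz=1$, so it is a standard normal variable. The helper name `gaussian_tail` is hypothetical, and the thresholds in the proof hide logarithmic factors that this sketch does not model.

```python
import math


def gaussian_tail(c):
    """P(|X| > c) for X ~ N(0, 1), via the complementary error function."""
    return math.erfc(c / math.sqrt(2.0))
```

Because the tail decays like $e^{-c^{2}/2}$, a threshold of order $\sqrt{\log n}$ already makes the excluded event polynomially small in $n$.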

Using the bounds on smoothness, we will show that one-step distributions of RHMC from two nearby points will have large overlap (and hence TV distance less than $1$ ).

Lemma 4.6 (One-step coupling for RHMC).

Consider two points $x_{0}$ and $x_{1}$ and suppose $Q_{x_{0}}$ is a $(R_{1},R_{2},R_{3},\delta)$ -nice set in the tangent space of $x_{0}$ . Now given step size $\delta$ such that $\delta^{2}\leq\frac{1}{R_{1}},\ \delta^{3}R_{2}\leq 1,\ \delta^{2}R_{3}\leq 1$ and close by point $x_{1}$ such that $d(x_{0},x_{1})\leq\delta$ , where $d$ is the distance on the manifold, the total variation distance between $P_{x_{0}}$ and $P_{x_{1}}$ is bounded by $0.01$ .
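The three step-size constraints in Lemma 4.6 determine the largest admissible $\delta$ once the normality parameters are known. A minimal sketch, with the hypothetical helper name `max_step_size` and arbitrary positive inputs standing in for $R_{1},R_{2},R_{3}$:

```python
def max_step_size(r1, r2, r3):
    """Largest delta with delta^2 * r1 <= 1, delta^3 * r2 <= 1 and delta^2 * r3 <= 1."""
    return min(r1 ** -0.5, r2 ** (-1.0 / 3.0), r3 ** -0.5)
```

Whichever of $R_{1}^{-1/2}$, $R_{2}^{-1/3}$, $R_{3}^{-1/2}$ is smallest is the binding constraint, and since the mixing bound of Theorem 4.1 scales with $\delta^{-2}$, that parameter drives the final rate.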

Proof.

Similar to (29), we define

[TABLE]

First, note that for any event $Z\subseteq\mathcal{M}$ , we have using Lemma 6.7

[TABLE]

Suppose $Y\subseteq\mathcal{M}$ is a set for which

[TABLE]

This means $P_{x_{0}}(Y)\geq 0.01$ , and in particular from (78)

[TABLE]

which also implies

[TABLE]

Now from (28) we have $\tilde{P}^{n}(Y)\geq 0.001$ . But now using the assumptions on $R_{2}$ and $R_{3}$ and plugging it into Equation (30) in Lemma 4.5 we can state

[TABLE]

which implies at time $s=\delta$ we have

[TABLE]

or in other words

[TABLE]

Now again applying the constant boundedness of the ratio between $\tilde{P}^{n}$ and $P^{n}$ , we obtain

[TABLE]

By picking small enough constants, Equation (33) implies

[TABLE]

This further implies from (78):

[TABLE]

which contradicts Equation (32). This completes the proof. ∎

Finally, combining Theorems 4.1 and 1.8 and Lemma 1.7, we prove the main result, Theorem 1.1.

Proof of Theorem 1.1.

Given a fixed parameter $c>1$ , using Lemma 6.7, there exists a high probability set $S=S_{c}\subseteq\mathcal{M}$ ,

[TABLE]

such that every $x_{0}\in S$ has a corresponding nice set $Q_{x_{0}}\subseteq T_{x_{0}}(\mathcal{M})$.

(Recall $\pi$ is the distribution supported on the polytope with density $e^{-\alpha\phi}$ .)

Now for the same arbitrary $c>1$ considered above, we wish to satisfy the conditions of Theorem 4.1 on $\delta$, namely $\delta^{2}R_{1}(c)\leq 1$, $\delta^{2}R_{3}(c)\leq 1$, $\delta^{3}R_{2}(c)\leq 1$ (we use this notation to emphasize that $R_{1},R_{2},R_{3}$ are functions of $c$). According to Theorem 1.8, these parameters can be set as:

[TABLE]

In addition, Lemma 1.7 imposes the following condition on $\delta$:

[TABLE]

Hence, the conditions on $\delta$ translate into

[TABLE]

Note that a sufficient condition on $\delta$ which satisfies all of the above constraints is (assuming $c\geq 1$ )

[TABLE]

Now to satisfy the condition $P(S)\geq 1-O(\epsilon)$ in Theorem 4.1, noting Equation (34), we set

[TABLE]

On the other hand, from Theorem 1.2, we see that for the choice of $p=4-\lambda$ converging to $4$ from below ( $\lambda$ is a small constant), the square of the isoperimetry constant is $\psi^{2}=\Theta(\max\{m^{-\frac{2/p}{2/p+1}}n^{-\frac{1}{2/p+1}},\alpha\})$ . Now plugging this $\psi$ and $\delta$ from (35) into Theorem 4.1 and noting the choice of $c$ we get the following mixing bound:

[TABLE]

But it is easy to check that picking $\lambda=\Theta(1/\log(n))$ only adds a $1/poly(\log(m))$ factor to $\delta$. Note that with this choice of $\lambda$, we have $m^{-\frac{2/p}{2/p+1}}n^{-\frac{1}{2/p+1}}=\Theta(n^{-2/3}m^{-1/3})$, hence the mixing time becomes

[TABLE]
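The exponent arithmetic behind this step can be checked directly. The sketch below evaluates the polytope term $m^{-\frac{2/p}{2/p+1}}n^{-\frac{1}{2/p+1}}$ of $\psi^{2}$ and compares it with its $p=4$ limit $m^{-1/3}n^{-2/3}$; the function names are hypothetical, and it assumes $\lambda=1/\log(m)$ as in Section 3, purely for illustration.

```python
import math


def isoperimetry_exponents(p):
    """Exponents (a, b) such that the polytope term equals m**(-a) * n**(-b)."""
    r = 2.0 / p
    return r / (r + 1.0), 1.0 / (r + 1.0)


def ratio_to_limit(m, n):
    """Ratio of the term at p = 4 - 1/log(m) to its p = 4 limit m**(-1/3) * n**(-2/3)."""
    lam = 1.0 / math.log(m)
    a, b = isoperimetry_exponents(4.0 - lam)
    log_ratio = -(a - 1.0 / 3.0) * math.log(m) - (b - 2.0 / 3.0) * math.log(n)
    return math.exp(log_ratio)
```

At $p=4$ the exponents are exactly $(1/3,2/3)$, and for $n\leq m$ the ratio stays within a constant factor of $1$, which is why the slightly smaller $p$ costs only constants in this term.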

But note that if $\alpha\sqrt{\alpha_{0}}n^{1/2}\geq n^{2/3}$ or $n^{2/3}(\alpha\sqrt{\alpha_{0}})^{2/3}\geq n^{2/3}$ , then $\alpha^{-1}\leq n^{2/3}m^{1/3}$ . Hence, the mixing time boils down to

[TABLE]

∎

5 On the Geometry and Stability of Hessian Manifolds

In this section, we prove the smoothness of the operator $\Phi(t)$; namely, we show that a nice Hamiltonian curve is $(R_{1},R_{2},R_{3})$ normal. Our proof does not open up the definition of the metric $g$ and its derivatives for our hybrid barrier. Instead, we exploit the strong self-concordance property in Lemma 1.4 to show the desired smoothness bounds, so our framework can potentially be applied in other settings. Interestingly, to bound the trace of certain operators that arise in bounding the smoothness of Hamiltonian curves on the manifold, it turns out that writing them as averages of random low-rank tensors enables us to apply our strong self-concordance estimates more efficiently and obtain bounds strong enough to improve the mixing time.

5.1 Bounding $R_{1}$

Lemma 5.1.

For the parameter $R_{1}$ bounding the Frobenius norm of $\Phi(t)$, given the bounds $\|s_{v}\|_{\infty}\lesssim c$ and $\|v\|_{g}\lesssim c\sqrt{n}$ (note that the vector $v$ is inherent in the definition of $\Phi$), we have

[TABLE]

Proof.

Directly follows from Lemmas 5.16 and 5.17. ∎

First, recall the definition of the Frobenius norm:

[TABLE]

To bound $R_{1}$ , i.e. the Frobenius norm of $\Phi(t)$ , note that

[TABLE]

where $R$ is the Riemann tensor and $M$ is obtained from the bias vector $\mu$ . In particular, we have

[TABLE]

We start from the Riemann tensor. The proof of this bound follows directly from the infinity norm second-order self-concordance of $g$ .

Lemma 5.2 (Frobenius norm of random Riemann tensor).

Assuming $\|s_{v}\|_{\infty}\lesssim c,\ \|v\|_{g}\lesssim c\sqrt{n}$ , we have

[TABLE]

Proof.

For the first term of $R(.,v)v$ as written in (36):

[TABLE]

For the second term of the Riemann tensor:

[TABLE]

∎

Lemma 5.2 states $\sqrt{n}$ as an upper bound on the Frobenius norm of $R(\ell,v)v$ given that the curve is nice.

Next, we prove a lemma regarding the expansion of the operator $M$ , applying the covariant derivative.

Lemma 5.3 (Subterms for operator $M$ ).

We have the following expansion for the subterms of operator $M$ :

[TABLE]

where

[TABLE]

Moreover,

[TABLE]

Proof.

By differentiating the first term:

[TABLE]

But noting that $\nabla(\alpha\phi)=g^{-1}\mathrm{D}(\alpha\phi)$ , the first and third terms are the same and we get the result. For the second term:

[TABLE]

Finally, for the second argument of the Lemma

[TABLE]

∎

Next, we bound the Frobenius norm of the $M$ part in the following lemma, again only using infinity norm second-order self-concordance of $g$ to bound each of the four terms.

Lemma 5.4 (Frobenius norm of operator $M$ ).

We have

[TABLE]

Proof.

To bound the Frobenius norm of the first part of the first term of operator $M$ stated in Lemma 5.3:

[TABLE]

where in the second line we rewrite $v_{1}^{\top}Dg(\nabla\phi)$ as $\nabla\phi^{\top}Dg(v_{1})$, which is valid due to the symmetry of the derivatives of the metric on Hessian manifolds, i.e. $\partial_{k}g_{ij}=\partial_{i}g_{jk}=\partial_{j}g_{ik}$. Furthermore, we used Lemma 5.8 in the last line. For the second part of the first term of $M$, note that $D^{2}\phi=g$, so the Frobenius norm is at most $n$ automatically. Next, for the first part of the second term of $M$, again based on Lemma 5.3:

[TABLE]

where in the last line we used Lemma 5.11. For the second part of the second term of $M$ , from Lemma 5.3:

[TABLE]

For the first part:

[TABLE]

For the second part:

[TABLE]

∎
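The symmetry of the metric derivatives used in the proof of Lemma 5.4 can be spelled out in one line. The block below is a short worked derivation, assuming only that $g=\nabla^{2}\phi$ for a smooth potential $\phi$; the vectors $u,v,w$ are generic placeholders.

```latex
% Assumption: g = \nabla^2 \phi for a smooth potential \phi (Hessian manifold).
\partial_k g_{ij} = \partial_i \partial_j \partial_k \phi
  \quad\Longrightarrow\quad
  \partial_k g_{ij} = \partial_i g_{jk} = \partial_j g_{ik},
\qquad\text{so}\qquad
u^\top \mathrm{D}g(w)\, v
  = \sum_{i,j,k} \partial_k g_{ij}\, u_i v_j w_k
  = w^\top \mathrm{D}g(u)\, v .
```

Taking $u=v_{1}$ and $w=\nabla\phi$ gives the swap of the direction and the left argument used in that proof.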

Combining Lemmas 5.4 and 5.2 concludes

[TABLE]

5.2 Bounding $R_{2}$

Here we state the bound on $R_{2}$ .

Lemma 5.5.

For point $x=\gamma_{s}(t)$ on a $(c,\delta)$ -nice Hamiltonian curve with $v=\gamma^{\prime}_{s}(t)$ , namely that $\|s_{\gamma^{\prime}_{s}}\|_{\infty}\leq c$ and $\|\gamma^{\prime}_{s}\|_{g}\leq c\sqrt{n}$ along the curve up to time $t=\delta$ , suppose now we move on the unit direction $z$ parameterized by $s$ . Then, the change in the trace of the operator $\Phi$ can be bounded as

[TABLE]

Proof.

Directly from Lemmas 5.6 and 5.14. ∎

In sections 5.2.1 and 5.2.2, we bound the change in the $M$ part and the Ricci part of $\Phi$ respectively.

5.2.1 Bounding the change in Operator $M_{x}$

Given a distribution $e^{-\phi(x)}$ that we want to sample from, we study the properties of the derivatives of the corresponding operator $M$ which is defined as

[TABLE]

where

[TABLE]

Recall from Lemma 5.3:

[TABLE]

where we defined matrices $A_{1}(v_{1})$ and $A_{2}(v_{1})$ . Here we introduce the main lemma of this section which bounds the derivative of the trace of $M$ :

Lemma 5.6 (Bound on the change of operator $M$ ).

For operator $M$ defined in (40) for any unit direction $z$ we have

[TABLE]

Proof.

To prove Lemma 5.6, we bound the derivative of $tr(A_{1})$ and $tr(A_{2})$ in direction $z$ separately in Lemmas 5.7 and 5.9. As a result, the proof of Lemma 5.6 directly follows from Lemmas 5.7 and 5.9. ∎

We start from $tr(A_{1})$ in the following Lemma.

Lemma 5.7 (Trace of $A_{1}$ ).

Regarding the operator $A_{1}(v_{1})=\nabla_{v_{1}}(\nabla(\alpha\phi))$ , we have

[TABLE]

Proof.

Note that from Lemma 5.3:

[TABLE]

For the second part, note that $D^{2}(\alpha\phi)=\alpha g$ . Hence

[TABLE]

So we only need to handle the derivative of the first part. First, we bound the $g$ -norm of the vector $\nabla\phi$ in the following helper lemma.

Lemma 5.8.

For the gradient of the potential $\phi$ we have

[TABLE]

Proof.

We decompose the potential as $(\alpha\phi)=\phi_{1}+\phi_{2}$ for

[TABLE]

where $\mathbf{W}_{x}$ are the $p$ -Lewis weights.

Now using Lemma D.28, we have

[TABLE]

where $\mathbf{\hat{P}}_{x}=\mathbf{P}(\mathbf{W}_{x}^{1/2}\mathrm{A}_{x})$ is the projection matrix of the matrix $\mathbf{W}_{x}^{1/2}\mathrm{A}_{x}$ reweighted by the Lewis weights $w_{x}$. Note that we are using Lemma 2.1 to conclude that $\mathrm{A}_{x}^{\top}\mathbf{W}_{x}\mathrm{A}_{x}\preccurlyeq g$. For the log barrier part, similarly:

[TABLE]

which completes the proof. Now we handle the first term of the $M$ operator, namely the first term in (41) using the helper Lemmas. ∎

Now we go back to bounding the first term in (42), which we can expand as

[TABLE]

For the first term in (43), according to Lemma 5.8:

[TABLE]

where we used Lemma D.27 to bound $\mathbb{E}_{v^{\prime}}\|s_{v^{\prime}}\|_{\infty}\sqrt{{v^{\prime}}^{T}gv^{\prime}}$ and used Lemma 7.4. For the second term in (43), we follow a similar reasoning:

[TABLE]

Therefore, bounding $\texttt{tr}(g^{-1}\mathrm{D}g(\mathrm{D}(\nabla(\alpha\phi))(z)))$ boils down to bounding $\|\mathrm{D}(\nabla(\alpha\phi))(z)\|_{g}$ . Focusing on the subterm of ${\mathrm{D}(\nabla(\alpha\phi))(z)}^{\top}g\mathrm{D}(\nabla(\alpha\phi))(z)$ regarding $\phi_{1}$ , namely $\|\mathrm{D}(\nabla\phi_{1})(z)\|_{g}$

[TABLE]

where we used Lemma 5.8 and Lemma 7.4. Similarly for $\phi_{2}$ :

[TABLE]

where we used Lemma 7.4. Combining the above with the inequality

[TABLE]

and plugging back into Equation (45) yields the following bound on the second term in Equation (43):

[TABLE]

For the third term in (43), we reduce it to the first group of terms. Note that

[TABLE]

which is the same upper bound obtained in Equations (44) and (48). Hence, combining Equations (44), (48), and (49), we conclude

[TABLE]

∎

Next, we focus on the second term in (41) and bound the derivative of the trace of the operator $A_{2}(v_{1})=\nabla_{v_{1}}(g^{-1}\texttt{tr}(g^{-1}\mathrm{D}g))$ .

Lemma 5.9 (Trace of $A_{2}$ ).

For operator $A_{2}(v_{1})=\nabla_{v_{1}}(g^{-1}\texttt{tr}(g^{-1}\mathrm{D}g))$ as defined in Equation (41) we have

[TABLE]

Let

[TABLE]

Proof.

From Lemma 5.3, we have

[TABLE]

We bound the derivatives of the two terms in Equation (50) separately in Lemmas 5.10 and 5.13. Hence, the proof of Lemma 5.9 directly follows from these Lemmas. ∎

We start from bounding the derivative of the first term in Equation (50), i.e. we wish to bound $|D(tr(g^{-1}Dg(\xi)))(z)|$ .

Lemma 5.10.

Regarding the first quadratic form in Equation (50), we can bound its trace as

[TABLE]

Proof.

To this end, we repeat an argument similar to the one we used in Equation (43) for bounding

[TABLE]

In particular, our argument regarding $\nabla(\alpha\phi)$ in Equations (46) and (47) only cares about the bound on $\|\nabla(\alpha\phi)\|_{g}$ and $\|\mathrm{D}(\nabla(\alpha\phi))(z)\|_{g}$ . We show a similar bound for $\xi$ . As a warmup, we start by bounding the norm $\|\xi\|_{g}$ , then we move on to bounding $\|\mathrm{D}(\xi)(z)\|_{g}$ .

Lemma 5.11.

We have

[TABLE]

Proof.

We have

[TABLE]

where $\texttt{tr}(g^{-1}\mathrm{D}g)$ is a vector with its $i$ th entry equal to $tr(g^{-1}{D_{i}g})$ . The first inequality above is due to Cauchy-Schwarz, and the second one is due to Lemma D.27. ∎

Furthermore, we have the following bound on $\|\mathrm{D}(\xi)(z)\|_{g}^{2}$ :

Lemma 5.12.

For the derivative of $\xi$ in direction $z$ we have

[TABLE]

Proof.

Note that

[TABLE]

For the first term above,

[TABLE]

following our argument in (51):

[TABLE]

For the second term, we write the second $g^{-1}$ within the trace as an expectation $\mathbb{E}_{v^{\prime}\sim\mathcal{N}(0,g^{-1})}v^{\prime}v^{\prime T}$, i.e.

[TABLE]
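The device of replacing $g^{-1}$ by a Gaussian second moment rests on the identity $\texttt{tr}(g^{-1}B)=\mathbb{E}_{v\sim\mathcal{N}(0,g^{-1})}[v^{\top}Bv]$, which follows from $\mathbb{E}[vv^{\top}]=g^{-1}$. The sketch below checks this by Monte Carlo for $2\times 2$ symmetric matrices; the function names, the matrices, and the sample size are illustrative assumptions.

```python
import math
import random


def exact_trace(g, b):
    """tr(g^{-1} b) for symmetric 2x2 matrices g (positive definite) and b."""
    (g11, g12), (_, g22) = g
    det = g11 * g22 - g12 * g12
    return (g22 * b[0][0] - 2.0 * g12 * b[0][1] + g11 * b[1][1]) / det


def sampled_quadratic_form(g, b, num_samples=200_000, seed=0):
    """Monte Carlo estimate of E[v^T b v] for v ~ N(0, g^{-1})."""
    (g11, g12), (_, g22) = g
    det = g11 * g22 - g12 * g12
    # Covariance g^{-1} and its Cholesky factor L, so that v = L z for standard normal z.
    c11, c12, c22 = g22 / det, -g12 / det, g11 / det
    l11 = math.sqrt(c11)
    l21 = c12 / l11
    l22 = math.sqrt(c22 - l21 * l21)
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        v1, v2 = l11 * z1, l21 * z1 + l22 * z2
        total += b[0][0] * v1 * v1 + 2.0 * b[0][1] * v1 * v2 + b[1][1] * v2 * v2
    return total / num_samples
```

Writing the trace as an average of rank-one quadratic forms is what lets the infinity-norm self-concordance estimates be applied to each random direction separately.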

Therefore, using independent normal vectors $v,v^{\prime}\sim\mathcal{N}(0,g^{-1})$ , we can rewrite the second term as

[TABLE]

where the first inequality follows from Cauchy-Schwarz and the second one follows from Lemma D.29 and the fact that $\mathrm{D}g(v)\lesssim\|s_{v}\|_{\infty}g$ . For the third term similarly

[TABLE]

Combining all three bounds similar to our argument for $\nabla(\alpha\phi)$ we conclude

[TABLE]

∎

According to Lemma 5.12, similar to our bound for $\nabla(\alpha\phi)$ by substituting $w$ with $D\xi(z)$ in Lemma D.31 we get

[TABLE]

Moreover, according to Lemma D.32 and Lemma 5.11:

[TABLE]

Further, using Lemma D.31 combined with Lemma 5.11:

[TABLE]

Hence, combining Equations (53), (54), and (55),

[TABLE]

which completes the bound for the trace of the first part $Dg(\xi)$ of the operator in Equation (50). ∎

Finally, we move on to bound derivative of the trace of the second operator in Equation (50), namely $D(tr(g^{-1}D(g\xi)))(z)$ .

Lemma 5.13.

We can bound the derivative of the trace of the second operator in Equation (50) as

[TABLE]

Proof.

Recall from Lemma 5.3:

[TABLE]

Now we wish to calculate the derivative of the trace of this operator, namely

[TABLE]

We treat separately the case where the derivative with respect to $z$ hits the outer $g^{-1}$ in (58). First, we calculate the derivative with respect to the outer $g^{-1}$ for the term $\texttt{tr}(g^{-1}B_{1})$:

[TABLE]

Note that

[TABLE]

Note that this 2-form is symmetric and PSD since

[TABLE]

Moreover, note that

[TABLE]

Hence, Equation (59) can further be upper bounded as

[TABLE]

But we have already bounded the operator norm of $\texttt{tr}(g^{-1}\mathrm{D}g(v^{\prime})g^{-1}\mathrm{D}g(v^{\prime}))$ in Lemma D.30 by $\tilde{O}(\|s_{v^{\prime}}\|_{\infty}^{2})$, which implies its trace can be at most $n\tilde{O}(\|s_{v^{\prime}}\|_{\infty}^{2})$. Taking expectation, we have

[TABLE]

Hence, we conclude

[TABLE]

On the other hand, note that for the second term in Equation (57), there is a symmetry between the inner and outer $g^{-1}$ :

[TABLE]

Hence, it suffices to bound the case where the derivative with respect to $z$ hits one of them, namely the inner $g^{-1}$ .

Therefore, we move on to taking the derivative with respect to the $D(g\xi)$ part of $tr(g^{-1}D(g\xi))$ . For this, we can again use the trick of writing $g^{-1}$ as $\mathbb{E}_{v\sim\mathcal{N}(0,g^{-1})}vv^{T}$ :

[TABLE]
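The Gaussian trick used here rests on the identity $g^{-1}=\mathbb{E}_{v\sim\mathcal{N}(0,g^{-1})}vv^{T}$ , which turns a trace $\mathrm{tr}(g^{-1}B)$ into the expectation $\mathbb{E}\,v^{T}Bv$ . The following is a minimal numerical sketch of that identity and is not part of the proof; the matrices `g` and `B` are random placeholders chosen only for illustration.

```python
import numpy as np

def sample_gaussian_inv_cov(g, num_samples, rng):
    """Draw samples v ~ N(0, g^{-1}) for a symmetric positive definite g."""
    L = np.linalg.cholesky(g)                      # g = L L^T
    z = rng.standard_normal((g.shape[0], num_samples))
    # v = L^{-T} z has covariance L^{-T} L^{-1} = g^{-1}
    return np.linalg.solve(L.T, z)

def trace_via_expectation(g, B, num_samples, rng):
    """Monte Carlo estimate of tr(g^{-1} B) as the average of v^T B v."""
    v = sample_gaussian_inv_cov(g, num_samples, rng)
    return np.mean(np.einsum("is,ij,js->s", v, B, v))

rng = np.random.default_rng(0)
n = 4
M = rng.standard_normal((n, n))
g = M @ M.T + np.eye(n)          # hypothetical SPD "metric"
C = rng.standard_normal((n, n))
B = C @ C.T                      # hypothetical PSD test matrix

exact = np.trace(np.linalg.solve(g, B))
estimate = trace_via_expectation(g, B, 200_000, rng)
print(exact, estimate)
```

The two printed numbers should agree up to Monte Carlo error, which shrinks like the inverse square root of the number of samples.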

But from Equation (57), we have

[TABLE]

Now taking derivative with respect to $z$ :

[TABLE]

But for the first term in (62), we can write:

[TABLE]

For the second term in (62):

[TABLE]

where we used the third order self-concordance property of $g$ with respect to the infinity norm, as shown in Section C, and also Lemma 7.4. Combining Equations (60), (63), and (64) completes the proof of Lemma 5.13. ∎

5.2.2 Bounding the change in the Ricci Tensor

First, we state the main result of this section, which is a bound on the change of the Ricci tensor.

Lemma 5.14 (Bound on the change of Ricci tensor).

Given the assumptions of Lemma 5.5, we have

[TABLE]

Note that in the above, $v$ is implicitly a function of $s$ as well.

Proof.

According to Lemma A.5, the Ricci tensor has two terms. We start by analyzing the first term:

$A_{1}:=-\frac{1}{4}\texttt{tr}(g^{-1}\mathrm{D}g(v_{1})g^{-1}\mathrm{D}g(v_{2}))$ term

Taking the derivative of this subterm of the Ricci tensor in direction $z$ :

[TABLE]

Now we use Lemmas 3.7 and 3.6 to bound these terms:

[TABLE]

Similarly

[TABLE]

Terms in the derivative of $A_{1}$ that involve the derivative of $v$

Differentiating $v$ with respect to $z$ , we get

[TABLE]

where we used Lemma D.25 to bound $\|\mathrm{D}v(z)\|_{g}$ . ∎

Second part of the Ricci Tensor.

We should take derivative of $v^{\top}Dg(g^{-1}tr(g^{-1}Dg))v$ in direction $z$ , which is the second term in the Ricci tensor according to Lemma A.5. As a warm up, we first bound the value of this term before taking derivative:

Before taking derivative w.r.t $z$

Note that the second part of the Ricci tensor is

[TABLE]

Hence, we only need to bound one of the RHS terms with high probability. We have

[TABLE]

Now to bound the derivative of this part of the Ricci tensor, first we pretend that $v$ is fixed. Then

[TABLE]

which we further bound as

[TABLE]

Next, we take derivative in direction $z$ from the second term of the Ricci tensor.

Taking derivative in direction $z$ .

First, we differentiate the inner $g^{-1}$ term in $v^{T}Dg(v)g^{-1}tr(g^{-1}Dg)$ :

[TABLE]

For the remaining derivatives we can substitute the inner $g^{-1}$ by $\mathbb{E}_{v^{\prime}\sim\mathcal{N}(0,g^{-1})}v^{\prime}{v^{\prime}}^{T}$ . Now for the remaining derivatives which do not involve differentiating $v$ :

[TABLE]

Finally we have to check when $z$ differentiates $v$ :

[TABLE]

where we used Lemma D.25 to bound $\|\mathrm{D}v(z)\|_{g}$ .

5.3 Bounding $R_{3}$

Here we bound the parameter $R_{3}$ which is defined as the maximum possible value of the norm of $\Phi(t)\zeta(t)$ , where $\zeta(t)$ is the parallel transport of the initial velocity. The idea is to bound the infinity norm of $\zeta(t)$ along the Hamiltonian curve, then show a more efficient bound compared to the naive operator norm of $\Phi(t)$ which works with both of the norms $\|s_{\zeta(t)}\|_{\infty}$ and $\|\zeta(t)\|_{g}$ .

Recall the definition of the parameter $R_{3}$ :

[TABLE]

where $\zeta(t)$ is the parallel transport of $\gamma^{\prime}(0)$ along the Hamiltonian curve $\gamma(t)$ .

Lemma 5.15 (Bound on $R_{3}$ ).

Given that $\gamma$ is $(c,\delta)$ -nice, we have

[TABLE]

up to time $\delta$ .

Proof.

From the definition of niceness, we have a $c$ upper bound on the infinity norm $\|s_{\gamma^{\prime}}\|_{\infty}$ . Using that, we can apply Lemma 5.18 to obtain

[TABLE]

Finally combining this with Lemmas 5.16 and 5.17:

[TABLE]

∎

Here we show a norm bound for $\Phi(t)$ which we used to bound $R_{3}$ . To this end, we show bounds on the Riemann tensor $R(\cdot,v)v$ and operator $M$ separately in Lemmas 5.16 and 5.17.

Lemma 5.16 (Operator norm of random Riemann tensor).

Assuming $\|s_{v}\|_{\infty}\lesssim c,\ \|v\|_{g}\lesssim\sqrt{n}$ , we have

[TABLE]

Proof.

Similar to Lemma 5.2, using the form of Riemann expansion in Equation (36):

[TABLE]

∎

Next, we state a similar mixed-norm bound for the operator $M$ .

Lemma 5.17 (Operator norm of $M$ ).

We have

[TABLE]

Proof.

Recall from Lemma 5.3:

[TABLE]

Starting from the first part of the term $\langle\nabla_{v_{1}}(\nabla\phi),v_{2}\rangle$ :

[TABLE]

Note that for the second part, $\mathrm{D}^{2}\phi=g$ , hence the corresponding operator is the identity and has operator norm one.

Next, we move on to the second term of $M$ in (37). For the first part of it from Equation (50), we have:

[TABLE]

where we used Lemma 5.12. For the second part, note that from Equation (38):

[TABLE]

Starting from the first part, now we rewrite this term in a better way as

[TABLE]

Now due to Lemma D.30 the norm of the corresponding operator is one:

[TABLE]

For the second part in (65), we write it as

[TABLE]

Hence, the operator norm is bounded as

[TABLE]

∎

Next, we show a bound on the derivative of the infinity norm of the parallel transported vector $\zeta$ given that we know the infinity norm of $\gamma^{\prime}$ is constant (randomness + stability).

Lemma 5.18 (Infinity norm of the parallel transport).

Given $\delta\leq\frac{1}{c}$ and a $(c,\delta)$ -nice Hamiltonian curve $\gamma$ , we have for $t\leq\delta$ :

[TABLE]

where $\zeta$ is the parallel transport of $\gamma^{\prime}(0)$ along the curve.

Proof.

As $\zeta$ is the parallel transport vector, from opening up the covariant derivative being zero:

[TABLE]

which implies using Lemma 7.4:

[TABLE]

where we used $\|s_{\gamma^{\prime}}\|_{\infty}\lesssim c$ from the definition of niceness and the fact that parallel transport preserves the norm of $\zeta$ and $\|\zeta(0)\|_{g}=\|\gamma^{\prime}(0)\|_{g}\leq\sqrt{n}$ . This ODE implies that, to avoid blow-up, we should pick $\delta\lesssim\frac{1}{c}$ . Under this condition, we further get

[TABLE]

which completes the proof. ∎

In the next section, we show the stability of the infinity norm and the manifold norm of $\gamma^{\prime}$ along the curve $c_{t}(s)$ for $s=0$ to time $\frac{1}{n^{1/3}}$ , where $c_{t}(s)=\gamma_{s}(t)$ is defined for a fixed time $t$ .

6 Stability of Hamiltonian curves

In this section, we show that the niceness property holds for Hamiltonian curves with high probability, and is stable in a family of Hamiltonian curves.

6.1 Stability of the niceness property

Here we show that the niceness property of Hamiltonian curves is stable.

Lemma 6.1 (Stability of norms).

For a family of Hamiltonian curves $\gamma_{s}(t)$ , given that $\gamma_{0}$ is $(c,\delta)$ -nice, then $\gamma_{s}(t)$ is also $(O(c),\delta)$ -nice for all $0\leq s\leq\delta$ . In other words, given that for all $0\leq t\leq\delta$ we have $\|s_{\gamma^{\prime}_{0}(t)}\|_{\infty}\leq c$ and $\|\gamma^{\prime}_{0}(t)\|_{g}\leq\sqrt{n}$ , then for all $0\leq t\leq\delta$ and $0\leq s\leq\delta$ under the condition

[TABLE]

we have:

[TABLE]

Proof.

We denote the time until which we run the Hamiltonian curve by $\delta$ , i.e. $0\leq t\leq\delta$ . Suppose the claim is not true, and let $S$ be the set of times $0\leq s\leq 1/n^{1/3}$ for which $f(s)=\sup_{0\leq t\leq\delta}\|s_{\gamma^{\prime}(t,s)}\|_{\infty}<2c$ . Since $f(s)$ is continuous, the set $S$ is open. Hence, if we consider the infimum $s_{0}$ of times $s$ for which $f(s)\geq 2c$ , then the infimum is attained, i.e. $f(s_{0})=2c$ , while $f(s)<2c$ for every time $s<s_{0}$ . In exactly the same way, defining the function $f_{2}(s)=\sup_{0\leq t\leq\delta}\|\gamma^{\prime}_{s}(t)\|_{g}$ , we can define the first time $s_{1}$ for which $f_{2}(s_{1})=2\sqrt{n}$ , while $f_{2}(s)<2\sqrt{n}$ for $s<s_{1}$ .

First consider the case where $s_{0}\leq s_{1}$ . Now again from the continuity of $f$ and the fact that $[0,\delta]$ is a compact set, its supremum is attained at some time $t_{0}$ . This means

[TABLE]

for all $s<s_{0}$ , while $\|s_{\gamma^{\prime}_{s_{0}}(t_{0})}\|_{\infty}=2c$ . But now using this infinity norm bound for times $s\leq s_{0}$ (for the fixed time $t_{0}$ ), we can obtain a Frobenius norm bound for $\Phi(t_{0})(s)$ from Lemma 5.1 as

[TABLE]

Now we can apply Lemma 23 in [21] because the condition $\delta^{2}R_{1}\lesssim 1$ is satisfied, so we get

[TABLE]

for every $s<s_{0}$ , where we are using the fact that $\|c^{\prime}(s)\|_{g}=1$ . But note that for $s<s_{0}$ we can write

[TABLE]

where the first line follows from opening the definition of covariant derivative. Finally, this ODE implies that $\|s_{\gamma^{\prime}_{s}(t_{0})}\|_{\infty}\lesssim s(\frac{1}{\delta}+c)<2c$ for all times $s<s_{0}$ (with the correct choice of constants), which from continuity holds also for time $s_{0}$ . But this contradicts $\|s_{\gamma^{\prime}_{s_{0}}(t_{0})}\|_{\infty}=2c$ , which completes the proof for the case $s_{0}\leq s_{1}$ . Note that the condition $s_{0}\leq s_{1}$ is used in the above proof only to ensure that the $g$ -norm condition does not fail before time $s_{0}$ .

Next, we consider the latter case $s_{1}<s_{0}$ . As in the above argument, for times $s\leq s_{1}$ we have the Frobenius bound on $\Phi(t)$ from Lemma 5.1, and again from Lemma 23 in [21], as $\delta^{2}\leq\frac{1}{\sqrt{n}(c^{2}+\alpha\sqrt{\alpha_{0}})}=\frac{1}{R_{1}}$ , we have

[TABLE]

for $s\leq s_{1}$ . Now we write an ODE to control the norm $\|\gamma^{\prime}_{s_{1}}(t_{0})\|_{g}$ , where $t_{0}$ is defined in the same way as in the previous case, and get a contradiction:

[TABLE]

which implies

[TABLE]

Therefore, at time $s=\delta/4$ the change in $\|\gamma^{\prime}\|_{g}$ from its initial value is at most $1/2<\sqrt{n}/2$ , which means the value of $\|\gamma^{\prime}\|_{g}$ should have remained below $2\sqrt{n}$ . The contradiction completes the proof for the second case. ∎

Next, we show a helper lemma regarding the derivative of $\gamma^{\prime}_{s}(t)$ in direction $\frac{d}{ds}\gamma_{s}(t)$ :

Lemma 6.2.

On a $(c,\delta)$ -nice Hamiltonian curve with $\delta\leq\frac{1}{n^{1/4}c}$ , we have:

[TABLE]

Proof.

Note that from Lemma 1.7 we have $\|s_{\gamma^{\prime}_{s}(t_{0})}\|_{\infty}\leq c$ . Hence, from Lemma 5.1, we can apply Lemma 23 in [21] to obtain

[TABLE]

But now from Lemma D.26, setting $v=\gamma^{\prime}(t,s)$ and $z=\frac{d}{ds}\gamma(t,s)$ :

[TABLE]

From Lemma 1.7, we have $\|\gamma^{\prime}\|_{\infty}\leq c$ and note that from our assumption on the $s$ parameterization, $\|\frac{d}{ds}\gamma\|_{g}=1$ , which combined with Equation (69) finishes the proof. ∎

6.2 High probability bound on norms along the Hamiltonian curve

First, we show a norm bound for the $g$ norm along the Hamiltonian curve, given a bound at initial time.

Recall the ODE related to the RHMC for curve $\gamma$ is

[TABLE]

Opening this up

[TABLE]

First, we show a non-random bound on the norm $\|\gamma^{\prime}\|_{g}$ given a bound at time zero.

Lemma 6.3 (Boundedness of manifold norm along the Hamiltonian curve).

Suppose $\|\gamma^{\prime}(0)\|_{g}\leq\sqrt{n}$ . Then for time $t\leq 1$ we have

[TABLE]

Proof.

Note that

[TABLE]

hence, taking covariant derivative

[TABLE]

where we used Lemma D.23 to bound $\|\mu\|$ . This implies

[TABLE]

Solving this ODE,

[TABLE]

∎

Lemma 6.4 (Stability bound on the infinity norm along the curve).

For a Hamiltonian curve with $\|\gamma^{\prime}(0)\|_{g}\leq\sqrt{n}$ , suppose for a fixed time $t_{1}$ we know $\|s_{\gamma^{\prime}(t_{1})}\|_{\infty}\lesssim c$ . Then for all times $t\in(t_{1}-\frac{1}{(1+\alpha\sqrt{\alpha_{0}})\sqrt{n}},t_{1}+\frac{1}{(1+\alpha\sqrt{\alpha_{0}})\sqrt{n}})$ we have

[TABLE]

Proof.

Consider the Hamiltonian ODE below:

[TABLE]

which implies

[TABLE]

Hence, using Lemma 7.4

[TABLE]

But by Lemma 6.3, an upper bound on the $g$ -norm of $\gamma^{\prime}$ at time zero implies a bound along the whole curve. Combining with Lemma D.23:

[TABLE]

This ODE implies that if at a given time $t$ the infinity norm $\|s_{\gamma^{\prime}}\|_{\infty}$ is bounded by $c$ , then for times within $t\pm\frac{1}{c(1+\alpha\sqrt{\alpha_{0}})\sqrt{n}}$ we have an $O(c)$ bound on the infinity norm, which completes the proof. ∎

Lemma 6.5 (Stability bound on the $g$ -norm along the curve).

For a Hamiltonian curve with $\|\gamma^{\prime}(0)\|_{g}\leq\sqrt{n}$ , suppose for a fixed time $t_{1}$ we know $\|\gamma^{\prime}(t_{1})\|_{g}\lesssim c$ . Then for all times $t\in(t_{1}-\frac{1}{(1+\alpha\sqrt{\alpha_{0}})\sqrt{n}},t_{1}+\frac{1}{(1+\alpha\sqrt{\alpha_{0}})\sqrt{n}})$ we have

[TABLE]

Proof.

Directly from Lemma 6.3. ∎

Lemma 6.6.

Suppose we pick $x$ at random from $e^{-\alpha\phi(x)}$ , then run a Hamiltonian curve starting from $x$ with initial vector $\gamma^{\prime}(0)$ picked according to $\mathcal{N}(0,g^{-1})$ . Then, for any time $t_{1}\in(0,1)$ , with probability at least $1-poly(m)ce^{-\Theta(c^{2})}$ we have

[TABLE]

Proof.

From the property of the Hamiltonian curve, we know the joint density of $(\gamma(t),\gamma^{\prime}(t))$ is $e^{-\alpha\phi(x)}\times\mathcal{N}(0,g^{-1}(x))dxdv$ . Focusing on the probability of $v_{t}=\gamma^{\prime}(t)$ , we see that for each $i$ , $a_{i}^{T}v_{t}$ is a Gaussian distributed variable with variance

[TABLE]

where the inequality follows from Lemma 7.2. Hence, from Gaussian tail bound, for a fixed time $t$ :

[TABLE]

where we note that $\|s_{v_{t}}\|_{\infty}$ is just the maximum of the absolute values of Gaussian random variables, and we applied a union bound over the entries of $s_{v_{t}}$ . Moreover, note that $\|v_{t}\|_{g}$ is a subGaussian random variable with mean $O(\sqrt{n})$ and subGaussian parameter $O(1)$ . Hence

[TABLE]

Next, consider a cover $\mathcal{C}=\{t_{i}\}_{i=1}^{c(1+\alpha\sqrt{\alpha_{0}})\sqrt{n}}$ of equally distant times of the Hamiltonian curve from $t=0$ to $t=1$ . Apply the above argument for all the times in this cover with a union bound on top. This implies with probability at least $1-\text{poly}(m)ce^{-\Theta(c^{2})}$ , we have $\|s_{v_{t}}\|_{\infty}\lesssim c$ for all $t\in\mathcal{C}$ and $\|v_{t}\|_{g}\lesssim c\sqrt{n}$ , where we used the fact that $\alpha\sqrt{\alpha_{0}}=poly(m)$ . Now combining this with Lemmas 6.4 and 6.5 completes the proof. ∎
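The fixed-time step of this argument combines a Gaussian tail bound with a union bound over the $m$ entries of $s_{v_{t}}$ . The sketch below illustrates that combination numerically. It assumes each entry has standard deviation at most `sigma = 1` (a normalization chosen here for illustration; the actual variance bound comes from Lemma 7.2), and it simulates independent entries purely for convenience, since the union bound itself needs no independence.

```python
import numpy as np

def union_tail_bound(m, c, sigma=1.0):
    """Union bound on P(max_i |X_i| > c) for m Gaussians with std at most sigma."""
    return min(1.0, 2.0 * m * np.exp(-c**2 / (2.0 * sigma**2)))

rng = np.random.default_rng(1)
m, c, trials = 50, 3.5, 20_000            # hypothetical sizes for the experiment
samples = rng.standard_normal((trials, m))

# Empirical frequency of the event max_i |X_i| > c
empirical = np.mean(np.max(np.abs(samples), axis=1) > c)
bound = union_tail_bound(m, c)
print(empirical, bound)
```

The bound pays a factor $m$ in front of $e^{-c^{2}/2}$ , which mirrors the $poly(m)$ prefactor in the failure probability of the lemma.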

Next, we present a lemma which shows the existence of nice sets, used in the proof of Theorem 1.1.

Lemma 6.7 (Existence of nice set).

There is a high probability region $S\subset\mathcal{M}$ such that $\pi(S)\geq 1-O(poly(m)e^{-c/2})$ (where recall $\pi(.)$ is the probability distribution of density $e^{-\phi}$ inside the polytope) and for every $x\in S$ , there is a high probability region $Q_{x}$ in the tangent space of $x$ , namely $\mathbb{P}(v_{x}\in Q_{x})\geq 0.999$ such that for all $v_{x}\in Q_{x}$ , the Hamiltonian curve starting from $x$ with initial vector $v_{x}$ is $(c,1)$ -nice, namely for all $0\leq t\leq 1$ :

[TABLE]

Proof.

For every point $x\in\mathcal{M}$ , define $Q_{x}$ to be the set of vectors in its tangent space such that the resulting curve is $c$ -nice up to time $1$ . Define the region $S$ to be the set of points $x$ on $\mathcal{M}$ such that $p_{v_{x}}(Q_{x})\geq 1-0.0005$ , where $p_{v_{x}}$ denotes the density of $\mathcal{N}(0,g^{-1})$ in the tangent space of $x$ (the constant $1-0.0005$ is motivated by the definition of nice sets). Now if it were the case that $\mathbb{P}(S^{c})\geq\text{poly}(m)ce^{-\Theta(c^{2})}$ , then under the joint distribution on $(x,v)$ , there would be a region with probability at least $poly(m)ce^{-\Theta(c^{2})}$ such that the Hamiltonian curve starting from $x$ with initial vector $v$ is not $c$ -nice. But this contradicts Lemma 6.6. ∎
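The contradiction at the end of this proof is an averaging step. As a short worked version, with $x$ and $v$ drawn as in Lemma 6.6 and using only that $\mathbb{P}_{v}(v\notin Q_{x})>0.0005$ for every $x\in S^{c}$ :

```latex
\mathbb{P}_{(x,v)}\big(\text{curve not } c\text{-nice}\big)
= \mathbb{E}_{x}\Big[\mathbb{P}_{v}\big(v\notin Q_{x}\big)\Big]
\ge \mathbb{E}_{x}\Big[\mathbf{1}\{x\in S^{c}\}\,\mathbb{P}_{v}\big(v\notin Q_{x}\big)\Big]
\ge 0.0005\,\mathbb{P}(S^{c}).
```

Hence $\mathbb{P}(S^{c})$ is at most $2000$ times the failure probability of Lemma 6.6, and the constant factor is absorbed into the $poly(m)$ term.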

7 Isoperimetry

In this section, we show the isoperimetry constant corresponding to our barrier, stated in Theorem 1.2.

Proof of Theorem 1.2.

From Lemma 7.3 and the definition of $g$ :

[TABLE]

This means that if we scale the ellipsoid $\{v|\ v^{\top}gv\leq 1\}$ by $\sqrt{pn}(\frac{m}{n})^{\frac{1/p}{2/p+1}}$ then it includes the symmetrized polytope around $x$ , whose unit ball is exactly $\{v|\ \|s_{x,v}\|_{\infty}\leq 1\}$ , i.e.

[TABLE]

On the other hand, from Lemma 7.4 we have

[TABLE]

which implies that the unit ball of the norm, or the Dikin ellipsoid, is contained in the symmetrized polytope around $x$ , i.e.

[TABLE]

Combining the relations (72) and (73) implies that the symmetric self-concordance parameter $\bar{\nu}$ defined in [17] is at most $\bar{\nu}\leq pn(\frac{m}{n})^{\frac{2/p}{2/p+1}}$ , which in turn implies that the distribution $e^{-\alpha\phi}$ has isoperimetry with constant at least $\frac{1}{\sqrt{\bar{\nu}}}\geq\frac{1}{\sqrt{pn}}(\frac{n}{m})^{\frac{1/p}{2/p+1}}$ with respect to the metric $g$ , as desired.

Furthermore, using the Brascamp-Lieb inequality, we know $e^{-\alpha\phi}$ has isoperimetry at least $\sqrt{\alpha}$ on a manifold whose metric is the Hessian of $\phi$ [1]. Combining these two facts completes the proof. ∎

We denote the $i$ th row of the matrix $\mathrm{A}_{x}$ by $a_{i}$ . Note that a bound on the quantity $a_{i}^{\top}g^{-1}a_{i}$ for our metric $g$ enables us to control the infinity norm of $s_{x,v}$ via the following simple application of Cauchy-Schwarz to the $i$ th entry of $s_{x,v}$ :

[TABLE]
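The Cauchy-Schwarz step says that each entry $a_{i}^{\top}v$ of $s_{x,v}$ is at most $\sqrt{a_{i}^{\top}g^{-1}a_{i}}\,\|v\|_{g}$ in absolute value. The following is a small numerical sanity check of this inequality; `A_x`, `g`, and `v` are random placeholders and are not derived from any particular barrier.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 12, 4
A_x = rng.standard_normal((m, n))      # hypothetical rescaled constraint matrix
M = rng.standard_normal((n, n))
g = M @ M.T + np.eye(n)                # hypothetical SPD metric
v = rng.standard_normal(n)

s = A_x @ v                            # entries a_i^T v of s_{x,v}
v_norm_g = np.sqrt(v @ g @ v)          # ||v||_g

# leverage[i] = a_i^T g^{-1} a_i, computed row by row
g_inv_At = np.linalg.solve(g, A_x.T)   # columns are g^{-1} a_i
leverage = np.einsum("ij,ji->i", A_x, g_inv_At)

bound = np.sqrt(np.max(leverage)) * v_norm_g
print(np.max(np.abs(s)), bound)
```

A uniform bound on the quantities $a_{i}^{\top}g^{-1}a_{i}$ therefore converts a $g$ -norm bound into an infinity norm bound.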

However, while we have the following relation

[TABLE]

only considering the $g_{2}$ subpart of our metric $g$ , the quantity $a_{i}^{T}{g_{2}}^{-1}a_{i}$ might be orders of magnitude larger than its counterpart $a_{i}^{\top}(\mathrm{A}_{x}^{\top}\mathbf{W}_{x}^{1-2/p}\mathrm{A}_{x})^{-1}a_{i}$ in Equation (74). This is because, as we state in 2.1,

[TABLE]

but we do not have such spectral bounds between matrices $\mathrm{A}_{x}^{T}W\mathrm{A}_{x}$ and $\mathrm{A}_{x}^{T}W^{1-2/p}\mathrm{A}_{x}$ . In fact, the authors of [19] show that $\mathrm{A}_{x}^{T}W\mathrm{A}_{x}$ and $\mathrm{A}_{x}^{T}W^{1-2/p}\mathrm{A}_{x}$ are spectrally the same up to log factors, as long as $p$ is polylogarithmically large, but here we are not able to work with such large values of $p$ since our infinity norm estimates break for $p\geq 4$ . Nonetheless, we show that adding the log barrier and appropriately rescaling the metric $g$ indeed enables us to bound $a_{i}^{T}g^{-1}a_{i}$ . To prove a bound on $a_{i}^{T}g^{-1}a_{i}$ , we start by comparing the matrix $g^{\prime}\triangleq\mathrm{A}_{x}^{T}\mathbf{W}_{x}\mathrm{A}_{x}+\frac{n}{m}\mathrm{A}_{x}^{\top}\mathrm{A}_{x}$ , which is proportional to the Hessian of the hybrid barrier before scaling by $\alpha_{0}$ , with the matrix $\mathrm{A}_{x}^{\top}\mathbf{W}_{x}^{1-2/p}\mathrm{A}_{x}$ , which then enables us to analyze the quantity $a_{i}^{\top}g^{-1}a_{i}$ via the closed form Equation (74). In the next lemma, we compare these two matrices.

Lemma 7.1 (Löwner comparison with different weighted matrices).

For the PSD matrix $g^{\prime}=\mathrm{A}_{x}^{\top}\mathbf{W}_{x}\mathrm{A}_{x}+\frac{n}{m}\mathrm{A}_{x}^{\top}\mathrm{A}_{x}$ we have

[TABLE]

Proof.

Suppose for a given coefficient $\beta$ we wish to have

[TABLE]

The first thing we notice is that if $w_{i}^{1-2/p}\leq\beta\frac{n}{m}$ , then the inequality is already satisfied. Hence, w.l.o.g we assume

[TABLE]

In this regime of $w_{i}$ , to pick a $\beta$ which satisfies Equation (75), we need to have

[TABLE]

But using Equation (76), it is sufficient to have

[TABLE]

so we need to pick $\beta$ as large as

[TABLE]

which completes the proof. ∎
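The case analysis in this proof reduces the Löwner comparison to a scalar inequality of the form $w_{i}^{1-2/p}\leq\beta(w_{i}+\frac{n}{m})$ for each weight. The sketch below checks such an inequality numerically. It is an illustration under assumptions: the value `beta = (m/n)**(2/p)` is what the two-case argument suggests when worked through by hand, and it is not copied from the statement of the lemma; the matrix `A_x` and the weights `w` are random placeholders rather than actual Lewis weights.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, p = 40, 5, 3.0
A_x = rng.standard_normal((m, n))          # hypothetical rescaled constraint matrix
w = rng.uniform(1e-4, 1.0, size=m)         # hypothetical positive weights

beta = (m / n) ** (2.0 / p)                # assumed constant from the case analysis

# lhs = A^T W^{1-2/p} A and g_prime = A^T W A + (n/m) A^T A
lhs = A_x.T @ ((w ** (1 - 2 / p))[:, None] * A_x)
g_prime = A_x.T @ (w[:, None] * A_x) + (n / m) * (A_x.T @ A_x)

# Scalar inequality behind the Loewner comparison, checked entry by entry
scalar_ok = bool(np.all(w ** (1 - 2 / p) <= beta * (w + n / m)))

# beta * g_prime - lhs should be positive semidefinite
min_eig = np.linalg.eigvalsh(beta * g_prime - lhs).min()
print(scalar_ok, min_eig)
```

Because both matrices are of the form $\mathrm{A}_{x}^{\top}D\mathrm{A}_{x}$ with diagonal $D$ , an entrywise inequality between the diagonals is enough to give the matrix inequality.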

Lemma 7.2 (Taming the hybrid metric).

For the metric of our hybrid barrier before scaling up by $\alpha_{0}$ , i.e. for $g^{\prime\prime}$ defined as

[TABLE]

we have for every $i$ :

[TABLE]

In particular, for the metric $g(x)$ of the hybrid barrier we have

[TABLE]

Proof.

Note that using Lemma 2.1, we have

[TABLE]

Hence, using Lemma 7.1:

[TABLE]

On the other hand,

[TABLE]

Balancing Equations (78) and (79) implies

[TABLE]

Finally, noting the fact that

[TABLE]

the proof is complete. ∎

Finally, using our estimate on $a_{i}^{\top}g^{\prime\prime-1}a_{i}$ in Lemma 7.2, we bound the $g^{\prime\prime}$ norm of an arbitrary vector $v$ :

Lemma 7.3 (Bounding the ellipsoid norm by the infinity norm).

We can bound the metric norm $g$ by the infinity norm $\|.\|_{x,\infty}$ as

[TABLE]

Proof.

Using Lemma 2.1, we have

[TABLE]

and

[TABLE]

Noting the definition of $g^{\prime\prime}$ in Equation (77) completes the proof. ∎

Lemma 7.4 (Bounding infinity norm by the ellipsoidal norm).

Given an arbitrary vector $z\in\mathbb{R}^{n}$ , we have

[TABLE]

Proof.

For all $i$ we have using Lemma 7.2:

[TABLE]

The second inequality follows from the fact that $\|r_{x,z}\|_{\infty}\leq\frac{1}{4/p-1}\|s_{x,z}\|_{\infty}$ from Lemma D.1. ∎

Lemma 7.5 (Infinity norm of random vectors).

For the metric $g$ of our hybrid barrier, given random vector $v\sim\mathcal{N}(0,g^{-1})$ , with high probability we have

[TABLE]

Proof.

Note that $g$ is just a scaled version of $g^{\prime\prime}$ :

[TABLE]

Now computing the variance of the $i$ th entry of $s_{x,v}$ , we observe using Lemma 7.2

[TABLE]

The bound on $\|r_{x,v}\|_{\infty}$ directly follows from the fact that $\|r_{x,v}\|_{\infty}\leq\frac{1}{4/p-1}\|s_{x,v}\|_{\infty}$ using Lemma D.1. ∎

Appendix A Riemannian Geometry

A.1 Basic Manifold Definitions

In this section, we go through some basic definitions in differential geometry that are essential to know in our proofs. A manifold is defined abstractly as a topological space which locally resembles $\mathbb{R}^{n}$ .

Definition 12.

A manifold $\mathcal{M}$ is a topological space such that for each point $p\in\mathcal{M}$ , there exists an open set $U$ around $p$ such that $U$ is homeomorphic to an open set of $\mathbb{R}^{n}$ .

Tangent Space.

For any point $p\in\mathcal{M}$ , one can define the notion of tangent space for $p$ , $T_{p}(\mathcal{M})$ , as the equivalence class of the set of curves $\gamma$ starting from $p$ ( $\gamma(0)=p$ ), where we define two such curves $\gamma_{0}$ and $\gamma_{1}$ to be equivalent if for any function $f$ on the manifold:

[TABLE]

One can define a linear structure on $T_{p}(\mathcal{M})$ , hence it is a vector space. Now given a positive definite quadratic form $g(p)$ on the vector space $T_{p}(\mathcal{M})$ , one can equip the manifold $\mathcal{M}$ with metric $g$ . While the definition of a general manifold is abstract, putting a metric on it allows us to measure lengths, areas, volumes, etc. on the manifold, and do calculus similar to Euclidean space. Next, we define some basic notions regarding manifolds.

Differential.

For a map $f:{\mathcal{M}}\rightarrow\mathcal{N}$ between two manifolds, the differential $df_{p}$ at some point $p\in\mathcal{M}$ is a linear map from $T_{p}(\mathcal{M})$ to $T_{f(p)}(\mathcal{N})$ with the property that for any curve $\gamma(t)$ on $\mathcal{M}$ with $\gamma(0)=p$ , we have

[TABLE]

As a special case, for a function $f$ over the manifold, the differential $df$ at some point $p\in\mathcal{M}$ is a linear functional over $T_{p}(\mathcal{M})$ , i.e. an element of $T^{*}_{p}(\mathcal{M})$ . Testing property (81) on a curve $\gamma_{i}$ with $\frac{d}{dt}\gamma_{i}(0)=\partial x_{i}$ , we see

[TABLE]

We can write $df=\sum_{i}\frac{\partial f}{\partial x_{i}}dx_{i}$ .

Vector field.

A vector field $V$ is a smooth choice of a vector $V(p)\in T_{p}(\mathcal{M})$ in the tangent space for all $p\in\mathcal{M}$ .

Metric and inner product.

A metric is a tensor on the manifold $\mathcal{M}$ which is simply a smooth choice of a symmetric bilinear map over $\mathcal{M}$ . Alternatively, the metric or dot product $\langle,\rangle$ can be seen as a bilinear map over the space of vector fields with the tensorization property, i.e. for vector fields $V,W,Z$ and scalar functions $\alpha,\beta$ over $\mathcal{M}$ :

[TABLE]

A.2 Manifold Derivatives, Geodesics, Parallel Transport

A.2.1 Covariant derivative

Given two vector fields $V$ and $W$ , the covariant derivative, also called the Levi-Civita connection $\nabla_{V}W$ is a bilinear operator with the following properties:

[TABLE]

where $V(\alpha)$ is the action of vector field $V$ on scalar function $\alpha$ . Importantly, the property that differentiates the covariant derivative from other kinds of derivatives over a manifold is that the covariant derivative of the metric is zero, i.e., $\nabla_{V}g=0$ for any vector field $V$ . In other words, we have the following intuitive rule:

[TABLE]

Moreover, the covariant derivative has the property of being torsion free, meaning that for vector fields $W_{1},W_{2}$ :

[TABLE]

where $[W_{1},W_{2}]$ is the Lie bracket of $W_{1},W_{2}$ defined as the unique vector field that satisfies

[TABLE]

for every smooth function $f$ .

In a local chart with variable $x$ , if one represents $V=\sum V^{i}\partial x_{i}$ , where $\partial x_{i}$ are the basis vector fields, and $W=\sum W^{i}\partial x_{i}$ , the covariant derivative is given by

[TABLE]

The Christoffel symbols $\Gamma_{ij}^{k}$ are the representations of the Levi-Civita derivatives of the basis $\{\partial x_{i}\}$ :

[TABLE]

and are given by the following formula:

[TABLE]

Above, $g^{ij}$ refers to the $(i,j)$ entry of the inverse of the metric. In the following lemma, we calculate the Christoffel symbols on a Hessian manifold, where $g=D^{2}\phi$ is the Hessian of a convex function.

Lemma A.1.

On a Hessian manifold with metric $g$ we have

[TABLE]

Proof.

Since the manifold is Hessian, we have

[TABLE]

where $Dg_{ijm}$ is just the notation that we use for Hessian manifolds.

∎
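The key point of Lemma A.1 is that for a Hessian metric the derivative $\partial_{l}g_{ij}$ is symmetric in all three indices, so the three terms of the general Christoffel formula collapse into one. The sketch below checks this numerically. It assumes, for illustration only, the log barrier of a small hypothetical polytope as the potential, approximates the derivatives of $g$ by central finite differences, and compares the general formula against the collapsed expression $\frac{1}{2}g^{kl}\partial_{l}g_{ij}$ .

```python
import numpy as np

A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0], [1.0, 1.0]])
b = np.array([1.0, 1.0, 1.0, 1.0, 1.5])   # hypothetical polytope {x : A x < b}

def metric(x):
    """Hessian of the log barrier phi(x) = -sum_i log(b_i - a_i^T x)."""
    A_x = A / (b - A @ x)[:, None]
    return A_x.T @ A_x

def metric_derivative(x, h=1e-5):
    """dg[i, j, l] approximates the partial derivative of g_ij along x_l."""
    n = x.size
    dg = np.zeros((n, n, n))
    for l in range(n):
        e = np.zeros(n)
        e[l] = h
        dg[:, :, l] = (metric(x + e) - metric(x - e)) / (2 * h)
    return dg

x = np.array([0.1, -0.2])
g_inv = np.linalg.inv(metric(x))
dg = metric_derivative(x)

# General formula: Gamma^k_ij = 1/2 g^{kl} (d_i g_jl + d_j g_il - d_l g_ij)
bracket = np.einsum("jli->ijl", dg) + np.einsum("ilj->ijl", dg) - dg
gamma_general = 0.5 * np.einsum("kl,ijl->kij", g_inv, bracket)

# Collapsed Hessian-manifold expression: Gamma^k_ij = 1/2 g^{kl} d_l g_ij
gamma_hessian = 0.5 * np.einsum("kl,ijl->kij", g_inv, dg)

print(np.max(np.abs(gamma_general - gamma_hessian)))
```

The printed discrepancy is at the level of the finite-difference error, consistent with the total symmetry of the third derivatives of the potential.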

A.2.2 Parallel Transport

The notion of parallel transport of a vector $V$ along a curve $\gamma$ can be generalized from Euclidean space to a manifold. On a manifold, parallel transport is a vector field restricted to $\gamma$ such that $\nabla_{\gamma^{\prime}}(V)=0$ . By this definition, for two parallel transport vector fields $V(t),W(t)$ we have that their dot product $\langle V(t),W(t)\rangle$ is preserved, i.e., $\frac{d}{dt}\langle V(t),W(t)\rangle=0$ .
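The preservation of the inner product follows in one line from the metric compatibility of the covariant derivative stated in Section A.2.1. As a short worked step, for vector fields $V(t),W(t)$ that are parallel along $\gamma$ :

```latex
\frac{d}{dt}\langle V(t),W(t)\rangle
= \langle \nabla_{\gamma^{\prime}}V, W\rangle + \langle V, \nabla_{\gamma^{\prime}}W\rangle
= 0 + 0 = 0 .
```

In particular, taking $W=V$ shows that $\|V(t)\|_{g}$ is constant along the curve, which is the fact used in the proof of Lemma 5.18.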

A.2.3 Geodesic

A geodesic is a curve $\gamma$ on $\mathcal{M}$ that is a “locally shortest path”, i.e., the tangent to the curve is parallel transported along the curve: $\nabla_{\dot{\gamma}}\dot{\gamma}=0$ ( $\dot{\gamma}$ denotes the time derivative of the curve $\gamma$ ). Writing this in a chart, one can see it is a second order nonlinear ODE which locally has a unique solution given the initial location and velocity.

[TABLE]
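To make the geodesic ODE concrete, the sketch below integrates it on a small Hessian manifold and checks that the $g$ -norm of the velocity stays constant, as it must since the tangent is parallel transported. The setup is an assumption made for illustration: the potential is the log barrier of the box $(-1,1)^{2}$ , the acceleration is written in the Hessian-manifold form $\ddot{x}=-\frac{1}{2}g^{-1}\mathrm{D}g[\dot{x}]\dot{x}$ , and the integrator is a plain fourth-order Runge-Kutta scheme.

```python
import numpy as np

A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.ones(4)   # hypothetical box (-1, 1)^2

def rescaled(x):
    """Rows a_i / (b_i - a_i^T x) of the slack-rescaled constraint matrix."""
    return A / (b - A @ x)[:, None]

def metric(x):
    A_x = rescaled(x)
    return A_x.T @ A_x

def geodesic_field(state):
    """Right-hand side of the first-order system for (x, v) with v = dx/dt."""
    x, v = state[:2], state[2:]
    A_x = rescaled(x)
    s_v = A_x @ v
    dg_v_v = 2.0 * A_x.T @ (s_v * s_v)            # Dg[v] v for the log barrier
    accel = -0.5 * np.linalg.solve(A_x.T @ A_x, dg_v_v)
    return np.concatenate([v, accel])

def rk4(field, state, dt, steps):
    for _ in range(steps):
        k1 = field(state)
        k2 = field(state + 0.5 * dt * k1)
        k3 = field(state + 0.5 * dt * k2)
        k4 = field(state + dt * k3)
        state = state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return state

def speed(state):
    """g-norm of the velocity at the current point."""
    x, v = state[:2], state[2:]
    return np.sqrt(v @ metric(x) @ v)

start = np.array([0.1, -0.2, 0.3, 0.4])   # (x_1, x_2, v_1, v_2)
end = rk4(geodesic_field, start, dt=2.5e-4, steps=2000)
print(speed(start), speed(end))
```

The two printed speeds agree up to integration error, and the curve stays strictly inside the box.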

A.2.4 Riemann Tensor

The Riemann tensor is a particular tensor on the manifold which arises from the covariant derivative. In particular, it is a multilinear mapping from $T_{p}(\mathcal{M})\times T_{p}(\mathcal{M})\times T_{p}(\mathcal{M})\rightarrow T_{p}(\mathcal{M})$ defined as

[TABLE]

The Riemann tensor can be calculated in a chart given the following formula:

[TABLE]

In the following Lemma, we calculate the Riemann tensor on a Hessian manifold:

Lemma A.2.

The Riemann tensor is given by

[TABLE]

Proof.

We consider the terms in Equation (85) one by one. For the first term

[TABLE]

Similarly

[TABLE]

Hence

[TABLE]

For the third and fourth terms

[TABLE]

Combining Equations (86) and (88) and plugging into (85) completes the proof. ∎

A.2.5 Ricci tensor

The Ricci tensor is just the trace of the Riemann tensor with respect to the second and third components or first and fourth components, i.e. the trace of the operator $R(.,X)Y$ :

[TABLE]

Equivalently, if $\{e_{i}\}$ is an orthonormal basis in the tangent space, we have

[TABLE]

Lemma A.3 (Form of the Ricci tensor on Hessian manifolds).

On a Hessian manifold, the Ricci tensor is given by

[TABLE]

Proof.

Using the form of Riemann tensor in (85) and the definition of Ricci tensor in (89)

[TABLE]

Therefore, for arbitrary vector $v_{1}$ and $v_{2}$

[TABLE]

∎

A.2.6 Exponential Map

The exponential $\exp_{p}(v)$ at point $p$ is a map from $T_{p}(\mathcal{M})$ to $\mathcal{M}$ , defined as the point obtained on a geodesic starting from $p$ with initial speed $v$ , after time $1$ . We use $\gamma_{t}(x)$ to denote the point after going on a geodesic starting from $x$ with initial velocity $\nabla F$ , after time $t$ .

Lemma A.4 (Commuting derivatives).

Given a family of curves $\gamma_{s}(t)$ for $s\in[0,s^{\prime}]$ and $t\in[0,t^{\prime}]$ , we have

[TABLE]

Proof.

Let $\partial_{s}$ and $\partial_{t}$ be the standard vector fields in the two dimensional $\mathbb{R}^{2}$ space $(t,s)$ . Then, we know

[TABLE]

where $[.,.]$ is the Lie bracket. ∎

A.3 Hessian manifolds

In this work we work with a specific class of manifolds whose metric is imposed by the Hessian of our hybrid barrier. A nice property of Hessian manifolds is that the terms in the Riemann tensor which depend on the second derivative of the metric cancel out, and we end up with just the first derivative and the metric itself. Specifically, for a Hessian manifold, recall from Lemmas A.1, A.2, and A.5 that we have the following equations for the Christoffel symbols, the Riemann tensor, and the Ricci tensor:

[TABLE]

As we mentioned, the change of the determinant of the Jacobian matrices $J^{v_{\gamma_{s}}}_{y}$ regarding the Hamiltonian family $(\gamma_{s}(t))$ between $x_{0}$ and $x_{1}$ is related to the rate of change of the Ricci tensor on the manifold. In Lemma A.5 below, we concretely calculate the Ricci tensor for a Hessian manifold in the Euclidean chart, based on the metric $g$ and its derivatives.

Lemma A.5 (Form of Ricci tensor on Hessian manifolds).

On a Hessian manifold, the Ricci tensor is given by

[TABLE]

We use this formula for the Ricci tensor in Section 5.2 and bound its derivative, in order to bound the rate of change of the pushforward density of RHMC going from $x_{0}$ to $x_{1}$ , in Section 5.2.2. Note that we only need to have a multiplicative control over the change of density of a sampled Gaussian vector on the destination point on the manifold, as we move from $x_{0}$ to $x_{1}$ .

Appendix B Hamiltonian Curves and Fields on Manifold

Here we recall the formulation of the Hamiltonian curve based on covariant differentiation. We start from the definition of the Hamiltonian ODE for the potential $H(x,v)=f(x)+\frac{1}{2}\log((2\pi)^{n}\det g(x))+\frac{1}{2}v^{T}g(x)^{-1}v$ :

[TABLE]

Taking derivative with respect to $t$ from the first Equation and then using the second equation, we get

[TABLE]

which implies

[TABLE]

But the left hand side of Equation (91) is the definition of Christoffel symbols as in Lemma A.1. To see this, note that

[TABLE]

where $\left(g(x)^{-1}\mathrm{D}g(x)[\frac{dx}{dt}]\frac{dx}{dt}\right)_{k}$ is the $k$ th entry of $g(x)^{-1}\mathrm{D}g(x)[\frac{dx}{dt}]\frac{dx}{dt}$ . Moreover

[TABLE]

Hence, from the definition of Christoffel symbols and its expansion in Equation (A.2.1) we see

[TABLE]

where $D_{t}\frac{dx}{dt}=\nabla_{\frac{dx}{dt}}\frac{dx}{dt}$ is covariant differentiation and we look at $\frac{dx}{dt}=\sum_{i=1}^{n}\frac{dx_{i}}{dt}\partial x_{i}$ as a vector in the tangent space of $x$ . We define the right hand side of the above equation as the bias of Hamiltonian Monte Carlo:

[TABLE]
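As a concrete illustration of the Hamiltonian ODE for the potential $H(x,v)$ stated at the start of this appendix, the sketch below integrates $\frac{dx}{dt}=\frac{\partial H}{\partial v}$ , $\frac{dv}{dt}=-\frac{\partial H}{\partial x}$ numerically and checks that $H$ is conserved along the curve. Several choices here are assumptions made only for the example: $f$ is taken to be the log barrier of the box $(-1,1)^{2}$ with $g$ its Hessian, the $x$ -gradient of $H$ is approximated by central finite differences instead of the closed form, and the integrator is a plain fourth-order Runge-Kutta scheme rather than any method analyzed in the paper.

```python
import numpy as np

A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.ones(4)   # hypothetical box (-1, 1)^2

def f(x):
    """Assumed potential: log barrier of the box."""
    return -np.sum(np.log(b - A @ x))

def metric(x):
    A_x = A / (b - A @ x)[:, None]
    return A_x.T @ A_x

def hamiltonian(x, v):
    """H(x, v) = f(x) + 1/2 log((2 pi)^n det g(x)) + 1/2 v^T g(x)^{-1} v."""
    g = metric(x)
    _, logdet = np.linalg.slogdet(g)
    n = x.size
    return f(x) + 0.5 * (n * np.log(2 * np.pi) + logdet) + 0.5 * v @ np.linalg.solve(g, v)

def grad_x(x, v, h=1e-5):
    """Central finite-difference approximation of dH/dx."""
    out = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = h
        out[k] = (hamiltonian(x + e, v) - hamiltonian(x - e, v)) / (2 * h)
    return out

def field(state):
    """dx/dt = dH/dv = g^{-1} v and dv/dt = -dH/dx."""
    x, v = state[:2], state[2:]
    return np.concatenate([np.linalg.solve(metric(x), v), -grad_x(x, v)])

def rk4(state, dt, steps):
    for _ in range(steps):
        k1 = field(state)
        k2 = field(state + 0.5 * dt * k1)
        k3 = field(state + 0.5 * dt * k2)
        k4 = field(state + dt * k3)
        state = state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return state

start = np.array([0.1, -0.2, 0.5, -0.3])   # (x_1, x_2, v_1, v_2)
end = rk4(start, dt=1e-3, steps=500)
h_start = hamiltonian(start[:2], start[2:])
h_end = hamiltonian(end[:2], end[2:])
print(h_start, h_end)
```

Conservation of $H$ is a property of the exact flow; the small drift that remains in the printed values comes from the numerical integrator and the finite-difference gradient.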

Proof of Lemma 1.6.

We start from the ODE of HMC:

[TABLE]

Taking covariant derivative in direction $s$ :

[TABLE]

Now we apply the definition of Riemann tensor. Namely for arbitrary vector fields $X,Y,Z$ , we have

[TABLE]

Setting $X=\partial_{s}\gamma_{s}(t)$ and $Y=\partial_{t}\gamma_{s}(t)$ , we first observe that $[\partial_{s}\gamma_{s}(t),\partial_{t}\gamma_{s}(t)]=0$ because they are just the images under the differential of $\gamma$ of the standard vector fields $\partial_{s}$ and $\partial_{t}$ in $\mathbb{R}^{2}$ , which commute. Applying this above

[TABLE]

But note that because $\partial_{t}\gamma_{s}(t)$ and $\partial_{s}\gamma_{s}(t)$ are the images of the differential of $\gamma_{s}(t)$ applied to $\partial_{t}$ and $\partial_{s}$ , we have

[TABLE]

Applying Equation (93) to Equation (92):

[TABLE]

Noting the definition of the operator $M$ completes the proof. ∎

Appendix C Third order strong self-concordance of the metric

The goal of this section is to prove the following lemma.

Lemma C.1 (Infinity norm Self-concordance for Lewis-p-weight barrier).

The Lewis-p-weights barrier, defined in (3), is third-order strongly self-concordant with respect to the local norm $\|.\|_{x,\infty}$ , i.e., at any point $x$ on the Hessian manifold with metric $g(x)$ given by the Hessian $\nabla^{2}\phi_{1}$ of the Lewis-p-weights barrier $\phi_{1}$ , we have

[TABLE]

We first handle the derivatives in directions $z$ and $u$ of the $(\triangleright 4)$ term in Lemma 3.3. We state the final result regarding the $(\triangleright 4)$ term in Lemma 3.8, which we prove below.

Proof of Lemma 3.8.

The general style of the proof below is that $(\triangleright\triangleright)$ terms are referring to the subterms obtained from differentiating the $(\triangleright 4)$ term by $z$ , which are stated in Lemma 3.4. Note that the $(\triangleright 4)$ term itself is a subterm of the derivative of $g_{1}$ in direction $v$ which is stated in Lemma 3.3.

$(\triangleright 4)$ terms

The first subterm of the $(\triangleright 4)$ term that we consider is the $(\triangleright\triangleright 3)$ term as defined in Lemma 3.4.

$(\triangleright\triangleright 3)$ term

[TABLE]

For the first part (1), using Lemma D.7:

[TABLE]

For the second part (2), note that

[TABLE]

where we are denoting the big chunk in the middle by $\mathfrak{D}$ for simplicity. But combining Lemma D.6 and D.7

[TABLE]

which implies

[TABLE]

Overall, we conclude

[TABLE]

For (3):

[TABLE]

(4) and (5) are similar. Term (7) is also similar to Equation $(\triangleright 4)(\triangleright\triangleright 3)$ after applying Lemma D.14. Next, we move on to $(\triangleright\triangleright 2)$ term.

$(\triangleright\triangleright 2)$ term

[TABLE]

Note that if the derivative in direction $z$ falls on any of ${\mathrm{R}_{x,u}},{\mathrm{R}_{x,v}},\mathbf{W}_{x},$ or $\mathbf{W}_{x}^{\prime}$ , then handling those terms is similar to Equation $(\triangleright 4)(\triangleright\triangleright 2)$ .

term [1] is similar to $(\triangleright 4)(\triangleright\triangleright 1)$ and $(\triangleright 4)(\triangleright\triangleright 2)$ .

term [2] is similar to Equation $(\triangleright 4)(\triangleright\triangleright 1)$ and $(\triangleright 4)(\triangleright\triangleright 2)$ after using Lemma (D.14).

term [3] the first part is similar to Equation $(\triangleright 4)(\triangleright\triangleright 1)$ . For the second part

[TABLE]

which, similarly to $(\triangleright 4)(\triangleright\triangleright 2)$ , can be upper bounded by

[TABLE]

as desired.

term [4] the first part is similar to Equation $(\triangleright 4)(\triangleright\triangleright 7)$ combined with the trick in (95). For the second part:

[TABLE]

term [5] the first part is similar to Equation $(\triangleright 4)(\triangleright\triangleright 1)$ and the second part is similar to (96).

term [6] part 1 is similar to [3] part 2, and part 2 is similar to term $(\triangleright 4)(\triangleright\triangleright 2)$ part 2.

term [7] is similar to $(\triangleright 4)(\triangleright\triangleright 2)$ .

term [8], the first part is similar to $(\triangleright 4)(\triangleright\triangleright 4)$ using the trick in (95). For term [8] second part

[TABLE]

term [9] is similar to what we did for [9]. term [10]: the first part is similar to $(\triangleright 4)(\triangleright\triangleright 6)$ using the trick in (95). For the second part:

[TABLE]

$(\triangleright\triangleright 4)$ term

[TABLE]

term [1] is similar to $(\triangleright 4)(\triangleright\triangleright 2)[8]$ .

term [2] is similar to $(\triangleright 4)(\triangleright\triangleright 3)[3]$ .

term [3] is handled by Lemma D.14.

term [4] first part is similar to $(\triangleright 4)(\triangleright\triangleright 2)[8]$ part 1. term [4] part 2 is similar to $(\triangleright 4)(\triangleright\triangleright 4)$ . term [4] part 3 is similar to $(\triangleright 4)(\triangleright\triangleright 2)[8]$ part 2. For term [4] parts 4 and 5:

[TABLE]

term [5] is similar to $(\triangleright 4)(\triangleright\triangleright 4)$ .

term [6]:

[TABLE]

term [7]: similar to [6].

term [8]:

[TABLE]

$(\triangleright\triangleright 5)$ term

[TABLE]

These terms are similar to $(\triangleright 4)(\triangleright\triangleright 4)$ .

$(\triangleright\triangleright 6)$ term

[TABLE]

These terms are similar to $(\triangleright 4)(\triangleright\triangleright 4)$ .

$(\triangleright\triangleright 7)$ term

[TABLE]

where for simplicity, we have used the $\sum$ notation indicating all possible symmetric combinations of that term with respect to $v$ , $w$ , and $u$ .

term [1]: considering the quadratic form $\mathrm{q}^{\top}(.)\mathrm{q}$ on this term, note that on the left we get $s_{x,\mathrm{q}}^{\top}{\mathrm{S}_{x,u}}\dots s_{x,\mathrm{q}}$ . Now we can just reduce this term to $(\triangleright 4)(\triangleright\triangleright 7)$ to conclude

[TABLE]

term [2]: similar to $(\triangleright 4)(\triangleright\triangleright 2)$ .

term [3:1]: Noting the fact that

[TABLE]

and using Lemma D.7 this term is similar to (24).

term [3:2]: note that this term is equal to

[TABLE]

which is similar to $(\triangleright 4)(\triangleright\triangleright 2)[1]$ .

term [3:3], [3:4], [3:5], [3:6]: similar to $(\triangleright 4)(\triangleright\triangleright 7)$ .

term [4:1] is similar to $(\triangleright 4)(\triangleright\triangleright 2)[7]$ .

term [4:2], [4:4], [4:5] similar to $(\triangleright 4)(\triangleright\triangleright 3)[3]$ .

term [4:3] similar to $(\triangleright 4)(\triangleright\triangleright 2)[8]$

term [5] is similar to [4].

term [6] is also similar to $(\triangleright 4)(\triangleright\triangleright 3)[5]$ and $(\triangleright 4)(\triangleright\triangleright 2)[10]$ .

$(\triangleright\triangleright 8)$ term

[TABLE]

This term is similar to $\mathrm{D}(\triangleright 4)(z)$ as detailed in Lemma 3.4.

$(\triangleright\triangleright 1)$ term

[TABLE]

This term reduces to ones that we have already handled: for the differentiation of any factor with respect to $u$ , we can first take that derivative with respect to $u$ and then take the derivative of $A$ with respect to $z$ , which produces the factor ${\mathrm{S}_{x,z}}$ .

Now based on the form of the metric written in Lemma 3.1, we first focus on the last term $\mathrm{A}_{x}^{\top}\mathbf{P}^{(2)}_{x}\mathbf{G}_{x}^{-1}\mathbf{P}^{(2)}_{x}\mathrm{A}_{x}$ . Note that above, in handling all the derivatives in directions $z$ and $u$ of the $(\triangleright 4)$ term, we have bounded all the third-order derivative terms of $\mathrm{D}^{3}g(u,v,z)$ that have at least one derivative acting on the $\mathbf{P}^{(2)}_{x}$ factors in $\mathrm{A}_{x}^{\top}\mathbf{P}^{(2)}_{x}\mathbf{G}_{x}^{-1}\mathbf{P}^{(2)}_{x}\mathrm{A}_{x}$ . Hence, regarding this term, it remains to take derivatives only with respect to $\mathbf{G}_{x}^{-1}$ and the $\mathrm{A}_{x}$ 's, which we do next. Again, the sums mean that we are considering the terms corresponding to all the permutations of $u,v,z$ for the current term.

[TABLE]

$(\triangleright 3)$ is handled in a similar way as $(\triangleright 4)$ .

To handle the rest of the derivatives more conveniently at this point, we consider the second form of metric in Equation 3.1. First we aim to handle all the possible derivatives in three directions which differentiate the $\mathbf{P}^{(2)}_{x}$ terms at least once. Taking one time derivative in direction $v$ from the $\mathbf{P}^{(2)}_{x}$ term results in term $(\triangleright 4)$ and $(\triangleright 3)$ in Lemma (3.3).

But using Lemmas D.11 and D.9 and a technique similar to the one above, these terms are bounded from above and below by constant multiples of the matrix $\|s_{u}\|_{\infty}\|s_{v}\|_{\infty}\|s_{z}\|_{\infty}\mathrm{A}_{x}^{\top}\mathbf{W}_{x}\mathrm{A}_{x}$ .

Next, we move on to the other terms in the formulation of $g_{2}$ in 3.1, namely $\mathrm{A}_{x}^{\top}\mathbf{W}_{x}\mathrm{A}_{x}$ , $\mathrm{A}_{x}^{\top}\mathbf{\Lambda}_{x}\mathrm{A}_{x}$ , $\mathrm{A}_{x}^{\top}\mathbf{G}_{x}\mathrm{A}_{x}$ , and $\mathrm{A}_{x}^{\top}\mathbf{P}^{(2)}_{x}\mathrm{A}_{x}$ . Third-order self-concordance of $\mathrm{A}_{x}^{\top}\mathbf{W}_{x}\mathrm{A}_{x}$ is a direct consequence of Lemma D.12. The terms $\mathrm{A}_{x}^{\top}\mathbf{\Lambda}_{x}\mathrm{A}_{x}$ and $\mathrm{A}_{x}^{\top}\mathbf{G}_{x}\mathrm{A}_{x}$ are handled by Lemma D.11, and $\mathrm{A}_{x}^{\top}\mathbf{P}^{(2)}_{x}\mathrm{A}_{x}$ is handled by Lemma D.10. ∎

Appendix D Derivative Stability Lemmas

D.1 Infinity norm comparisons

Here we bound the infinity-to-infinity norm $\|.\|_{\infty\rightarrow\infty}$ of the matrix $G^{-1}W$ , a crucial property that we use throughout the proof to derive our derivative estimates with respect to the $\|.\|_{x,\infty}$ norm.

Lemma D.1.

For $y=\mathbf{G}_{x}^{-1}\mathbf{W}_{x}s$ , given any vector $s$ and $p<4$ , we have

[TABLE]

Proof.

Set $\|s\|_{\infty}=\ell$ . Then

[TABLE]

Now suppose $\|y\|_{\infty}\geq\frac{1}{4/p-1}\ell$ , which implies that for the maximizing index $i$ we have

[TABLE]

But note that

[TABLE]

hence

[TABLE]

On the other hand

[TABLE]

The contradiction finishes the proof. ∎
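The bound $\|y\|_{\infty}\leq\frac{1}{4/p-1}\|s\|_{\infty}$ used in the contradiction argument above can be checked numerically on a small instance. The sketch below makes several assumptions that are not restated in this appendix: it takes $\mathbf{G}_{x}=\frac{2}{p}\mathbf{W}_{x}+(1-\frac{2}{p})\mathbf{P}^{(2)}_{x}$ (one linear combination of $\mathbf{W}_{x}$ and $\mathbf{P}^{(2)}_{x}$ consistent with the constant $4/p-1$ for $2<p<4$), it uses a random matrix `A` in place of $\mathrm{A}_{x}$, and it computes Lewis weights with a plain fixed-point iteration. The helper names `lewis_weights` and `g_inverse_w` are hypothetical.

```python
import numpy as np

def lewis_weights(A, p, iters=300):
    """Fixed-point iteration w_i <- (a_i^T (A^T W^{1-2/p} A)^{-1} a_i)^{p/2}."""
    m, n = A.shape
    w = np.full(m, n / m)
    for _ in range(iters):
        M = A.T @ (A * w[:, None] ** (1 - 2 / p))
        tau = np.einsum("ij,ij->i", A @ np.linalg.inv(M), A)
        w = tau ** (p / 2)
    return w

def g_inverse_w(A, p, s):
    """Return y = G^{-1} W s under the ASSUMED form G = (2/p) W + (1 - 2/p) P^{(2)}."""
    w = lewis_weights(A, p)
    B = A * w[:, None] ** (0.5 - 1 / p)
    P = B @ np.linalg.solve(B.T @ B, B.T)   # orthogonal projection, diag(P) = w
    G = (2 / p) * np.diag(w) + (1 - 2 / p) * P ** 2   # P ** 2 is the Schur square
    return np.linalg.solve(G, w * s)
```

For instance, with `p = 3` the constant is $1/(4/3-1)=3$, so `np.max(np.abs(g_inverse_w(A, 3.0, s)))` should not exceed `3 * np.max(np.abs(s))` for any vector `s`.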

D.2 Löwner Inequalities

In this section, we derive important estimates on the derivatives of the fundamental matrix quantities that we defined, such as $\mathbf{G}_{x},\mathbf{\Lambda}_{x},{\mathrm{R}_{x,v}},{\mathrm{S}_{x,v}}$ , and use them in our proof of strong self-concordance.

Lemma D.2.

We have

[TABLE]

Proof.

For the matrix $\mathbf{\tilde{P}}_{x,v}$ we have

[TABLE]

and similarly

[TABLE]

∎

Lemma D.3.

We have

[TABLE]

Proof.

For the first inequality, note that the sum of the entries of the $i$ th row of the matrix $\mathbf{P}^{(2)}_{x}$ is equal to ${\mathbf{W}_{x}}_{ii}$ . Hence, the matrix $\mathbf{W}_{x}-\mathbf{P}^{(2)}_{x}$ is a Laplacian, so it is positive semi-definite. The second inequality follows from the fact that $\mathbf{P}^{(2)}_{x}$ is PSD. For the third inequality, we use the fact that $\mathbf{P}^{(2)}_{x}\preccurlyeq\mathbf{W}_{x}$ :

[TABLE]

∎
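The Laplacian argument in the proof of Lemma D.3 only uses that $\mathbf{P}_{x}$ is an orthogonal projection whose diagonal is $w$. A minimal Python sketch of this fact follows; it assumes a generic full-column-rank matrix `B` (standing in for the reweighted constraint matrix) and sets $w$ to the diagonal of the projection, so it illustrates the mechanism rather than the paper's specific matrices. The function name is hypothetical.

```python
import numpy as np

def laplacian_from_projection(B):
    """Build L = Diag(w) - P^{(2)} for the projection P onto the column space of B.

    Here w = diag(P) and P^{(2)} is the entrywise (Schur) square of P.
    """
    P = B @ np.linalg.solve(B.T @ B, B.T)
    w = np.diag(P).copy()
    L = np.diag(w) - P ** 2
    return L, w
```

Since $P^{2}=P$ and $P$ is symmetric, row $i$ of $P^{(2)}$ sums to $P_{ii}$, so `L` has zero row sums and nonpositive off-diagonal entries, which is exactly the Laplacian structure that gives positive semi-definiteness.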

Lemma D.4.

For the derivatives of $\mathbf{G}_{x}$ and $\mathbf{\Lambda}_{x}$ at some point $x$ we have

[TABLE]

Proof.

Directly from Lemmas D.2 and D.13. ∎

Lemma D.5.

[TABLE]

Proof.

We use the terms of the derivative of $R_{v}$ in direction $z$ (according to Lemma D.14) and differentiate them one by one with respect to $u$ :

[TABLE]

Now from Lemmas D.1 and D.21 and D.13 we have

[TABLE]

The third and fourth terms are similar to the first and second terms, respectively. For the fifth term

[TABLE]

The derivatives of the other terms are handled in a similar way. ∎

Lemma D.6.

For a symmetric matrix $D$ with $-\mathbf{W}_{x}\preccurlyeq D\preccurlyeq\mathbf{W}_{x}$ , we have

[TABLE]

Proof.

For arbitrary vectors $\mathrm{q}_{1},\mathrm{q}_{2}$ , using the inequality $\mathrm{q}_{1}^{\top}D\mathrm{q}_{2}\leq\sqrt{\mathrm{q}_{1}^{\top}\mathbf{W}_{x}\mathrm{q}_{1}}\sqrt{\mathrm{q}_{2}^{\top}\mathbf{W}_{x}\mathrm{q}_{2}}$ with $\mathrm{q}_{1}={\mathrm{R}_{x,v}}\mathrm{q}$ and $\mathrm{q}_{2}=\mathrm{q}$ :

[TABLE]

∎
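The inequality $\mathrm{q}_{1}^{\top}D\mathrm{q}_{2}\leq\sqrt{\mathrm{q}_{1}^{\top}\mathbf{W}_{x}\mathrm{q}_{1}}\sqrt{\mathrm{q}_{2}^{\top}\mathbf{W}_{x}\mathrm{q}_{2}}$ used in this proof is a Cauchy-Schwarz inequality in the $\mathbf{W}_{x}$ inner product. The following Python sketch checks it on random data; the construction of $D$ as $\mathbf{W}^{1/2}C\mathbf{W}^{1/2}$ with $\|C\|_{op}\leq 1$, and the function names, are illustrative assumptions.

```python
import numpy as np

def sandwiched_matrix(w, rng):
    """Symmetric D with -W <= D <= W, built as W^{1/2} C W^{1/2} with ||C||_op = 1."""
    m = w.size
    C = rng.standard_normal((m, m))
    C = (C + C.T) / 2
    C /= np.linalg.norm(C, 2)          # normalize the spectral norm to 1
    r = np.sqrt(w)
    return r[:, None] * C * r[None, :]

def cs_gap(D, w, q1, q2):
    """sqrt(q1^T W q1) * sqrt(q2^T W q2) - q1^T D q2; nonnegative if -W <= D <= W."""
    return np.sqrt(q1 @ (w * q1)) * np.sqrt(q2 @ (w * q2)) - q1 @ D @ q2
```

Writing $\mathrm{q}_{1}^{\top}D\mathrm{q}_{2}=(\mathbf{W}^{1/2}\mathrm{q}_{1})^{\top}C(\mathbf{W}^{1/2}\mathrm{q}_{2})$ makes the bound immediate, and `cs_gap` simply evaluates the slack.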

Lemma D.7.

For diagonal matrices $D_{1},D_{2},D_{3}$ (not necessarily positive) we have

[TABLE]

Proof.

Consider the Cholesky decomposition of $\mathbf{P}_{x}$ :

[TABLE]

Then for the first inequality, note that we can write $\mathbf{P}_{x}D_{1}\mathbf{P}_{x}$ as

[TABLE]

Hence, for arbitrary vector $\ell$ :

[TABLE]

For the second inequality, note that

[TABLE]

which implies

[TABLE]

Therefore

[TABLE]

Now again using Equation (97):

[TABLE]

∎

Lemma D.8.

Given a matrix $-\mathbf{W}_{x}\preccurlyeq D\preccurlyeq\mathbf{W}_{x}$ and arbitrary diagonal matrices $V_{1}$ and $V_{2}$ and arbitrary vector $\ell$ :

[TABLE]

Proof.

Simply by Cauchy-Schwarz:

[TABLE]

∎

Lemma D.9.

For matrices $G$ and $\mathbf{\Lambda}_{x}$ we have

[TABLE]

Proof.

Note that

[TABLE]

But using Lemma D.7:

[TABLE]

On the other hand, note that

[TABLE]

simply by checking the operator norm of LHS. Hence

[TABLE]

On the other hand,

[TABLE]

Therefore, by Schur product theorem

[TABLE]

Moreover,

[TABLE]

and note that

[TABLE]

Hence, by Lemma D.6

[TABLE]

Finally, note that from Lemma D.15:

[TABLE]

All the inequalities that we wrote also hold in the other direction with a negative sign. Combining all the inequalities concludes the proof for $\mathbf{G}_{x}$ . As $\mathbf{\Lambda}_{x}$ is also a linear combination of $W$ and $\mathbf{P}^{(2)}_{x}$ , using the exact same bounds we can obtain the conclusion for $\mathbf{\Lambda}_{x}$ as well. ∎

Lemma D.10.

We have

[TABLE]

Proof.

We have

[TABLE]

Note that from Lemma D.6, a generic term in the above is of the form

[TABLE]

for diagonal matrices $D_{1}$ and $D_{4}$ , such that

[TABLE]

Hence, combining Lemmas D.6 and D.7, we get

[TABLE]

Similarly, we can show

[TABLE]

∎

Lemma D.11.

We have

[TABLE]

Proof.

Directly from Lemmas D.12 and D.10. ∎

Lemma D.12.

We have

[TABLE]

Proof.

For the first term of the first derivative in (100), further taking the derivative with respect to $u$ :

[TABLE]

where we used Lemmas D.21 and D.17. For the second term in (100):

[TABLE]

For the third term:

[TABLE]

where for this term we also used Lemma D.15.

Finally the last term $\mathbf{W}_{x}^{-1}\mathrm{D}(\mathbf{\Lambda}_{x}\mathbf{G}_{x}^{-1}\mathbf{W}_{x}{\mathrm{S}_{x,z}}s_{x,v})(u)$ is exactly similar to the proof of Lemma D.15 for handling $\mathbf{W}_{x}^{-1}\mathrm{D}(\mathbf{W^{\prime}}_{x,v})(z)$ . ∎

Lemma D.13.

We have

[TABLE]

In particular, for random $s_{v}$ we have with high probability

[TABLE]

Moreover

[TABLE]

Proof.

Note that $\mathbf{W^{\prime}}_{x,v}=-2\texttt{Diag}\big(\mathbf{\Lambda}_{x}r_{x,v}\big)$ . Using Lemma D.1, we have $\|r_{x,v}\|_{\infty}\leq\frac{1}{4/p-1}\|s_{x,v}\|_{\infty}$ . Hence, for every $i$ :

[TABLE]

which completes the proof. For random $s_{x,v}$ , just note that

[TABLE]

For the $g$ -norm, we also use Lemma 7.4 to upper bound the infinity norm by the $w$ -norm. ∎

Lemma D.14.

For the derivative of $R_{v}$ in direction $z$ we have

[TABLE]

Proof.

We can write

[TABLE]

But note that from Lemma D.1 we have $\|\mathbf{G}_{x}^{-1}\mathbf{W}_{x}s_{x,v}\|_{\infty}\leq\frac{1}{4/p-1}\|s_{x,v}\|_{\infty}$ and from Lemma D.21 we have

[TABLE]

which completes the proof. ∎

Lemma D.15.

We have

[TABLE]

Proof.

We consider $\texttt{Diag}(\mathrm{D}(w^{\prime})(z)/w)$ :

[TABLE]

Now from Lemmas D.18 and D.1:

[TABLE]

∎

Lemma D.16.

We have

[TABLE]

Proof.

We use the $\sum$ notation below to consider all the permutations among $u$ , $v$ , and $z$ .

[TABLE]

But note that in general for diagonal matrices $D_{1},D_{2}$ we have from Lemmas D.18, D.19, and D.20:

[TABLE]

as the proof of Lemma D.19 can be generalized to arbitrary diagonal matrices $D_{1}$ and $D_{2}$ in place of $R_{z}$ and $R_{u}$ . The proof is complete. ∎

Lemma D.17.

We have

[TABLE]

Proof.

Directly from Lemma D.16, noting the fact that both $G$ and $\Lambda$ are linear combinations of $W$ and $P^{(2)}$ . ∎

Lemma D.18.

We have

[TABLE]

Proof.

[TABLE]

∎

Lemma D.19.

We have

[TABLE]

Proof.

Observe that the 2-norm of the $i$ th row of the matrix $\mathbf{P}_{x}{\mathrm{R}_{x,z}}\mathbf{P}_{x}$ is at most $\|s_{x,z}\|_{\infty}\sqrt{w_{i}}$ . This is because

[TABLE]

Now note that

[TABLE]

∎
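The row-norm observation in this proof holds for any diagonal matrix in place of ${\mathrm{R}_{x,z}}$, as noted in the proof of Lemma D.16. The Python sketch below checks the generic statement that row $i$ of $\mathbf{P}D\mathbf{P}$ has 2-norm at most $\max_{j}|D_{jj}|\sqrt{\mathbf{P}_{ii}}$. It assumes a generic projection built from a random matrix `B` and writes $w_{i}=\mathbf{P}_{ii}$; the function name is hypothetical.

```python
import numpy as np

def row_norm_check(B, d):
    """Return the row 2-norms of P Diag(d) P and the bounds max|d| * sqrt(P_ii)."""
    P = B @ np.linalg.solve(B.T @ B, B.T)           # orthogonal projection
    rows = np.linalg.norm(P @ (d[:, None] * P), axis=1)
    bound = np.max(np.abs(d)) * np.sqrt(np.diag(P))
    return rows, bound
```

The bound follows from $\|e_{i}^{\top}\mathbf{P}D\mathbf{P}\|_{2}\leq\|e_{i}^{\top}\mathbf{P}\|_{2}\|D\|_{op}\|\mathbf{P}\|_{op}$ together with $\|e_{i}^{\top}\mathbf{P}\|_{2}^{2}=\mathbf{P}_{ii}$.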

Lemma D.20.

We have

[TABLE]

Proof.

Note that by Cauchy-Schwarz

[TABLE]

∎

Lemma D.21.

We have

[TABLE]

Proof.

Note that

[TABLE]

Now from Lemma D.18, we know

[TABLE]

Now similar to Lemma D.1, we can show

[TABLE]

On the other hand, note that

[TABLE]

so similarly we can argue

[TABLE]

Finally, as both $\mathbf{G}_{x}$ and $\mathbf{\Lambda}_{x}$ are a combination of $\mathbf{W}_{x}$ and $\mathbf{P}^{(2)}_{x}$ matrices, this completes the proof. ∎

Lemma D.22.

We have

[TABLE]

D.3 Norm of the bias

Lemma D.23.

We have

[TABLE]

Proof.

For the first part

[TABLE]

from Lemma 5.8. For the second part, writing $\texttt{tr}(g^{-1}\mathrm{D}g)$ as an expectation

[TABLE]

we have for independent $v,v^{\prime}\sim\mathcal{N}(0,g^{-1})$ :

[TABLE]

where we used Lemma D.27. This completes the proof. ∎

D.4 Comparison between leverage scores

Lemma D.24.

Let

[TABLE]

Then

[TABLE]

which implies

[TABLE]

Proof.

Simply note that $g\geq(\frac{n}{m})^{2/p}\mathrm{A}_{x}^{\top}\mathbf{W}_{x}^{1-2/p}\mathrm{A}_{x}$ , which implies

[TABLE]

∎

D.5 Norm comparison between covariant and normal derivatives

Lemma D.25.

Given a family of Hamiltonian curves $\gamma_{s}$ in the interval $(0,\delta)$ where $\gamma_{0}$ is $(\delta,c)-$ nice, with $v=v_{s}=\gamma^{\prime}_{s}(t)$ , we have

[TABLE]

Proof.

From Lemma 5.1 we have $R_{1}\leq\sqrt{n}$ along the curve, so by Lemma 23 in [21] (note that the condition $\delta^{2}\lesssim 1/R_{1}$ is satisfied) we get

[TABLE]

But now from Lemma D.26

[TABLE]

As always, our parameterization in $s$ is unit norm, so $\|\frac{d}{ds}\gamma_{s}(t)\|_{g}=1$ , and from the niceness of the curve $\|s_{v}\|_{\infty}\lesssim c$ , which completes the proof. ∎

Lemma D.26.

For a vector field $v$ and arbitrary vector $z$ at a point $x$ , denoting $\mathrm{D}v(z)$ by $v^{\prime}$ , we have

[TABLE]

Proof.

We have

[TABLE]

so

[TABLE]

∎

D.6 Log barrier infinity self-concordance

Proof of Lemma 3.9.

The log barrier metric is

[TABLE]

Its directional derivative is given by

[TABLE]

which can be bounded as

[TABLE]

Similarly, the second and third directional derivatives of $g_{2}$ are given by

[TABLE]

which can be bounded as

[TABLE]

This completes the proof. ∎
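For the first directional derivative, the bound can be verified numerically on a small polytope. The sketch below assumes the convention $\{x: Ax>b\}$ with slacks $s_{x}=Ax-b$, so that $g_{2}=\mathrm{A}_{x}^{\top}\mathrm{A}_{x}$ with $\mathrm{A}_{x}=\mathrm{S}_{x}^{-1}A$ and $\mathrm{D}g_{2}(v)=-2\mathrm{A}_{x}^{\top}\mathrm{S}_{x,v}\mathrm{A}_{x}$ with $s_{x,v}=\mathrm{A}_{x}v$; the matrices `A`, `b` and the function names are illustrative and not taken from the paper.

```python
import numpy as np

# Illustrative polytope {x : A x > b}; the specific values are assumptions.
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0], [1.0, 1.0]])
b = -np.ones(5)

def metric(x):
    """Log-barrier metric g(x) = A_x^T A_x with A_x = S_x^{-1} A."""
    s = A @ x - b
    Ax = A / s[:, None]
    return Ax.T @ Ax

def metric_derivative(x, v):
    """Closed form Dg(v) = -2 A_x^T Diag(s_{x,v}) A_x, returned together with s_{x,v}."""
    s = A @ x - b
    Ax = A / s[:, None]
    s_v = (A @ v) / s
    return -2.0 * Ax.T @ (s_v[:, None] * Ax), s_v
```

Because $\mathrm{D}g_{2}(v)$ is $\mathrm{A}_{x}^{\top}$ times a diagonal matrix with entries in $[-2\|s_{x,v}\|_{\infty},2\|s_{x,v}\|_{\infty}]$ times $\mathrm{A}_{x}$, both $2\|s_{x,v}\|_{\infty}g_{2}\pm\mathrm{D}g_{2}(v)$ are positive semi-definite, which is the infinity-norm self-concordance bound at first order.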

D.7 Other helper Lemmas

Lemma D.27.

For vector $v\sim\mathcal{N}(0,g^{-1})$ , we have with high probability

[TABLE]

Proof.

Directly from Gaussian moment bounds. ∎

Lemma D.28.

For the $p$ -Lewis weights barrier $\phi_{p}=\log\det\mathrm{A}_{x}^{\top}\mathbf{W}_{x}^{1-2/p}\mathrm{A}_{x}$ , we have

[TABLE]

Proof.

The proof is given in [19]. ∎

Lemma D.29.

For any positive integer $n$ , vector $v$ , and matrix $\tilde{g}\preccurlyeq g$ we have

[TABLE]

Proof.

Directly from the fact that if $A\preccurlyeq B$ , then for any matrix $C$ we have $C^{\top}AC\preccurlyeq C^{\top}BC$ . ∎

Lemma D.30.

For operator $g^{-1}\mathrm{D}g(v)g^{-1}\mathrm{D}g(v)$ , we have $\|g^{-1}\mathrm{D}g(v)g^{-1}\mathrm{D}g(v)\ell\|_{g}\leq\|s_{x,v}\|_{\infty}^{2}\|\ell\|_{g}$ .

Proof.

We have

[TABLE]

∎

Lemma D.31.

For vector field $w$ on manifold $\mathcal{M}$ , we have

[TABLE]

Proof.

We have

[TABLE]

where in the last line we used Lemma D.27. ∎

Lemma D.32.

For arbitrary vector field $w$ on $\mathcal{M}$ we have

[TABLE]

Proof.

We can write

[TABLE]

Lemma D.33.

For vector field $w$ we have

[TABLE]

But note that

[TABLE]

Hence

[TABLE]

where we used Lemma D.27 and Lemma 7.4. ∎

Appendix E Remaining Proofs

E.1 Proof of Theorem 2.7

Consider a subset $S\subseteq\mathcal{S}$ with $0.5\geq\pi(S)=s^{\prime}\geq s\geq 2\rho$ . Then, to show a lower bound for $s$ -conductance, we need to lower bound

[TABLE]

where $P(.,.)=\int_{x\in S}\mathcal{T}_{x}(S^{c})\pi(x)dx$ is the probability that we are in the set $S$ and that the next step of the Markov chain escapes $S$ , and $P$ is the probability measure corresponding to $\pi$ . Recall that $\mathcal{T}_{x}(.)$ is the Markov kernel, specifying the distribution of the next step given that we are at point $x$ . Now assume that the conductance bound does not hold, i.e., there exists such an $S$ with

[TABLE]

Note that because the chain is reversible, we have

[TABLE]

and because $\pi(S)\leq 0.5$ , we have

[TABLE]

Next, define the set $\tilde{S}\subseteq S$ to be the points $x$ from which our chance of escaping $S$ is at least $0.01$ . Now if $\pi(\tilde{S})\geq\Delta\psi_{\mathcal{M}}\pi(S)/2$ , then given that we are in $S$ , we have at least $\Delta\psi_{\mathcal{M}}$ chance of escaping $S$ , which contradicts (103). This means

[TABLE]

On the other hand, note that for point $x_{1}$ with $d(x_{1},x_{0})\leq\Delta$ for $x_{0}\in S-\tilde{S}$ , we have

[TABLE]

which means $x_{1}$ cannot be in $S-\tilde{S}$ , hence it should be in $S^{c}$ . Therefore, defining the set $S^{+\Delta}$ as the set of points outside $\tilde{S}$ which are $\Delta$ close to a point in $S-\tilde{S}-\mathcal{M}^{\prime c}$ , we have

[TABLE]

On the other hand, from isoperimetry (because $\pi(S)\leq\frac{1}{2}$ ) and the fact that $\Delta\psi_{\mathcal{M}}\leq 1/2$ we have

[TABLE]

Therefore, from the assumption $s\geq\rho/(8\Delta\psi_{\mathcal{M}})$ :

[TABLE]

which implies from Equations (105) and (106):

[TABLE]

which proves that the conductance is lower bounded by $\Omega(\Delta\psi_{\mathcal{M}})$ .

E.2 Properties of Lewis weights

In this section, we recall some properties of Lewis weights which we use in the proof.

Lemma E.1 (Fixed point property of Lewis weights).

The vector of Lewis weights of the matrix $A_{x}$ is the unique vector $w$ in $\mathbb{R}^{m}_{\geq 0}$ with $W=\texttt{Diag}\big{(}{w}\big{)}$ such that

[TABLE]

where $\sigma(.)$ denotes the leverage scores of the matrix.

Proof.

Recall the definition of Lewis weights as the optimum of the objective in Equation (16). Taking derivative with respect to $W$ , we get

[TABLE]

where $\sigma\triangleq\sigma(W^{1/2-1/p}A_{x})$ is the vector of leverage scores defined as

[TABLE]

∎
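The fixed point property suggests a simple way to compute Lewis weights numerically. The Python sketch below assumes that the fixed-point equation reads $w=\sigma(W^{1/2-1/p}A_{x})$ and uses the equivalent iteration $w_{i}\leftarrow\big(a_{i}^{\top}(A^{\top}W^{1-2/p}A)^{-1}a_{i}\big)^{p/2}$, which is known from the Lewis-weight literature to contract for $p<4$; this iteration is not the procedure of the paper, and the random matrix `A` merely stands in for $A_{x}$.

```python
import numpy as np

def leverage_scores(B):
    """sigma(B)_i = b_i^T (B^T B)^{-1} b_i for a full-column-rank matrix B."""
    return np.einsum("ij,ij->i", B @ np.linalg.inv(B.T @ B), B)

def lewis_weights(A, p, iters=200):
    """Iterate w_i <- (a_i^T (A^T W^{1-2/p} A)^{-1} a_i)^{p/2}, starting from n/m."""
    m, n = A.shape
    w = np.full(m, n / m)
    for _ in range(iters):
        M = A.T @ (A * w[:, None] ** (1 - 2 / p))
        tau = np.einsum("ij,ij->i", A @ np.linalg.inv(M), A)
        w = tau ** (p / 2)
    return w
```

At convergence, `w` agrees with `leverage_scores(A * w[:, None] ** (0.5 - 1 / p))`, and its entries sum to $n$ because leverage scores of a full-column-rank matrix sum to its rank.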

Proof of Lemma 3.1.

The first form of the Lewis weight metric $g_{1}$ directly follows from Equation 5.5 in Lemma 31 of [19]. To see why the second form in Equation (21) holds, note that

[TABLE]

Hence

[TABLE]

which implies

[TABLE]

Plugging Equation (107) into the first form in Equation (20) completes the proof. ∎

Proof of Lemma 3.1.

The first formulation follows from [18]. To show the second formulation, recall the definition of $\mathbf{\Lambda}_{x}$ :

[TABLE]

Plugging the above into the first formulation results in the second formulation. ∎

Proof of Lemma 2.1.

Directly from Lemma 31 in [19]. ∎

Lemma E.2 (Gradient of the $p$ Lewis weights barrier).

The gradient of the $p$ Lewis weights barrier $\phi_{p}$ is given by

[TABLE]

Proof.

Taking directional derivative in direction $v$ , using the chain rule

[TABLE]

But because $w_{x}$ is the maximizer of $\big{(}-\text{logdet}(\mathrm{A}_{x}^{\top}W^{1-2/p}\mathrm{A}_{x})+(1-2/p)\mathbbm{1}^{\top}w\big{)}$ , the second term is zero and the proof is complete. ∎

E.2.1 Proof of Lemma 3.3

To differentiate $g_{1}$ in direction $v$ , we differentiate each of the matrices in the product formula for $g_{1}$ one by one. Starting from $\mathrm{A}_{x}$ , we use the first formulation in Equation (20) and we get the $(\triangleright 1)$ term. Next, differentiating $W$ and $2\mathbf{\Lambda}_{x}$ in $\mathrm{A}_{x}^{\top}(\mathbf{W}_{x}+2\mathbf{\Lambda}_{x})\mathrm{A}_{x}$ we get

[TABLE]

which is the $(\triangleright 2)$ term. Furthermore, differentiating $\mathrm{A}_{x}^{\top}\mathbf{\Lambda}_{x}\mathbf{G}_{x}^{-1}\mathbf{\Lambda}_{x}\mathrm{A}_{x}$ with respect to $\Lambda$ , we get the $(\triangleright 3)$ and $(\triangleright 4)$ terms. Finally, note that the derivative of $G^{-1}$ is:

[TABLE]

Therefore, differentiating the $G^{-1}$ part in $A^{T}\Lambda G^{-1}\Lambda A$ we get the $(\triangleright 5)^{\prime}$ , $(\triangleright 6)$ , and $(\triangleright 7)$ terms.

E.3 Derivative of

Lemma E.3.

The derivative of the term $(\star 5)$ defined in Lemma 3.3, ignoring the constants is equal to

[TABLE]

Appendix F Self-concordance Parameter of $\phi$

Here we provide a bound for the self-concordance parameter of $\phi$ .

Lemma F.1 (Self-concordance parameter of $\phi$ ).

For our hybrid barrier $\phi$ , the self-concordance parameter, defined as

[TABLE]

is bounded by $\alpha_{0}n$ .

Proof.

Note that for the Lewis weights and log barrier parts of the barrier $\phi=\alpha_{0}\phi_{p}+\alpha_{0}\frac{n}{m}\phi_{\ell}$ we can bound the barrier parameter separately as

[TABLE]

Now for the log barrier part, we have

[TABLE]

and for the $p$ Lewis weight barrier part, from Lemmas E.2 and 2.1:

[TABLE]

Combining Equations (108) and (109) completes the proof. ∎
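For the log barrier part, the bound can be illustrated numerically. The sketch below assumes the standard definition of the self-concordance parameter as the supremum of $\nabla\phi^{\top}(\nabla^{2}\phi)^{-1}\nabla\phi$, the convention $\{x:Ax>b\}$, and the unscaled log barrier $\phi_{\ell}(x)=-\sum_{i}\log(a_{i}^{\top}x-b_{i})$; none of these are restated in this appendix, and the function name is hypothetical.

```python
import numpy as np

def log_barrier_parameter_at(A, b, x):
    """Evaluate grad^T hess^{-1} grad for phi(x) = -sum_i log(a_i^T x - b_i).

    With A_x = S_x^{-1} A this equals 1^T P 1 for the projection
    P = A_x (A_x^T A_x)^{-1} A_x^T, hence it is at most m.
    """
    s = A @ x - b
    Ax = A / s[:, None]
    grad = -Ax.sum(axis=0)        # gradient is -A_x^T 1
    hess = Ax.T @ Ax              # Hessian is A_x^T A_x
    return grad @ np.linalg.solve(hess, grad)
```

Since scaling a barrier by a constant $c$ scales this quantity by $c$, a bound of $m$ for $\phi_{\ell}$ turns into $\alpha_{0}n$ after multiplying by the weight $\alpha_{0}\frac{n}{m}$ from the decomposition of $\phi$ above.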

F.1 Iteration complexity of Gaussian Cooling

Proof of Corollary 1.1.1.

First, note that from Lemma F.1, $\phi$ is self-concordant with self-concordance parameter $\nu=\alpha_{0}n$ . The Gaussian cooling schedule introduced by the authors of [21] can be used to relax the requirement of a warm start for our sampling algorithm, and hence to obtain an efficient volume algorithm. The idea is that sampling from Gibbs distributions $e^{-\alpha\phi(x)}$ with smaller variance, i.e., larger $\alpha$ , is easier, so one can start by sampling at a large temperature parameter $\alpha$ and gradually decrease it. The Gaussian cooling of [21] evolves in phases, where in the $i$ th phase it generates $k_{i}$ approximate samples from the density proportional to $e^{-\phi(x)/\sigma_{i}^{2}}$ inside the polytope, where

[TABLE]

and the update rule for $\sigma_{i}$ is

[TABLE]

starting from $\sigma_{0}^{2}=\Theta(\epsilon^{2}n^{-3}\log^{-3}(n/\epsilon))$ until $\sigma$ goes above $\Theta(\frac{\nu}{\epsilon}\log(\frac{n\nu}{\epsilon}))$ . Note that the temperature parameter is given by $\alpha=1/\sigma^{2}$ . Now at each phase $i$ , going from temperature $\sigma_{i}^{2}$ to $\sigma_{i+1}^{2}$ , we have approximate samples from $e^{-\phi(x)/\sigma_{i}^{2}}$ which can be used as warm starts for sampling from $e^{-\phi(x)/\sigma_{i+1}^{2}}$ , especially as $k_{i+1}\leq k_{i}$ . Hence, our main Theorem 1.1 implies that the mixing time of sampling at each phase is of order

[TABLE]

Now in the first case, when $\sigma_{i}^{2}\leq\frac{\nu}{n}=\alpha_{0}$ , we have $\alpha\geq\frac{1}{\alpha_{0}}$ . On the other hand, due to the update rule of $\sigma_{i}$ in this case, it takes $\sqrt{n}$ phases to double $\sigma$ , and in each phase we take $k_{i}=\tilde{\Theta}(\frac{\sqrt{n}}{\epsilon^{2}})$ samples. Hence, the total number of RHMC steps to double $\sigma$ in this case is bounded by

[TABLE]

In the other case when $\sigma_{i}^{2}\geq\frac{\nu}{n}=\alpha_{0}$ , we have $\alpha\leq\frac{1}{\alpha_{0}}$ . Then, the total RHMC steps to double $\sigma$ in this case can be upper bounded after substituting $\nu=n\alpha_{0}$ as

[TABLE]

This means we can calculate the integral of $e^{-\alpha\phi(x)}$ for any $\alpha$ up to a factor of $1\pm\epsilon$ using $\tilde{O}(\frac{n^{4/3}m^{1/3}}{\epsilon^{2}})$ steps of RHMC. Moreover, if we just want to sample from $e^{-\alpha\phi(x)}$ in the polytope, we do not need to take $k_{i}$ samples at phase $i$ but only one sample, so the $\epsilon^{2}$ dependence in the complexity is omitted and we end up with the complexity $\tilde{O}(n^{4/3}m^{1/3})$ for sampling without a warm start. ∎
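The phase count in the first regime can be made concrete with a short Python sketch. The update rule itself is not reproduced in this text, so the code assumes, purely for illustration, a multiplicative rule $\sigma_{i+1}^{2}=\sigma_{i}^{2}(1+1/\sqrt{n})$; this hypothetical stand-in is chosen only because it matches the statement that doubling $\sigma$ takes on the order of $\sqrt{n}$ phases.

```python
import math

def phases_to_double_sigma(n, sigma_sq0=1.0):
    """Count phases until sigma doubles under the ASSUMED rule
    sigma^2 <- sigma^2 * (1 + 1/sqrt(n)).
    """
    target = 4.0 * sigma_sq0          # doubling sigma means quadrupling sigma^2
    sigma_sq, phases = sigma_sq0, 0
    while sigma_sq < target:
        sigma_sq *= 1.0 + 1.0 / math.sqrt(n)
        phases += 1
    return phases
```

Under this assumption the count is roughly $\ln 4\cdot\sqrt{n}$, so multiplying by the $k_{i}$ samples per phase and the per-sample mixing time gives the total number of RHMC steps needed to double $\sigma$.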

Bibliography


[1] Dominique Bakry, Ivan Gentil, Michel Ledoux, et al. Analysis and geometry of Markov diffusion operators, volume 103. Springer, 2014.
[2] W. Ballmann. Riemannian geometry and geometric analysis by J. Jost; Riemannian geometry by P. Petersen; Riemannian geometry by T. Sakai. Bulletin of the American Mathematical Society, 37(4):459–466, 2000.
[3] Jeff Cheeger and David G. Ebin. Comparison theorems in Riemannian geometry, volume 9. North-Holland Amsterdam, 1975.
[4] Sinho Chewi. Log-concave sampling. Book draft available at https://chewisinho.github.io, 2022.
[5] Sinho Chewi, Murat A Erdogdu, Mufan Li, Ruoqi Shen, and Shunshi Zhang. Analysis of Langevin Monte Carlo from Poincaré to Log-Sobolev. In Conference on Learning Theory (COLT), pages 1–2. PMLR, 2022.
[6] Ben Cousins and Santosh Vempala. Gaussian cooling and $O^{*}(n^{3})$ algorithms for volume and Gaussian volume. SIAM Journal on Computing, 47(3):1237–1273, 2018.
[7] Arnak Dalalyan. Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. In Conference on Learning Theory (COLT), pages 678–689. PMLR, 2017.
[8] Manfredo P Do Carmo. Differential geometry of curves and surfaces: revised and updated second edition. Courier Dover Publications, 2016.
