The Riemannian Langevin equation and conic programs

Govind Menon; Tianmin Yu

arXiv:2302.11653·math.PR·February 24, 2023


TL;DR

This paper proposes the Riemannian Langevin equation (RLE), which extends the classical Langevin equation from R^n to a Riemannian manifold, as an alternative to diffusion limits of stochastic gradient descent, and gives explicit formulas for Brownian motion on some fundamental cones.

Contribution

It formulates the Riemannian Langevin equation and derives explicit formulas for Brownian motion on fundamental cones, expanding stochastic analysis on manifolds.

Findings

01 Explicit formulas for Brownian motion on cones

02 Generalization of Langevin dynamics to Riemannian manifolds

03 Framework for analyzing stochastic processes on geometric spaces

Abstract

Diffusion limits provide a framework for the asymptotic analysis of stochastic gradient descent (SGD) schemes used in machine learning. We consider an alternative framework, the Riemannian Langevin equation (RLE), that generalizes the classical paradigm of equilibration in R^n to a Riemannian manifold (M^n, g). The most subtle part of this equation is the description of Brownian motion on (M^n, g). Explicit formulas are presented for some fundamental cones.

Equations

$$x^{k+1}=x^{k}-\gamma_{k}\nabla\varepsilon_{i_{k}}(x^{k}),$$

$$dX_{t}=-\nabla E(X_{t})\,dt+\sqrt{\frac{2}{\beta}}\,dB_{t},$$

$$dX_{t}=-\mathrm{grad}\,E(X_{t})\,dt+dB_{t}^{g,\beta}.$$

$$Lf=-\mathrm{grad}\,E(f)+\frac{1}{\beta}\Delta f,\quad\text{where}\quad\Delta f=\frac{1}{\sqrt{\det g}}\,\partial_{i}\left(\sqrt{\det g}\,g^{ij}\partial_{j}f\right),$$

$$\partial_{t}\rho=\mathrm{div}\left(\rho\,\mathrm{grad}\,F\right),\qquad F=E+\frac{1}{\beta}\log\rho.$$

$$\rho(x)=\frac{1}{Z_{\beta}}e^{-\beta E(x)},\qquad Z_{\beta}=\int_{\mathcal{M}}e^{-\beta E(x)}\sqrt{\det g}\,dx.$$

$$F=\frac{1}{2}\log\det D^{2}F,\qquad x\in K^{o}.$$

$$x(\theta)=\mathrm{argmin}_{s\in C}\left\{F(s)+\theta c^{T}s\right\}.$$

$$\frac{dx}{d\theta}=-\mathrm{grad}_{g}\,(c^{T}x),\qquad g=D^{2}F,\qquad x(0)=\mathrm{argmin}\,F.$$

$$g(Z_{1},Z_{2})=\mathrm{Tr}\left(A_{N}^{-1}(Z_{1})\,Z_{2}\right),\qquad A_{N}(Z)=\frac{1}{N}\sum_{i=1}^{N}(WW^{T})^{\frac{N-i}{N}}Z\,(W^{T}W)^{\frac{i}{N}}.$$

$$f(X_{t})=f(X_{0})+\frac{1}{\beta}\int_{0}^{t}\Delta f(X_{s})\,ds+\text{a local martingale}.$$

$$[f(\boldsymbol{B}^{g,\beta})]_{t}=\frac{1}{\beta}\int_{0}^{t}|\nabla f(\boldsymbol{B}^{g,\beta}_{s})|_{g}^{2}\,ds$$

$$[f_{1}(\boldsymbol{B}^{g,\beta}),f_{2}(\boldsymbol{B}^{g,\beta})]_{t}=\frac{1}{\beta}\int_{0}^{t}(\nabla f_{1}\cdot\nabla f_{2})(\boldsymbol{B}^{g,\beta}_{s})\,ds$$

$$g_{ij}=\frac{\partial^{2}F}{\partial x^{i}\partial x^{j}}.$$

$$\mathrm{grad}\,F=-x,\qquad|\nabla F|^{2}=n\qquad\text{and}\qquad\Delta F=0.$$

$$x^{i}\partial_{i}F(x)=-n.$$

$$(\mathrm{grad}\,F)^{k}=g^{jk}\partial_{j}F=-x^{k}.$$

$$|\nabla F|^{2}=(g^{ij}\partial_{i}F)\,\partial_{j}F=-x^{j}\partial_{j}F=n.$$

$$\begin{aligned}\Delta F&=\mathrm{div}\,(\mathrm{grad}\,F)\\&=-\partial_{i}(x^{i})-x^{i}\partial_{i}\left(\tfrac{1}{2}\log\det g\right)=-n-x^{i}\partial_{i}F=0,\end{aligned}$$

$$F(x)=-\sum_{i=1}^{n}\log x^{i},\qquad g=\sum_{i=1}^{n}\frac{dx^{i}\,dx^{i}}{(x^{i})^{2}}.$$

$$\log x^{i}(\boldsymbol{B}^{g,\beta}_{t})-\log x^{i}(\boldsymbol{B}^{g,\beta}_{0})=\frac{1}{\sqrt{\beta}}B^{i}_{t},\qquad i=1,\ldots,n$$

$$F(x)=-\sum_{i=1}^{n}\log\frac{\sin(\pi x^{i})}{\pi},\qquad g=\sum_{i=1}^{n}\pi^{2}\frac{dx^{i}\,dx^{i}}{\sin^{2}(\pi x^{i})}.$$

$$\log\left(\tan\left(\frac{\pi x^{i}(\boldsymbol{B}^{g,\beta}_{t})}{2}\right)\right)-\log\left(\tan\left(\frac{\pi x^{i}(\boldsymbol{B}^{g,\beta}_{0})}{2}\right)\right)=\frac{1}{\sqrt{\beta}}B^{i}_{t}$$

$$K_{n+1}=\left\{x\in\mathbb{R}^{n+1}\;\middle|\;x^{0}\geq\sqrt{\sum_{i=1}^{n}(x^{i})^{2}}\right\},$$

$F(x)$

$g_{ij}$

$g^{ij}$

$\det(g_{ij})$

$$f_{b}(x)=\frac{\sqrt{n+1}}{2}\log\frac{(b^{T}x)^{2}}{x^{T}Ax},\qquad b\in L_{n+1}^{+}.$$

$$L_{n+1}^{+}=\left\{b\in\mathbb{R}^{n+1}\;\middle|\;b^{0}>0,\ b^{T}Bb=0\right\}.$$

$$f^{0}=\frac{1}{\sqrt{n+1}}F,\qquad\text{and}\qquad f^{i}=f_{b_{i}},\quad i=1,\dots,n.$$


Taxonomy

Topics: Markov Chains and Monte Carlo Methods · Mathematical Biology Tumor Growth · Advanced Thermodynamics and Statistical Mechanics

Full text

Division of Applied Mathematics, Brown University, Providence RI 02912, USA

The Riemannian Langevin equation and conic programs

Govind Menon

Tianmin Yu

Abstract

Diffusion limits provide a framework for the asymptotic analysis of stochastic gradient descent (SGD) schemes used in machine learning. We consider an alternative framework, the Riemannian Langevin equation (RLE), that generalizes the classical paradigm of equilibration in $\mathbb{R}^{n}$ to a Riemannian manifold $(\mathcal{M}^{n},g)$ . The most subtle part of this equation is the description of Brownian motion on $(\mathcal{M}^{n},g)$ . Explicit formulas are presented for some fundamental cones.

Keywords: Stochastic gradient descent, Riemannian Langevin equation.

1 Introduction

1.1 Stochastic gradient descent

Stochastic gradient descent (SGD) schemes in machine learning typically arise as follows. An empirical loss function $E:\mathbb{R}^{n}\to\mathbb{R}$ , for a training parameter $x$ , is defined through a finite sum $E(x)=\frac{1}{N}\sum_{i=1}^{N}\varepsilon_{i}(x)$ , where $\varepsilon_{i}(x)=\varepsilon(x;y_{i},z_{i})$ denotes a loss function evaluated on a finite set of training data $\{(y_{i},z_{i})\}_{i=1}^{N}$ . The loss function is minimized using the stochastic gradient descent scheme

$$x^{k+1}=x^{k}-\gamma_{k}\nabla\varepsilon_{i_{k}}(x^{k}),\tag{1}$$

where $i_{k}$ is chosen randomly from the set $\{1,\ldots,N\}$ and $\gamma_{k}$ is a time step.
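The iteration above can be sketched in a few lines. The following is a minimal illustration, not the scheme of any cited paper: the one-dimensional quadratic losses, the data `ys`, and the step schedule $\gamma_{k}=1/(k+10)$ are all assumptions chosen so that the minimizer of $E$ is simply the mean of the data.

```python
import random


def sgd(grad_terms, x0, step, n_iters, seed=0):
    """Iterate x^{k+1} = x^k - gamma_k * grad eps_{i_k}(x^k), i_k uniform."""
    rng = random.Random(seed)
    x = x0
    for k in range(n_iters):
        i_k = rng.randrange(len(grad_terms))  # random index among the N terms
        x = x - step(k) * grad_terms[i_k](x)
    return x


# Hypothetical data: eps_i(x) = 0.5 * (x - y_i)^2, so that
# E(x) = (1/N) * sum_i eps_i(x) is minimized at the mean of the y_i.
ys = [1.0, 2.0, 3.0, 6.0]
grad_terms = [lambda x, y=y: x - y for y in ys]

x_final = sgd(grad_terms, x0=0.0, step=lambda k: 1.0 / (k + 10), n_iters=20000)
print(x_final)  # close to the mean of ys
```

With a decaying step the iterate settles near the minimizer; with a constant step it would instead fluctuate around it, which is the regime that diffusion limits describe.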

Several variants of SGD have been explored since the classic work of Robbins and Monro [13]. What is different in modern machine learning is the large size of $n$ and $N$ and the protocol for the learning rate $\gamma$ . Diffusion limits of SGD replace the discrete iteration above with stochastic differential equations (SDE); these SDE depend on the manner in which $\gamma_{k}\to 0$ , $n\to\infty$ and $N\to\infty$ . Some examples of this approach are the stochastic modified equation (SME) proposed in [10], the variational analysis using Kullback-Leibler divergence proposed in [11], and homogenized SGD (HSGD) defined in [12].

1.2 Riemannian Langevin equation

SDE limits of SGD schemes begin with an algorithm and study its scaling limits. The approach in this paper is different. We begin with diffusions that extend the classical Langevin equation to a Riemannian setting. The relation to optimization lies in the nature of the underlying Riemannian geometry. Let us first explain the model; we then explain why it is a natural extension of ideas used in classical and modern optimization theory.

Recall that the Langevin equation associated to the loss function $E:\mathbb{R}^{n}\to\mathbb{R}$ , at inverse temperature $\beta>0$ , is formulated mathematically as the Itô SDE

$$dX_{t}=-\nabla E(X_{t})\,dt+\sqrt{\frac{2}{\beta}}\,dB_{t},\tag{2}$$

where $\{B_{t}\}_{t\geq 0}$ denotes standard Brownian motion on $\mathbb{R}^{n}$ . Given an $n$ -dimensional Riemannian manifold $(\mathcal{M},g)$ with metric $g$ and a loss function $E:\mathcal{M}\to\mathbb{R}$ we consider the Riemannian Langevin equation (RLE)

$$dX_{t}=-\mathrm{grad}\,E(X_{t})\,dt+dB_{t}^{g,\beta}.\tag{3}$$

Both the gradient and the Brownian motion at inverse temperature $\beta$ are now computed with respect to the Riemannian metric $g$ . In particular, the Brownian motion $B_{t}^{g,\beta}$ on $(\mathcal{M},g)$ must be defined carefully as discussed below. Let $f\in C^{\infty}(\mathcal{M})$ be a test function. The infinitesimal generator $L$ of the diffusion (3) is

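$$Lf=-\mathrm{grad}\,E(f)+\frac{1}{\beta}\Delta f,\quad\text{where}\quad\Delta f=\frac{1}{\sqrt{\det g}}\,\partial_{i}\left(\sqrt{\det g}\,g^{ij}\partial_{j}f\right)\tag{4}$$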

is the Laplace-Beltrami operator, $\mathrm{grad}$ denotes the gradient with respect to $g$ , and the volume form is computed in coordinates using $\det g=\det(g_{ij})$ .

The Fokker-Planck equation, $\partial_{t}\rho=L^{*}\rho$ , where the dual is with respect to the volume form of $g$ , takes the form

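$$\partial_{t}\rho=\mathrm{div}\left(\rho\,\mathrm{grad}\,F\right),\qquad F=E+\frac{1}{\beta}\log\rho.\tag{5}$$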

The free energy, $F$ , is constant in equilibrium and we find the Gibbs density

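$$\rho(x)=\frac{1}{Z_{\beta}}e^{-\beta E(x)},\qquad Z_{\beta}=\int_{\mathcal{M}}e^{-\beta E(x)}\sqrt{\det g}\,dx.\tag{6}$$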

RLE is a method to study the Gibbs measure associated to $E$ , whereas SGD schemes seek the minimum of $E$ . However, these techniques are closely related. When $E$ has a unique global minimum at $x_{*}\in\mathcal{M}$ , the Gibbs measure concentrates at $x_{*}$ as $\beta\to\infty$ , with rigorous asymptotics provided by large deviations theory. A subtle feature of the metrics arising in optimization is that the volume $\int_{\mathcal{M}}\sqrt{\det g}\,dx$ may be infinite.
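To make the link between the Langevin equation and the Gibbs density concrete, here is a rough Euler-Maruyama sketch in the Euclidean case $\mathcal{M}=\mathbb{R}$ with the assumed loss $E(x)=x^{2}/2$ and noise amplitude $\sqrt{2/\beta}$. For this choice the Gibbs density is Gaussian with variance $1/\beta$, so the time-averaged second moment should be close to $1/\beta$. The step size, run length, and seed are arbitrary illustrative choices.

```python
import math
import random


def langevin_second_moment(beta, dt=0.01, n_steps=200_000, seed=1):
    """Time-average of X_t^2 for dX = -X dt + sqrt(2/beta) dB (Euler-Maruyama)."""
    rng = random.Random(seed)
    noise_scale = math.sqrt(2.0 * dt / beta)  # std of one noise increment
    x, acc = 0.0, 0.0
    for _ in range(n_steps):
        x += -x * dt + noise_scale * rng.gauss(0.0, 1.0)  # drift is -E'(x) = -x
        acc += x * x
    return acc / n_steps


m2 = langevin_second_moment(beta=4.0)
print(m2)  # should be near 1/beta = 0.25, up to sampling and discretization error
```

Increasing `beta` shrinks the second moment, which is the concentration of the Gibbs measure at the minimizer described above.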

1.3 Riemannian geometries in optimization

The framework of RLE provides a natural geometric unity between conic programs and deep learning. What changes is the underlying Riemannian manifold $(\mathcal{M},g)$ . Let us explain this idea through examples.

Bayer and Lagarias systematized the Riemannian geometry discovered by Karmarkar for interior-point methods [3, 9]. We focus on the canonical barrier [7]. Associated to every regular convex cone $K\subset\mathbb{R}^{n}$ is a unique convex function $F$ defined in the interior $K^{o}$ of $K$ such that $F(x)\to+\infty$ as $x\to\partial K$ . This function is the Cheng-Yau solution to the Monge-Ampère equation

$$F=\frac{1}{2}\log\det D^{2}F,\qquad x\in K^{o}.\tag{7}$$

Given a barrier and a vector $c\in\mathbb{R}^{n}$ , the conic program $\min_{x\in C}c^{T}x$ is solved by taking the $\theta\to\infty$ limit of the central path

$$x(\theta)=\mathrm{argmin}_{s\in C}\left\{F(s)+\theta c^{T}s\right\}.\tag{8}$$

Further, $x(\theta)$ above is the solution to the Riemannian gradient flow

$$\frac{dx}{d\theta}=-\mathrm{grad}_{g}\,(c^{T}x),\qquad g=D^{2}F,\qquad x(0)=\mathrm{argmin}\,F.\tag{9}$$

That is, the Hessian of the barrier $F$ provides the underlying Riemannian metric. The canonical barrier has several striking geometric properties [7].
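The relation between the central path and the Riemannian gradient flow can be checked numerically in the simplest case. The sketch below assumes the positive orthant with barrier $F(s)=-\sum_{i}\log s^{i}$ and a cost vector $c$ with positive entries, for which the central path is $x^{i}(\theta)=1/(\theta c_{i})$ and the Hessian metric gives $\mathrm{grad}_{g}(c^{T}x)^{i}=(x^{i})^{2}c_{i}$. Because this barrier has no minimizer on the orthant, the flow is started on the central path at an assumed $\theta_{0}=1$ rather than at $\theta=0$. The function names and the RK4 integrator are illustrative choices.

```python
def central_path(theta, c):
    """Minimizer of -sum(log s_i) + theta * c^T s over the positive orthant."""
    return [1.0 / (theta * ci) for ci in c]


def riemannian_gradient_flow(x0, c, theta0, theta1, n_steps=1000):
    """Integrate dx/dtheta = -g^{-1} c with g = diag(1 / x_i^2) using RK4."""

    def rhs(x):
        return [-(xi * xi) * ci for xi, ci in zip(x, c)]

    h = (theta1 - theta0) / n_steps
    x = list(x0)
    for _ in range(n_steps):
        k1 = rhs(x)
        k2 = rhs([xi + 0.5 * h * k for xi, k in zip(x, k1)])
        k3 = rhs([xi + 0.5 * h * k for xi, k in zip(x, k2)])
        k4 = rhs([xi + h * k for xi, k in zip(x, k3)])
        x = [
            xi + h * (a + 2.0 * b + 2.0 * d + e) / 6.0
            for xi, a, b, d, e in zip(x, k1, k2, k3, k4)
        ]
    return x


c = [1.0, 2.0, 0.5]  # hypothetical cost vector
x_flow = riemannian_gradient_flow(central_path(1.0, c), c, 1.0, 5.0)
x_exact = central_path(5.0, c)
print(x_flow, x_exact)  # the two should agree to integrator accuracy
```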

Riemannian metrics have also been extensively used in geometric deep learning [4]. A model problem that allows a comparison between deep learning and classical optimization is the deep linear network [1, 2, 5]. The training space for a network of depth $N$ is the product space of $d\times d$ matrices $\mathbb{M}_{d}^{N}$ . Given $\mathbf{W}=(W_{N},W_{N-1},\ldots,W_{1})$ the observable is the product $V=W_{N}W_{N-1}\cdots W_{1}$ . Learning problems like matrix completion may be modeled as a Euclidean gradient descent for a cost function $L(\mathbf{W}):=E(V)$ . Then for suitable initial conditions, the Euclidean gradient flow, $\dot{W}_{i}=-\nabla_{W_{i}}L(W)$ , $1\leq i\leq N$ corresponds to the Riemannian gradient flow, $\dot{V}=-\mathrm{grad}_{g}E(V)$ , where the metric $g$ acts by

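$$g(Z_{1},Z_{2})=\mathrm{Tr}\left(A_{N}^{-1}(Z_{1})\,Z_{2}\right),\qquad A_{N}(Z)=\frac{1}{N}\sum_{i=1}^{N}(WW^{T})^{\frac{N-i}{N}}Z\,(W^{T}W)^{\frac{i}{N}}.\tag{10}$$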

In order to explore the nature of the Riemannian Langevin equation in optimization, we must understand Brownian motion on Riemannian manifolds like those above. This is a problem of some depth. We illustrate this by computing explicit expressions for Brownian motion in some fundamental cones, using expressions for the barrier from [6].

2 Brownian motion and conic programs

Manifold-valued Brownian motion may be defined in several ways [8]. We use the following definition in this note: an $\mathcal{M}$ -valued semimartingale $\boldsymbol{X}_{t}$ is called a Brownian motion on $\mathcal{M}$ , with temperature $T=\frac{1}{\beta}$ , if for any $f\in C^{\infty}(\mathcal{M})$ ,

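$$f(\boldsymbol{X}_{t})=f(\boldsymbol{X}_{0})+\frac{1}{\beta}\int_{0}^{t}\Delta f(\boldsymbol{X}_{s})\,ds+\text{a local martingale}.\tag{11}$$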

We denote Brownian motion on $(\mathcal{M},g)$ at temperature $T=\frac{1}{\beta}$ by $\boldsymbol{B}^{g,\beta}_{t}$ .

Proposition 1

The quadratic variation process of $f(\boldsymbol{B}^{g,\beta}_{t})$ for $f\in C^{\infty}(\mathcal{M})$ is

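$$[f(\boldsymbol{B}^{g,\beta})]_{t}=\frac{1}{\beta}\int_{0}^{t}|\nabla f(\boldsymbol{B}^{g,\beta}_{s})|_{g}^{2}\,ds,\tag{12}$$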

where $|\nabla f|^{2}:=g^{ij}\partial_{i}f\partial_{j}f$ .

Corollary 1

The covariation process of $f_{1}(\boldsymbol{B}^{g,\beta}_{t})$ and $f_{2}(\boldsymbol{B}^{g,\beta}_{t})$ for $f_{1},f_{2}\in C^{\infty}(\mathcal{M})$ is

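$$[f_{1}(\boldsymbol{B}^{g,\beta}),f_{2}(\boldsymbol{B}^{g,\beta})]_{t}=\frac{1}{\beta}\int_{0}^{t}(\nabla f_{1}\cdot\nabla f_{2})(\boldsymbol{B}^{g,\beta}_{s})\,ds,\tag{13}$$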

where $\nabla f_{1}\cdot\nabla f_{2}:=g^{ij}\partial_{i}f_{1}\partial_{j}f_{2}$ .

Proposition 1 allows us to analyze Brownian motion through a careful choice of coordinate functions. We will choose $f$ such that $\Delta f=0$ and $|\nabla f|^{2}=1$ . Then $f(\boldsymbol{B}^{g,\beta}_{t})$ is a continuous local martingale with quadratic variation $t/\beta$ , so by Lévy's characterization $\sqrt{\beta}\,(f(\boldsymbol{B}^{g,\beta}_{t})-f(\boldsymbol{B}^{g,\beta}_{0}))$ has the same law as an $\mathbb{R}$ -valued standard Brownian motion.

Let us now assume $\mathcal{M}$ is a regular convex cone $K\subset\mathbb{R}^{n}$ , let $F$ denote its canonical barrier, and equip $K$ with the Hessian metric

$$g_{ij}=\frac{\partial^{2}F}{\partial x^{i}\partial x^{j}}.\tag{14}$$

Theorem 2.1

Consider Brownian motion $\boldsymbol{B}^{g,\beta}_{t}$ on $(K,g)$ . The process $\frac{\sqrt{\beta}}{\sqrt{n}}(F(\boldsymbol{B}^{g,\beta}_{t})-F(\boldsymbol{B}^{g,\beta}_{0}))$ has the same law as a standard Brownian motion on the line.

Proof

We will use the logarithmic homogeneity of $F$ and the Monge-Ampère equation to establish the identities

$$\mathrm{grad}\,F=-x,\qquad|\nabla F|^{2}=n\qquad\text{and}\qquad\Delta F=0.\tag{15}$$

Theorem 2.1 follows immediately from these identities.

The first identity uses logarithmic homogeneity. Since $F(\lambda x)=F(x)-n\log\lambda$ for $\lambda\in\mathbb{R}^{+}$ , $x\in K$ , we may differentiate with respect to $\lambda$ and set $\lambda=1$ to find

$$x^{i}\partial_{i}F(x)=-n.\tag{16}$$

Next, differentiating this identity with respect to $x^{j}$ gives $\partial_{j}F(x)=-x^{i}\partial_{i}\partial_{j}F(x)=-g_{ij}x^{i}$ by equation (14). Thus, each component of the gradient of $F$ is

$$(\mathrm{grad}\,F)^{k}=g^{jk}\partial_{j}F=-x^{k}.\tag{17}$$

This proves the first identity in equation (15). It immediately follows that

$$|\nabla F|^{2}=(g^{ij}\partial_{i}F)\,\partial_{j}F=-x^{j}\partial_{j}F=n.\tag{18}$$

Finally, we show that $\Delta F=0$ as follows

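$$\begin{aligned}\Delta F&=\mathrm{div}\,(\mathrm{grad}\,F)\\&=-\partial_{i}(x^{i})-x^{i}\partial_{i}\left(\tfrac{1}{2}\log\det g\right)=-n-x^{i}\partial_{i}F=0,\end{aligned}\tag{19}$$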

where we have used the Monge-Ampère equation (7) and equation (18).

The above theorem sheds new light on the mysterious reappearance of the Cheng-Yau metric in optimization theory. Let us understand it better with examples.

3 Brownian motion examples

3.1 Positive orthant

Denote by $\mathbb{R}^{n}_{+}=\{x\in\mathbb{R}^{n}|x^{i}>0,i=1,...,n\}$ the positive orthant. The canonical barrier and its Hessian metric are

$$F(x)=-\sum_{i=1}^{n}\log x^{i},\qquad g=\sum_{i=1}^{n}\frac{dx^{i}\,dx^{i}}{(x^{i})^{2}}.\tag{20}$$

Then for each choice of coordinate, we have the identity in law

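$$\log x^{i}(\boldsymbol{B}^{g,\beta}_{t})-\log x^{i}(\boldsymbol{B}^{g,\beta}_{0})=\frac{1}{\sqrt{\beta}}B^{i}_{t},\qquad i=1,\ldots,n,\tag{21}$$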

where $\{B^{i}_{t}\}_{i=1}^{n}$ are $n$ independent standard Brownian motions on $\mathbb{R}$ .
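The identity in law gives an exact sampler for the endpoint of Brownian motion on the positive orthant, and it can be used to check Theorem 2.1 empirically. The sketch below assumes the scaling $\log x^{i}_{t}=\log x^{i}_{0}+B^{i}_{t}/\sqrt{\beta}$, which is the normalization implied by Proposition 1 when $|\nabla\log x^{i}|^{2}=1$. The starting point, $t$, $\beta$, sample size, and seed are arbitrary illustrative values.

```python
import math
import random


def orthant_bm_endpoint(x0, t, beta, rng):
    """Sample B_t on the orthant via log x_t^i = log x_0^i + B_t^i / sqrt(beta)."""
    scale = math.sqrt(t / beta)  # std of B_t^i / sqrt(beta)
    return [xi * math.exp(scale * rng.gauss(0.0, 1.0)) for xi in x0]


def barrier(x):
    """Canonical barrier F(x) = -sum_i log x^i of the positive orthant."""
    return -sum(math.log(xi) for xi in x)


def scaled_barrier_increments(x0, t, beta, n_samples=20000, seed=2):
    """Samples of sqrt(beta / n) * (F(B_t) - F(B_0)), as in Theorem 2.1."""
    rng = random.Random(seed)
    factor = math.sqrt(beta / len(x0))
    f0 = barrier(x0)
    return [
        factor * (barrier(orthant_bm_endpoint(x0, t, beta, rng)) - f0)
        for _ in range(n_samples)
    ]


samples = scaled_barrier_increments([1.0, 2.0, 0.5], t=1.5, beta=3.0)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)  # expected to be near 0 and near t = 1.5 respectively
```

A mean near zero and a variance near $t$ are what a standard Brownian motion on the line would give at time $t$.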

3.2 Cube

Next we consider a convex set, the cube $B_{n}=(0,1)^{n}$ . We find that

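$$F(x)=-\sum_{i=1}^{n}\log\frac{\sin(\pi x^{i})}{\pi},\qquad g=\sum_{i=1}^{n}\pi^{2}\frac{dx^{i}\,dx^{i}}{\sin^{2}(\pi x^{i})}.\tag{22}$$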

Similar calculations yield the identity in law

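$$\log\left(\tan\left(\frac{\pi x^{i}(\boldsymbol{B}^{g,\beta}_{t})}{2}\right)\right)-\log\left(\tan\left(\frac{\pi x^{i}(\boldsymbol{B}^{g,\beta}_{0})}{2}\right)\right)=\frac{1}{\sqrt{\beta}}B^{i}_{t},\tag{23}$$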

where $\{B^{i}_{t}\}_{i=1}^{n}$ are $n$ independent standard Brownian motions on $\mathbb{R}$ .
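Inverting the coordinate function $x\mapsto\log\tan(\pi x/2)$ gives a sampler on the cube as well. The sketch below assumes the same $1/\sqrt{\beta}$ scaling of the driving Brownian motions as in the orthant case; the helper name and test values are hypothetical.

```python
import math
import random


def cube_bm_endpoint(x0, t, beta, rng):
    """Sample B_t on (0,1)^n via log tan(pi x_t / 2) = log tan(pi x_0 / 2) + B_t / sqrt(beta)."""
    scale = math.sqrt(t / beta)
    endpoint = []
    for xi in x0:
        u = math.log(math.tan(0.5 * math.pi * xi)) + scale * rng.gauss(0.0, 1.0)
        # Invert u = log tan(pi x / 2): x = (2 / pi) * atan(exp(u)).
        endpoint.append(2.0 / math.pi * math.atan(math.exp(u)))
    return endpoint


rng = random.Random(0)
start = [0.2, 0.5, 0.9]
points = [cube_bm_endpoint(start, t=2.0, beta=1.0, rng=rng) for _ in range(1000)]
print(points[0])  # every coordinate stays strictly inside (0, 1)
```

Since $\arctan$ maps the positive half-line into $(0,\pi/2)$, the sampled points never leave the open cube, in contrast to Euclidean Brownian motion started at the same point.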

3.3 Lorentz cone

A deeper example is provided by the Lorentz cone

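$$K_{n+1}=\left\{x\in\mathbb{R}^{n+1}\;\middle|\;x^{0}\geq\sqrt{\sum_{i=1}^{n}(x^{i})^{2}}\right\},\tag{24}$$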

where $x=(x^{0},\ldots,x^{n})$ . The canonical barrier $F$ on $K_{n+1}$ is given by

[TABLE]

The metric $g$ , its inverse, and volume form are as follows

[TABLE]

where $B=A^{-1}$ is the inverse of $A$ . In our case we have $B=A$ , but these matrices are conceptually distinct.

We characterize Brownian motion on $K_{n+1}$ using the auxiliary functions

$$f_{b}(x)=\frac{\sqrt{n+1}}{2}\log\frac{(b^{T}x)^{2}}{x^{T}Ax},\qquad b\in L_{n+1}^{+}.$$

Here we have introduced the light-cone

$$L_{n+1}^{+}=\left\{b\in\mathbb{R}^{n+1}\;\middle|\;b^{0}>0,\ b^{T}Bb=0\right\}.$$

Pick $n$ vectors $\{b_{i}\}_{i=1}^{n}\subset L_{n+1}^{+}$ and define the $n+1$ functions

$$f^{0}=\frac{1}{\sqrt{n+1}}F,\qquad\text{and}\qquad f^{i}=f_{b_{i}},\quad i=1,\dots,n.$$

We choose a drift and covariance tensor as follows

[TABLE]

Finally, set $\Sigma^{00}=1$ and $\Sigma^{ij}=0$ when exactly one of the indices is zero.

Theorem 3.1

Denote by $\boldsymbol{B}_{t}$ Brownian motion on $(K_{n+1},g)$ with $\beta=2$ . The stochastic processes $f^{i}_{t}:=f^{i}(\boldsymbol{B}_{t})$ satisfy the Itô SDE

[TABLE]

In particular, each $f^{i}_{t}$ is itself identical in law with a Brownian motion with constant drift.

Proof

We only need to check that

[TABLE]

First, when $i=j=0$ , this is just the claim of Theorem 2.1. When exactly one of the indices is zero, we use equation (15) and the fact that $\frac{(b^{T}x)^{2}}{x^{T}Ax}$ is a homogeneous polynomial of order [math] to obtain

[TABLE]

Finally, consider the case when both $i$ and $j$ are space-like. We start with the following property: for $b,b^{\prime}\in L_{n+1}^{+}$ ,

[TABLE]

Using the fact that $\log(b^{T}x)=\frac{1}{\sqrt{n+1}}(f_{b}-f^{0})$ and $\nabla f_{b}\cdot\nabla f^{0}=0$ , we have

[TABLE]

In particular, we find that $|\nabla f_{b}|^{2}=1$ because $b^{T}Bb=0$ when $b\in L_{n+1}^{+}$ .
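The claim $|\nabla f_{b}|^{2}=1$ can be spot-checked numerically. The sketch below rests on assumptions that are not copied from the paper: it takes $A=\mathrm{diag}(1,-1,\ldots,-1)$, the barrier $F(x)=-\frac{n+1}{2}\log(x^{T}Ax)$ up to an additive constant (which is consistent with the relation $\log(b^{T}x)=\frac{1}{\sqrt{n+1}}(f_{b}-f^{0})$ used above), $f_{b}(x)=\frac{\sqrt{n+1}}{2}\log\frac{(b^{T}x)^{2}}{x^{T}Ax}$, and the inverse Hessian $g^{-1}=\frac{1}{n+1}\left(2xx^{T}-(x^{T}Ax)B\right)$ derived from that assumed barrier. The test vectors are arbitrary.

```python
import math


def lorentz_form(x, y):
    """x^T A y with A = diag(1, -1, ..., -1); here B = A^{-1} equals A."""
    return x[0] * y[0] - sum(a * b for a, b in zip(x[1:], y[1:]))


def grad_norm_sq_f_b(x, b):
    """|grad f_b|^2 = g^{ij} d_i f_b d_j f_b under the assumed barrier."""
    m = len(x)                                # m = n + 1
    q = lorentz_form(x, x)                    # x^T A x, positive in the interior
    bx = sum(bi * xi for bi, xi in zip(b, x))
    a_x = [x[0]] + [-xi for xi in x[1:]]      # the vector A x
    # Differential of f_b: sqrt(n+1) * (b / (b^T x) - A x / (x^T A x)).
    v = [math.sqrt(m) * (bi / bx - ai / q) for bi, ai in zip(b, a_x)]
    # Contract with g^{-1} = (2 x x^T - q B) / (n + 1).
    vx = sum(vi * xi for vi, xi in zip(v, x))
    return (2.0 * vx * vx - q * lorentz_form(v, v)) / m


x = [2.0, 0.5, -0.3]   # interior point of K_3, since 2 > sqrt(0.25 + 0.09)
b = [1.0, 0.6, 0.8]    # on the light cone: b^T B b = 1 - 0.36 - 0.64 = 0
norm_sq = grad_norm_sq_f_b(x, b)
print(norm_sq)  # expected to be 1 up to rounding
```

Moving `b` off the light cone changes the value, which reflects the role of the condition $b^{T}Bb=0$ in the argument.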

The proof of the first identity in equation (31) is a computation:

[TABLE]

Thus, finally we have

[TABLE]

4 Acknowledgements

This work was supported by NSF grant DMS-2107205.

Bibliography

[1] Arora, S., Cohen, N., Hazan, E.: On the optimization of deep networks: Implicit acceleration by overparameterization. In: International Conference on Machine Learning. pp. 244–253. PMLR (2018)
[2] Bah, B., Rauhut, H., Terstiege, U., Westdickenberg, M.: Learning deep linear neural networks: Riemannian gradient flows and convergence to global minimizers. Inf. Inference 11(1), 307–353 (2022). https://doi.org/10.1093/imaiai/iaaa039
[3] Bayer, D., Lagarias, J.C.: Karmarkar’s linear programming algorithm and Newton’s method. Mathematical Programming 50, 291–330 (1991)
[4] Bronstein, M.M., Bruna, J., LeCun, Y., Szlam, A., Vandergheynst, P.: Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 34(4), 18–42 (2017)
[5] Cohen, N., Menon, G., Veraszto, Z.: Deep linear networks for matrix completion – an infinite depth limit (2022). https://doi.org/10.48550/ARXIV.2210.12497, https://arxiv.org/abs/2210.12497
[6] Güler, O.: Barrier functions in interior point methods. Mathematics of Operations Research 21(4), 860–885 (1996)
[7] Hildebrand, R.: Conic optimization: affine geometry of self-concordant barriers and copositive cones. Habilitation à diriger des recherches, Université Grenoble Alpes (Jul 2017), https://hal.science/tel-01570016
[8] Hsu, E.P.: Stochastic analysis on manifolds, Graduate Studies in Mathematics, vol. 38. American Mathematical Society, Providence, RI (2002). https://doi.org/10.1090/gsm/038