The Riemannian Langevin equation and conic programs
Govind Menon, Tianmin Yu

TL;DR
This paper introduces the Riemannian Langevin equation (RLE) as a generalization of stochastic gradient methods to Riemannian manifolds, providing explicit formulas for Brownian motion on cones, advancing understanding of stochastic processes in geometric spaces.
Contribution
It formulates the Riemannian Langevin equation and derives explicit formulas for Brownian motion on fundamental cones, expanding stochastic analysis on manifolds.
Findings
Explicit formulas for Brownian motion on cones
Generalization of Langevin dynamics to Riemannian manifolds
Framework for analyzing stochastic processes on geometric spaces
Abstract
Diffusion limits provide a framework for the asymptotic analysis of stochastic gradient descent (SGD) schemes used in machine learning. We consider an alternative framework, the Riemannian Langevin equation (RLE), that generalizes the classical paradigm of equilibration in R^n to a Riemannian manifold (M^n, g). The most subtle part of this equation is the description of Brownian motion on (M^n, g). Explicit formulas are presented for some fundamental cones.
Division of Applied Mathematics,
Brown University, Providence RI 02912, USA
Email: [email protected], [email protected]
Keywords: Stochastic gradient descent · Riemannian Langevin equation.
1 Introduction
1.1 Stochastic gradient descent
Stochastic gradient descent (SGD) schemes in machine learning typically arise as follows. An empirical loss function $E(x)$, for a training parameter $x \in \mathbb{R}^n$, is defined through a finite sum $E(x) = \frac{1}{N}\sum_{i=1}^{N} \ell(x; \xi_i)$, where $\ell$ denotes a loss function evaluated on a finite set of training data $\{\xi_i\}_{i=1}^{N}$. The loss function is minimized using the stochastic gradient descent scheme
$$x_{k+1} = x_k - \eta\, \nabla_x \ell(x_k; \xi_{i_k}), \tag{1}$$
where $i_k$ is chosen randomly from the set $\{1, \ldots, N\}$ and $\eta$ is a time step.
Several variants of SGD have been explored since the classic work of Robbins and Monro [13]. What is different in modern machine learning is the large size of the parameter dimension $n$ and the number of training data $N$, and the protocol for the learning rate $\eta$. Diffusion limits of SGD replace the discrete iteration above with stochastic differential equations (SDE); these SDE depend on the manner in which $N \to \infty$, $n \to \infty$ and $\eta \to 0$. Some examples of this approach are the stochastic modified equation (SME) proposed in [10], the variational analysis using Kullback-Leibler divergence proposed in [11], and homogenized SGD (HSGD) defined in [12].
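The discrete iteration can be illustrated on a toy problem. The following is a minimal sketch, assuming a one-dimensional parameter, the hypothetical per-sample loss $\frac{1}{2}(x - \xi_i)^2$, and a constant time step; none of these choices come from the paper. For this loss the empirical minimizer is the sample mean, so the iterates should fluctuate around it.

```python
import random


def sgd_mean(data, eta=0.01, steps=5000, seed=0):
    """SGD for the empirical loss (1/N) * sum_i 0.5 * (x - xi_i)**2.

    The per-sample loss and the constant step eta are illustrative assumptions.
    """
    rng = random.Random(seed)
    x = 0.0
    for _ in range(steps):
        xi = rng.choice(data)   # draw one training point uniformly at random
        grad = x - xi           # gradient in x of 0.5 * (x - xi)**2
        x -= eta * grad         # SGD update with time step eta
    return x


data = [1.0, 2.0, 3.0, 6.0]     # toy training set, sample mean 3.0
x_final = sgd_mean(data)
print(x_final)
```

With a constant step the iterate does not converge to the minimizer; it keeps fluctuating with a spread that shrinks as `eta` decreases, which is the regime that diffusion limits describe.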
1.2 Riemannian Langevin equation
SDE limits of SGD schemes begin with an algorithm and study its scaling limits. The approach in this paper is different. We begin with diffusions that extend the classical Langevin equation to a Riemannian setting. The relation to optimization lies in the nature of the underlying Riemannian geometry. Let us first explain the model; we then explain why it is a natural extension of ideas used in classical and modern optimization theory.
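Recall that the Langevin equation associated to the loss function $E$, at inverse temperature $\beta > 0$, is formulated mathematically as the Itô SDE
$$dx_t = -\nabla E(x_t)\,dt + \sqrt{\frac{2}{\beta}}\,dW_t, \tag{2}$$
where $W_t$ denotes standard Brownian motion on $\mathbb{R}^n$. Given an $n$-dimensional Riemannian manifold $M$ with metric $g$ and a loss function $E : M \to \mathbb{R}$ we consider the Riemannian Langevin equation (RLE)
$$dX_t = -\operatorname{grad}_g E(X_t)\,dt + dB^{g,\beta}_t. \tag{3}$$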
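Both the gradient and the Brownian motion at inverse temperature $\beta$ are now computed with respect to the Riemannian metric $g$. In particular, the Brownian motion on $(M, g)$ must be defined carefully as discussed below. Let $\varphi \in C^\infty(M)$ be a test function. The infinitesimal generator of the diffusion (3) with loss function $E$ is
$$\mathcal{L}\varphi = \frac{1}{\beta}\Delta_g \varphi - g\left(\operatorname{grad}_g E, \operatorname{grad}_g \varphi\right), \qquad \Delta_g \varphi = \frac{1}{\sqrt{\det g}}\,\partial_i\left(\sqrt{\det g}\; g^{ij}\,\partial_j \varphi\right). \tag{4}$$
Here $\Delta_g$ is the Laplace-Beltrami operator, $\operatorname{grad}_g$ denotes the gradient with respect to $g$, and the volume form is computed in coordinates using $\sqrt{\det g}$.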
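The Fokker-Planck equation, $\partial_t \rho = \mathcal{L}^* \rho$, where the dual is with respect to the volume form of $g$, takes the form
$$\partial_t \rho = \operatorname{div}_g\left(\rho\, \operatorname{grad}_g\left(E + \frac{1}{\beta}\log \rho\right)\right). \tag{5}$$
The free energy, $E + \frac{1}{\beta}\log\rho$, is constant in equilibrium and we find the Gibbs density
$$\rho_\beta = \frac{1}{Z_\beta}\, e^{-\beta E}, \qquad Z_\beta = \int_M e^{-\beta E}\, d\mathrm{vol}_g. \tag{6}$$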
RLE is a method to study the Gibbs measure associated to the loss function $E$, whereas SGD schemes seek the minimum of $E$. However, these techniques are closely related. When the Gibbs density is normalizable and $E$ has a unique global minimum at a point $x_*$, the Gibbs measure concentrates at $x_*$ as $\beta \to \infty$, with rigorous asymptotics provided by large deviations theory. A subtle feature of the metrics arising in optimization is that the volume may be infinite.
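The link between the Langevin equation and the Gibbs measure can be checked numerically in the flat case. The sketch below assumes the Euclidean line, the hypothetical loss $E(x) = x^2/2$, and the noise amplitude $\sqrt{2/\beta}$, which is the normalization under which $e^{-\beta E}$ is stationary. It uses an Euler-Maruyama discretization, which is a choice made here and not a method taken from the paper. The Gibbs density is then Gaussian with variance $1/\beta$, so the time average of $x^2$ should be close to $1/\beta$.

```python
import math
import random


def langevin_second_moment(beta=4.0, h=0.01, steps=200_000, seed=1):
    """Time-averaged x**2 along an Euler-Maruyama path of dx = -x dt + sqrt(2/beta) dW."""
    rng = random.Random(seed)
    noise = math.sqrt(2.0 * h / beta)   # standard deviation of one noise increment
    x = 0.0
    total = 0.0
    for _ in range(steps):
        x += -x * h + noise * rng.gauss(0.0, 1.0)   # drift -E'(x) = -x, plus noise
        total += x * x
    return total / steps


second_moment = langevin_second_moment()
print(second_moment, "compared with 1/beta =", 1.0 / 4.0)
```

As $\beta$ grows the empirical variance shrinks and the samples concentrate near the minimizer $x_* = 0$, which is the concentration described above.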
1.3 Riemannian geometries in optimization
The framework of RLE provides a natural geometric unity between conic programs and deep learning. What changes is the underlying Riemannian manifold $(M, g)$. Let us explain this idea through examples.
Bayer and Lagarias systematized the Riemannian geometry discovered by Karmarkar for interior-point methods [3, 9]. We focus on the canonical barrier [7]. Associated to every regular convex cone $K \subset \mathbb{R}^n$ is a unique convex function $F$ defined in the interior of $K$ such that $F(x) \to \infty$ as $x \to \partial K$. This function is the Cheng-Yau solution to the Monge-Ampère equation
$$\det D^2 F = e^{2F}. \tag{7}$$
Given a barrier and a vector , the conic program is solved by taking the limit of the central path
[TABLE]
Further, above is the solution to the Riemannian gradient flow
[TABLE]
That is, the Hessian of the barrier provides the underlying Riemannian metric. The canonical barrier has several striking geometric properties [7].
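A toy case makes the relation between the central path and the Hessian-metric gradient flow concrete. The sketch below is an illustration under stated assumptions, not the paper's formulation: it takes the positive orthant with barrier $F(x) = -\sum_i \log x_i$, ignores affine constraints, and uses the linear cost $\langle c, x \rangle$ with $c_i > 0$. With $g = D^2F = \mathrm{diag}(1/x_i^2)$, the Riemannian gradient flow is $\dot{x}_i = -c_i x_i^2$, whose solution is $x_i(t) = 1/(c_i t + 1/x_i(0))$. As $x_i(0) \to \infty$ this becomes $1/(c_i t)$, the minimizer of $t\langle c, x\rangle + F(x)$.

```python
def flow_step(x, c, h):
    """One explicit Euler step of x_i' = -c_i * x_i**2 (Hessian-metric gradient flow)."""
    return [xi - h * ci * xi * xi for xi, ci in zip(x, c)]


def integrate(x0, c, t, n_steps=100_000):
    """Integrate the flow numerically from x0 up to time t."""
    h = t / n_steps
    x = list(x0)
    for _ in range(n_steps):
        x = flow_step(x, c, h)
    return x


def closed_form(x0, c, t):
    """Exact solution x_i(t) = 1 / (c_i * t + 1 / x_i(0))."""
    return [1.0 / (ci * t + 1.0 / xi) for xi, ci in zip(x0, c)]


c = [1.0, 2.0]                   # hypothetical cost vector
x0 = [5.0, 5.0]                  # hypothetical interior starting point
numeric = integrate(x0, c, t=3.0)
exact = closed_form(x0, c, t=3.0)
far_start = closed_form([1e12, 1e12], c, t=3.0)   # approximates 1 / (c_i * t)
print(numeric, exact, far_start)
```

The flow stays in the interior of the orthant for all time and approaches the boundary only as $t \to \infty$, which is the qualitative behaviour expected of an interior-point path.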
Riemannian metrics have also been extensively used in geometric deep learning [4]. A model problem that allows a comparison between deep learning and classical optimization is the deep linear network [1, 2, 5]. The training space for a network of depth $N$ is the product space of matrices $(\mathbb{R}^{d \times d})^N$. Given $\mathbf{W} = (W_N, \ldots, W_1)$ the observable is the product $W = W_N W_{N-1} \cdots W_1$. Learning problems like matrix completion may be modeled as a Euclidean gradient descent for a cost function $E(W)$. Then for suitable initial conditions, the Euclidean gradient flow, $\dot{\mathbf{W}} = -\nabla_{\mathbf{W}} E(W)$, corresponds to the Riemannian gradient flow, $\dot{W} = -\operatorname{grad}_{g^N} E(W)$, where the metric $g^N$ acts by
[TABLE]
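The correspondence between the flow on the factors and a flow on the product can be verified by hand in the simplest case. The sketch below assumes scalar factors ($d = 1$) that are all equal and positive, together with a hypothetical quadratic cost; it is a toy check, not the matrix formula from the paper. If each factor follows $\dot{w}_i = -E'(W)\prod_{j \neq i} w_j$, the chain rule gives $\dot{W} = -N\, W^{2 - 2/N} E'(W)$, a gradient flow for the one-dimensional metric $W^{-(2 - 2/N)}/N$.

```python
def factor_flow_velocity(ws, dE):
    """Return (W, dW/dt) when each scalar factor follows Euclidean gradient flow."""
    W = 1.0
    for w in ws:
        W *= w
    velocity = 0.0
    for i in range(len(ws)):
        partial = 1.0
        for j, w in enumerate(ws):
            if j != i:
                partial *= w             # dW/dw_i is the product of the other factors
        w_dot = -dE(W) * partial         # Euclidean gradient flow on factor i
        velocity += partial * w_dot      # chain rule: dW/dt = sum_i dW/dw_i * dw_i/dt
    return W, velocity


def end_to_end_velocity(W, N, dE):
    """Velocity predicted by the reduced flow dW/dt = -N * W**(2 - 2/N) * E'(W)."""
    return -N * W ** (2.0 - 2.0 / N) * dE(W)


def dE(W):
    return W - 2.0                       # derivative of the toy cost 0.5 * (W - 2)**2


N = 4
W, v_factors = factor_flow_velocity([1.3] * N, dE)
v_end = end_to_end_velocity(W, N, dE)
print(v_factors, v_end)
```

The state-dependent factor $N W^{2 - 2/N}$ is what distinguishes the reduced flow from plain gradient descent on $W$, and it is the scalar shadow of the Riemannian metric on the observable.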
In order to explore the nature of the Riemannian Langevin equation in optimization, we must understand Brownian motion on Riemannian manifolds like those above. This is a problem of some depth. We illustrate this by computing explicit expressions for Brownian motion in some fundamental cones, using expressions for the barrier from [6].
2 Brownian motion and conic programs
Manifold-valued Brownian motion may be defined in several ways [8]. We use the following definition in this note: an $M$-valued semimartingale $X_t$ is called a Brownian motion on $(M, g)$, with temperature $\beta$, if for any $f \in C^\infty(M)$,
[TABLE]
We denote Brownian motion on at temperature by .
Proposition 1
The quadratic variation process of for is
[TABLE]
where .
Corollary 1
The covariation process of and for is
[TABLE]
where .
Proposition 1 allows us to analyze Brownian motion through a careful choice of coordinate functions. We will choose such that and , so that has the same law as an -valued standard Brownian motion.
Let us now assume $M$ is a regular convex cone $K$, let $F$ denote its canonical barrier, and equip $K$ with the Hessian metric
$$g = D^2 F, \qquad g_{ij} = \partial_i \partial_j F. \tag{14}$$
Theorem 2.1
Consider Brownian motion on . The process has the same law as a standard Brownian motion on the line.
Proof
We will use the logarithmic homogeneity of and the Monge-Ampère equation to establish the identities
[TABLE]
Theorem 2.1 follows immediately from these identities.
The first identity uses logarithmic homogeneity. Since for , , we may differentiate with respect to and set to find
[TABLE]
Next, the differential of with respect to is by equation (14). Thus, each component of the gradient of is
[TABLE]
This proves the first identity in equation (15). It immediately follows that
[TABLE]
Finally, we show that as follows
[TABLE]
where we have used the Monge-Ampère equation (7) and equation (18).
The above theorem sheds new light on the mysterious reappearance of the Cheng-Yau metric in optimization theory. Let us understand it better with examples.
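Before turning to the examples, the ingredients of the proof can be sanity-checked numerically on the simplest cone. The sketch below assumes the positive orthant in $\mathbb{R}^3$ with barrier $F(x) = -\sum_i \log x_i$ and the Monge-Ampère normalization $\det D^2F = e^{2F}$; both are assumptions made for illustration. It checks logarithmic homogeneity, the Monge-Ampère relation, that $g(\operatorname{grad}_g F, \operatorname{grad}_g F) = n$, and that $\Delta_g F = 0$ (by central differences applied to the divergence formula).

```python
import math


def F(x):
    """Logarithmic barrier of the positive orthant."""
    return -sum(math.log(xi) for xi in x)


def grad_norm_sq(x):
    """g(grad F, grad F) with dF_i = -1/x_i and inverse metric g^{ii} = x_i**2."""
    return sum((xi * xi) * (1.0 / xi) ** 2 for xi in x)


def laplace_beltrami_F(x, h=1e-5):
    """Delta_g F = (1/sqrt(det g)) * sum_i d_i( sqrt(det g) * g^{ii} * d_i F )."""
    def sqrt_det_g(y):
        return 1.0 / math.prod(y)

    def flux(y, i):
        return sqrt_det_g(y) * (y[i] ** 2) * (-1.0 / y[i])

    total = 0.0
    for i in range(len(x)):
        up = list(x)
        dn = list(x)
        up[i] += h
        dn[i] -= h
        total += (flux(up, i) - flux(dn, i)) / (2.0 * h)   # central difference in x_i
    return total / sqrt_det_g(x)


x = [0.5, 2.0, 3.0]
n = len(x)
lam = 2.5
homogeneity_gap = F([lam * xi for xi in x]) - F(x) + n * math.log(lam)
monge_ampere_gap = math.prod(1.0 / (xi * xi) for xi in x) - math.exp(2.0 * F(x))
print(homogeneity_gap, monge_ampere_gap, grad_norm_sq(x), laplace_beltrami_F(x))
```

A function that is harmonic and has constant gradient norm turns Brownian motion into a one-dimensional martingale with linearly growing quadratic variation, which is the mechanism behind the identification with Brownian motion on the line.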
3 Brownian motion examples
3.1 Positive orthant
Denote by $\mathbb{R}^n_+$ the positive orthant. The canonical barrier and its Hessian metric are
$$F(x) = -\sum_{i=1}^n \log x_i, \qquad g = \sum_{i=1}^n \frac{dx_i^2}{x_i^2}.$$
Then for each choice of coordinate, we have the identity in law
[TABLE]
where $W^i$ are independent standard Brownian motions on $\mathbb{R}$.
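A sample path on the orthant can be generated directly in logarithmic coordinates. The sketch below is a hedged illustration: it assumes the metric $g = \sum_i dx_i^2 / x_i^2$ and takes Brownian motion to have generator $\frac{1}{2}\Delta_g$, so any temperature factor used in the paper would only rescale time. Under these assumptions each $\log x_i$ is harmonic with unit gradient norm, and distinct coordinates have orthogonal gradients, so the $\log X^i_t$ are independent standard Brownian motions and $X^i_t = X^i_0 \exp(W^i_t)$.

```python
import math
import random


def orthant_bm_path(x0, t_final=1.0, n_steps=1000, seed=0):
    """Sample Brownian motion on the positive orthant through log-coordinates."""
    rng = random.Random(seed)
    dt = t_final / n_steps
    logs = [math.log(xi) for xi in x0]
    path = [list(x0)]
    for _ in range(n_steps):
        # each log-coordinate receives an independent Gaussian increment of variance dt
        logs = [value + math.sqrt(dt) * rng.gauss(0.0, 1.0) for value in logs]
        path.append([math.exp(value) for value in logs])
    return path


path = orthant_bm_path([1.0, 2.0, 0.5])
print(path[-1])
```

Because the increments are added to $\log x_i$, the sampled path can never leave the orthant, in contrast to a naive Euler scheme applied to the coordinates $x_i$ themselves.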
3.2 Cube
Next we consider a convex set, the cube . We find that
[TABLE]
Similar calculations yield the identity in law
[TABLE]
where are independent standard Brownian motions on .
3.3 Lorentz cone
A deeper example is provided by the Lorentz cone
[TABLE]
where . The canonical barrier on is given by
[TABLE]
The metric , its inverse, and volume form are as follows
[TABLE]
where is the inverse of . In our case we have , but these matrices are conceptually distinct.
We characterize Brownian motion on using the auxiliary functions
[TABLE]
Here we have introduced the light-cone
[TABLE]
Pick vectors and define the functions
[TABLE]
We choose a drift and covariance tensor as follows
[TABLE]
[TABLE]
Finally, set and when exactly one of the indices is zero.
Theorem 3.1
Denote by Brownian motion on with . The stochastic processes satisfy the Itô SDE
[TABLE]
In particular, each is itself identical in law with a Brownian motion with constant drift.
Proof
We only need to check that
[TABLE]
First, when , this is just the claim of Theorem 2.1. When exactly one of the indices is zero, we use equation (15) and the fact that is a homogeneous polynomial of order [math] to obtain
[TABLE]
Finally, consider the case when both and are space-like. We start with the following property: for ,
[TABLE]
Using the fact that and , we have
[TABLE]
In particular, we find that because when .
The proof of the first identity in equation (31) is a computation:
[TABLE]
Thus, finally we have
[TABLE]
4 Acknowledgements
This work was supported by NSF grant DMS-2107205.
- [1] Arora, S., Cohen, N., Hazan, E.: On the optimization of deep networks: Implicit acceleration by overparameterization. In: International Conference on Machine Learning. pp. 244–253. PMLR (2018)
- [2] Bah, B., Rauhut, H., Terstiege, U., Westdickenberg, M.: Learning deep linear neural networks: Riemannian gradient flows and convergence to global minimizers. Inf. Inference 11(1), 307–353 (2022). https://doi.org/10.1093/imaiai/iaaa039
- [3] Bayer, D., Lagarias, J.C.: Karmarkar's linear programming algorithm and Newton's method. Mathematical Programming 50, 291–330 (1991)
- [4] Bronstein, M.M., Bruna, J., LeCun, Y., Szlam, A., Vandergheynst, P.: Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 34(4), 18–42 (2017)
- [5] Cohen, N., Menon, G., Veraszto, Z.: Deep linear networks for matrix completion – an infinite depth limit (2022). https://doi.org/10.48550/ARXIV.2210.12497, https://arxiv.org/abs/2210.12497
- [6] Güler, O.: Barrier functions in interior point methods. Mathematics of Operations Research 21(4), 860–885 (1996)
- [7] Hildebrand, R.: Conic optimization: affine geometry of self-concordant barriers and copositive cones. Habilitation à diriger des recherches, Université Grenoble Alpes (Jul 2017), https://hal.science/tel-01570016
- [8] Hsu, E.P.: Stochastic analysis on manifolds, Graduate Studies in Mathematics, vol. 38. American Mathematical Society, Providence, RI (2002). https://doi.org/10.1090/gsm/038
