A Latent Variational Framework for Stochastic Optimization

Philippe Casgrain

arXiv:1905.01707·cs.LG·October 29, 2019


TL;DR

This paper introduces a unifying latent variational framework for stochastic optimization, linking it to Bayesian inference and FBSDEs, which encompasses many existing adaptive gradient methods.

Contribution

It presents a novel theoretical framework connecting stochastic optimization algorithms with Bayesian inference and stochastic control, unifying various methods under one approach.

Findings

01

Framework recovers existing adaptive stochastic gradient methods.

02

Establishes a connection between optimization algorithms and Bayesian inference.

03

Uses FBSDEs to analyze and derive stochastic optimization procedures.

Abstract

This paper provides a unifying theoretical framework for stochastic optimization algorithms by means of a latent stochastic variational problem. Using techniques from stochastic control, the solution to the variational problem is shown to be equivalent to that of a Forward Backward Stochastic Differential Equation (FBSDE). By solving these equations, we recover a variety of existing adaptive stochastic gradient descent methods. This framework establishes a direct connection between stochastic optimization algorithms and a secondary Bayesian inference problem on gradients, where a prior measure on noisy gradient observations determines the resulting algorithm.

Equations

f(x)=\frac{1}{\lvert\mathfrak{N}\rvert}\sum_{z\in\mathfrak{N}}\ell(x;z)\;,

g_{t}=\frac{1}{\lvert\mathfrak{N}_{t}^{m}\rvert}\sum_{z\in\mathfrak{N}_{t}^{m}}\nabla_{x}\ell(x_{t};z)\;,

\nu\in\mathcal{A}:=\left\{\omega=(\omega_{t})_{t\geq 0}\;:\;\omega\text{ is }\mathcal{F}\text{-adapted},\;\;\mathbb{E}\int_{0}^{T}\lVert\omega_{t}\rVert^{2}+\lVert\nabla f(X_{t}^{\omega})\rVert^{2}\,dt<\infty\right\}\;.

D_{h}(y,x)=h(y)-h(x)-\langle\nabla h(x),y-x\rangle

\mathcal{L}(t,X,\nu)=e^{\gamma_{t}}\Big(\underbrace{e^{\alpha_{t}}D_{h}(X+e^{-\alpha_{t}}\nu,X)}_{\text{Kinetic Energy}}-\underbrace{e^{\beta_{t}}\left(f(X)-f(x^{\star})\right)}_{\text{Potential Energy}}\Big)\;,

\mathcal{J}(\nu)=\mathbb{E}{\Big{[}}\;\underbrace{\int_{0}^{T}\mathcal{L}(t,X_{t}^{\nu},\nu_{t})\,dt}_{\text{Total Path Energy}}+\underbrace{\vphantom{\int_{0}}e^{\delta_{T}}\left(\vphantom{\sum}f(X_{T}^{\nu})-f(x^{\star})\right)}_{\text{Soft End Point Constraint}}{\Big{]}}\;,


\nu^{\ast}=\arg\min_{\nu\in\mathcal{A}}\mathcal{J}(\nu)\;.

\tilde{\mathcal{L}}(t,X,\nu)=e^{\gamma_{t}}\left(e^{\alpha_{t}}D_{h}(X+e^{-\alpha_{t}}\nu,X)-e^{\beta_{t}}f(X)\right)

d\left(\frac{\partial\mathcal{L}}{\partial\nu}\right)_{t}=\mathbb{E}\left[\left(\frac{\partial\mathcal{L}}{\partial X}\right)_{t}{\Big{\lvert}}{\mathcal{F}}_{t}\right]\,dt+d{\mathcal{M}}_{t}\;\;\forall t<T\,,\;\;\left(\frac{\partial\mathcal{L}}{\partial\nu}\right)_{T}=-e^{\delta_{T}}\,\mathbb{E}\left[\nabla f(X_{T}){\Big{\lvert}}{\mathcal{F}}_{T}\right]\;,


\left(\frac{\partial\mathcal{L}}{\partial X}\right)_{t}

\left(\frac{\partial\mathcal{L}}{\partial\nu}\right)_{t}

{\mathcal{M}}_{t}=\mathbb{E}\left[\int_{0}^{T}\left(\frac{\partial\mathcal{L}}{\partial X}\right)_{u}\,du-e^{\delta_{T}}\,\nabla f(X_{T}){\Big{\lvert}}{\mathcal{F}}_{t}\right]\;.


\mathcal{E}_{t}=D_{h}\left(x^{\star},X_{t}^{\nu^{\ast}}+e^{-\alpha_{t}}\nu_{t}\right)+e^{\beta_{t}}\left(f(X_{t}^{\nu^{\ast}})-f(x^{\star})\right)-\left[\nabla h(X^{\nu^{\ast}}+e^{-\alpha_{t}}\nu),\,X^{\nu^{\ast}}+e^{-\alpha_{t}}\nu\right]_{t}\;,

\mathbb{E}\left[f(X_{t})-f(x^{\star})\right]=O\left(e^{-\beta_{t}}\max\left\{1,\mathbb{E}\left[\,[e^{-\gamma_{t}}\mathcal{M}]_{t}\right]\right\}\right)\;,

dX_{t}=e^{\alpha_{t}}\left(\nabla h^{\ast}\left(\nabla h(X_{t})-\tilde{\Phi}_{t}(1+\rho^{2})^{-1}g_{t}\right)-X_{t}^{\nu^{\ast}}\right)dt\;,

X_{t_{k+1}}=\nabla h^{\ast}\left(\nabla h(X_{t_{k}})-\tilde{\Phi}_{t_{k}}g_{t_{k}}\right)\;,

d\hat{y}_{i,t}=-A\,\hat{y}_{i,t}\,dt+\sigma^{-1}\bar{P}_{t}\,b\,d\hat{B}_{i,t}\;,\quad\dot{\bar{P}}=-A\bar{P}_{t}-\bar{P}_{t}^{\intercal}A-\sigma^{-2}\bar{P}_{t}\,b\,b^{\intercal}\bar{P}_{t}^{\intercal}+LL^{\intercal}\;,

dX_{t}=e^{\alpha_{t}}\left(\nabla h^{\ast}\left(\nabla h(X_{t})-\sum_{j=1}^{\tilde{d}}\tilde{\Phi}_{j,t}\,\hat{y}_{\cdot,j,t}\right)-X_{t}^{\nu^{\ast}}\right)dt\;,

y_{i,t_{k+1}}=(I-e^{-\alpha_{t_{k}}}A)\,y_{i,t_{k}}+L\,e^{-\alpha_{t}}w_{i,k}\;,\quad g_{i,t_{k}}=b^{\intercal}y_{i,t_{k}}+\sigma e^{-\alpha_{t}}\xi_{i,k}\;,

X_{t_{k+1}}=\nabla h^{\ast}\left(\nabla h(X_{t_{k}})-\sum_{j=1}^{\tilde{d}}\tilde{\Phi}_{j,t_{k}}\,\hat{y}_{\cdot,j,k}\right)\;,

X_{t_{k+1}}=\nabla h^{\ast}\left(\nabla h(X_{t_{k}})-\sum_{j=1}^{\tilde{d}}\tilde{\Phi}_{j,t_{k}}\,\hat{y}_{\cdot,j,k}\right)\;,\quad\hat{y}_{i,\cdot,k}=(\tilde{A}-K_{\infty}b^{\intercal}\tilde{A})\,\hat{y}_{i,\cdot,k}+K_{\infty}\,g_{i,k}\;,

X_{t_{k+1}}-X_{t_{k}}=-\tilde{\Phi}_{t_{k}}\hat{y}_{k}\;,\quad\hat{y}_{i,k}=p_{1}\hat{y}_{k}+p_{2}\,g_{t_{k}}\;,

p_{t}=\left(\frac{\partial\mathcal{L}}{\partial\nu}\right)_{t}=e^{\gamma_{t}}\left(\nabla h(X_{t}^{\nu^{\ast}}+e^{-\alpha_{t}}\nu^{\ast})-\nabla h(X_{t}^{\nu^{\ast}})\right)\;.

\nu^{\ast}=e^{\alpha_{t}}\left(\nabla h^{\ast}\left(\nabla h(X_{t})+e^{-\gamma_{t}}p_{t}\right)-X_{t}\right)\;.

\left\{\begin{aligned} dp_{t}&=-\left\{e^{\gamma_{t}+\alpha_{t}+\beta_{t}}\mathbb{E}\left[\nabla f(X_{t}^{\nu^{\ast}}){\big{\lvert}}{\mathcal{F}}_{t}\right]+\left(e^{\gamma_{t}}\nabla^{2}h(X_{t})\,\nu^{\ast}_{t}-e^{\alpha_{t}}p_{t}\right)\right\}\,dt+d\mathcal{M}_{t}\\ p_{T}&=-e^{\delta_{T}}\mathbb{E}\left[\nabla f(X_{T}^{\nu^{\ast}}){\big{\lvert}}{\mathcal{F}}_{T}\right]\end{aligned}\right.


dX_{t}^{\nu^{\ast}}=e^{\alpha_{t}}\left(\nabla h^{\ast}\left(\nabla h(X_{t}^{\nu^{\ast}})+e^{-\gamma_{t}}p_{t}\right)-X_{t}^{\nu^{\ast}}\right)dt\;.

p_{t}=\mathbb{E}\left[\int_{t}^{T}e^{\gamma_{u}}\left\{e^{\alpha_{u}+\beta_{u}}\nabla f(X_{u}^{\nu^{\ast}})+\left(\nabla^{2}h(X_{u})\,\nu^{\ast}_{u}-e^{\alpha_{u}-\gamma_{u}}p_{u}\right)\right\}\,du\,-e^{\delta_{T}}\nabla f(X_{T}^{\nu^{\ast}})\;{\Big{\lvert}}{\mathcal{F}}_{t}\right]\;,


\nabla^{2}h(X_{t})\,\nu_{t}^{\ast}-e^{\alpha_{t}-\gamma_{t}}p_{t}=\nabla^{2}h(X_{t})\,\nu_{t}^{\ast}-\left(\frac{\nabla h(X_{t}+e^{-\alpha_{t}}\nu_{t}^{\ast})-\nabla h(X_{t})}{e^{-\alpha_{t}}}\right)\;,

\left\{\begin{aligned} &d\tilde{p}^{(0)}_{t}=-e^{\gamma_{t}+\alpha_{t}+\beta_{t}}\,\mathbb{E}\left[\nabla f\left(X_{t}\right)\lvert{\mathcal{F}}_{t}\right]\,dt+d\tilde{\mathcal{M}}^{(0)}_{t}\\ &\tilde{p}^{(0)}_{T}=-e^{\delta_{T}}\mathbb{E}\left[\nabla f(X_{T}^{\nu^{\ast}}){\big{\lvert}}{\mathcal{F}}_{T}\right]\end{aligned}\right.\;,

\tilde{p}^{(0)}_{t}=\mathbb{E}\left[\int_{t}^{T}e^{\gamma_{u}+\alpha_{u}+\beta_{u}}\,\nabla f\left(X_{u}\right)\,du-e^{\delta_{T}}\,\nabla f(X_{T}^{\nu^{\ast}}){\Big{\lvert}}{\mathcal{F}}_{t}\right]\;.


Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Generative Adversarial Networks and Image Synthesis · Advanced Bandit Algorithms Research

Full text

A Latent Variational Framework for Stochastic Optimization

Philippe Casgrain

Department of Statistical Sciences

University of Toronto

Toronto, ON, Canada

[email protected]

Abstract

This paper provides a unifying theoretical framework for stochastic optimization algorithms by means of a latent stochastic variational problem. Using techniques from stochastic control, the solution to the variational problem is shown to be equivalent to that of a Forward Backward Stochastic Differential Equation (FBSDE). By solving these equations, we recover a variety of existing adaptive stochastic gradient descent methods. This framework establishes a direct connection between stochastic optimization algorithms and a secondary latent inference problem on gradients, where a prior measure on gradient observations determines the resulting algorithm.

1 Introduction

Stochastic optimization algorithms are tools which are crucial to solving optimization problems arising in machine learning. The initial motivation for these algorithms comes from the fact that computing the gradients of a target loss function becomes increasingly difficult as the scale and dimension of an optimization problem grows larger. In these large-scale optimization problems, deterministic gradient-based optimization algorithms perform poorly due to the computational load of repeatedly computing gradients. Stochastic optimization algorithms remedy this issue by replacing exact gradients of the target loss with a computationally cheap gradient estimator, trading off noise in gradient estimates for computational efficiency at each step.

To illustrate this idea, consider the problem of minimizing a generic risk function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ , taking the form

$$f(x)=\frac{1}{\lvert\mathfrak{N}\rvert}\sum_{z\in\mathfrak{N}}\ell(x;z)\;, \tag{1}$$

where $\ell:\mathbb{R}^{d}\times\mathcal{Z}\rightarrow\mathbb{R}$ , and where we define the set ${\mathfrak{N}}:=\{z_{i}\in\mathcal{Z}\;,\;i=1,\dots,N\}$ to be a set of training points. In this definition, we interpret $\ell(x;z)$ as the model loss at a single training point $z\in{\mathfrak{N}}$ for the parameters $x\in\mathbb{R}^{d}$ .

Since $N$ and $d$ are typically large, computing the gradients of $f$ can be time-consuming. Knowing this, let us consider the path of an optimization algorithm as given by $\{x_{t}\}_{t\in\mathbb{N}}$. Rather than computing $\nabla f(x_{t})$ directly at each point of the optimization process, we may instead collect noisy samples of gradients as

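$$g_{t}=\frac{1}{\lvert\mathfrak{N}_{t}^{m}\rvert}\sum_{z\in\mathfrak{N}_{t}^{m}}\nabla_{x}\ell(x_{t};z)\;, \tag{2}$$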

where for each $t$ , ${\mathfrak{N}}_{t}^{m}\subseteq{\mathfrak{N}}$ is an independent sample of size $m$ from the set of training points. We assume that $m\ll N$ is chosen small enough so that $g_{t}$ can be computed at a significantly lower cost than $\nabla f(x_{t})$ . Using the collection of noisy gradients $\{g_{t}\}_{t\in\mathbb{N}}$ , stochastic optimization algorithms construct an estimator $\widehat{\nabla f}(x_{t})$ of the gradient $\nabla f(x_{t})$ in order to determine the next step $x_{t+1}$ of the optimizer.
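The subsampled estimator can be sketched in a few lines of NumPy. This is only an illustration under assumptions that are not in the paper: the per-sample loss is taken to be least squares, $\ell(x;z)=\tfrac{1}{2}(a^{\intercal}x-b)^{2}$ with $z=(a,b)$, and the names `full_gradient`, `minibatch_gradient`, `A` and `b` are hypothetical.

```python
import numpy as np

def full_gradient(x, A, b):
    """Gradient of f(x) = (1/N) * sum_i 0.5 * (A[i] @ x - b[i])**2 (assumed loss)."""
    residual = A @ x - b
    return A.T @ residual / len(b)

def minibatch_gradient(x, A, b, m, rng):
    """Noisy gradient g_t computed from a uniformly drawn subsample of size m."""
    idx = rng.choice(len(b), size=m, replace=False)
    residual = A[idx] @ x - b[idx]
    return A[idx].T @ residual / m

rng = np.random.default_rng(0)
N, d, m = 10_000, 5, 32                      # m << N, as assumed in the text
A = rng.normal(size=(N, d))
b = A @ np.ones(d) + 0.1 * rng.normal(size=N)
x = np.zeros(d)

g_full = full_gradient(x, A, b)
# Averaging many independent minibatch gradients approaches the full gradient.
g_avg = np.mean([minibatch_gradient(x, A, b, m, rng) for _ in range(2000)], axis=0)
```

Each call to `minibatch_gradient` touches only `m` rows of the data, which is where the cost saving relative to `full_gradient` comes from.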

This paper presents a theoretical framework which provides new perspectives on stochastic optimization algorithms, and explores the implicit model assumptions that are made by existing ones. We achieve this by extending the approach taken by Wibisono et al. (2016) to stochastic algorithms. The key step in our approach is to interpret the task of optimization with a stochastic algorithm as a latent variational problem. As a result, we can recover algorithms from this framework which have built-in online learning properties. In particular, these algorithms use an online Bayesian filter on the stream of noisy gradient samples, $g_{t}$ , to compute estimates of $\nabla f(x_{t})$ . Under various model assumptions on $\nabla f$ and $g$ , we recover a number of common stochastic optimization algorithms.

1.1 Related Work

There is a rich literature on stochastic optimization algorithms as a consequence of their effectiveness in machine learning applications. Each algorithm introduces its own variation on the gradient estimator ${\widehat{\nabla f}(x_{t})}$ as well as other features which can improve the speed of convergence to an optimum. Amongst the simplest of these are stochastic gradient descent and its variants (Robbins and Monro, 1951), which use an estimator based on single gradient samples. Others, such as Lucas et al. (2018) and Nesterov, use momentum and acceleration as features to enhance convergence, and can be interpreted as using exponentially weighted moving averages as gradient estimators. Adaptive gradient descent methods such as AdaGrad from Duchi et al. (2011) and Adam from Kingma and Ba (2014) use similar moving average estimators, as well as dynamically updated normalization factors. For a survey paper which covers many modern stochastic optimization methods, see Ruder (2016).

There exist a number of theoretical interpretations of various aspects of stochastic optimization. Cesa-Bianchi et al. (2004) have shown a parallel between stochastic optimization and online learning. Some previous related works, such as Gupta et al. (2017), provide a general model for adaptive methods, generalizing the subgradient projection approach of Duchi et al. (2011). Aitchison (2018) uses a Bayesian model to explain the various features of gradient estimators used in stochastic optimization algorithms. This paper differs from these works by naturally generating stochastic algorithms from a variational principle, rather than attempting to explain their individual features. This work is most similar to that of Wibisono et al. (2016), who provide a variational model for continuous deterministic optimization algorithms.

There is a large body of research on continuous-time approximations to deterministic optimization algorithms via dynamical systems (ODEs) (Su et al. (2014); Krichene et al. (2015); Wilson et al. (2016); da Silva and Gazeau (2018)), as well as approximations to stochastic optimization algorithms by stochastic differential equations (SDEs) (Xu et al. (2018a, b); Raginsky and Bouvrie (2012); Mertikopoulos and Staudigl (2018); Krichene and Bartlett (2017)). In particular, the most similar of these works, Raginsky and Bouvrie (2012); Xu et al. (2018a, b), study continuous approximations to stochastic mirror descent by adding exogenous Brownian noise to the continuous dynamics derived in Wibisono et al. (2016). This work differs by deriving continuous stochastic dynamics for optimizers from a broader theoretical framework, rather than positing the continuous dynamics as-is. Although the equations studied in these papers may resemble some of the results derived in this one, they differ in a number of ways. Firstly, the randomness present in the optimizer dynamics obtained in this paper is not generated by an exogenous source of noise, but is in fact an explicit function of the randomness generated by observed stochastic gradients during the optimization process. Another important difference is that the optimizer dynamics presented in this paper make no use of the gradients of the objective function, $\nabla f$ (which are inaccessible to a stochastic optimizer), and are only a function of the stream of stochastic gradients $g_{t}$.

1.2 Contribution

To the author’s knowledge, this is the first paper to produce a theoretical model for stochastic optimization based on a variational interpretation. This paper extends the continuous variational framework of Wibisono et al. (2016) to model stochastic optimization. From this model, we derive optimality conditions in the form of a system of forward-backward stochastic differential equations (FBSDEs), and provide bounds on the expected rate of convergence of the resulting optimization algorithm to the optimum. By discretizing solutions of the continuous system of equations, we can recover a number of well-known stochastic optimization algorithms, demonstrating that these algorithms can be obtained as solutions of the variational model under various assumptions on the loss function, $f(x)$, that is being minimized.

1.3 Paper Structure

In Section 2 we define a continuous-time surrogate model of stochastic optimization. Section 3 uses this model to motivate a stochastic variational problem over optimizers, in which we search for stochastic optimization algorithms which achieve optimal average performance over a collection of minimization problems. In Section 4 we show that the necessary and sufficient conditions for optimality of the variational problem can be expressed as a system of Forward-Backward Stochastic Differential Equations. Theorem 4.2 provides rates of convergence for the optimal algorithm to the optimum of the minimization problem. Lastly, Section 5 recovers SGD, mirror descent, momentum, and other optimization algorithms as discretizations of the continuous optimality equations derived in Section 4 under various model assumptions. The proofs of the mathematical results of this paper are found within the appendices.

2 A Statistical Model for Stochastic Optimization

Over the course of this section, we present a variational model for stochastic optimization. The ultimate objective will be to construct a framework for measuring the average performance of an algorithm over a random collection of optimization problems. We define random variables in an ambient probability space $\smash{(\Omega,{\mathbb{P}},\mathfrak{G}=\{{\mathcal{G}}_{t}\}_{t\in[0,T]})}$, where ${\mathcal{G}}_{t}$ is a filtration which we will define at a later point in this section. We assume that loss functions are drawn from a random variable $f:\Omega\rightarrow C^{1}(\mathbb{R}^{d})$. Each draw from the random variable satisfies $f(x)\in\mathbb{R}$ for fixed $x\in\mathbb{R}^{d}$, and $f$ is assumed to be almost surely continuously differentiable in $x$. In addition, we make the technical assumption that $\mathbb{E}\,\lVert\nabla f(x)\rVert^{2}<\infty$ for all $x\in\mathbb{R}^{d}$.

We define an optimizer $X=(X_{t}^{\nu})_{t\geq 0}$ as a controlled process satisfying $X_{t}^{\nu}\in\mathbb{R}^{d}$ for all $t\geq 0$, with initial condition $X_{0}\in\mathbb{R}^{d}$. The paths of $X$ are assumed to be continuously differentiable in time so that the dynamics of the optimizer may be written as $dX_{t}^{\nu}=\nu_{t}\,dt$, where $\nu_{t}\in\mathbb{R}^{d}$ represents the control, and where we use the superscript to express the explicit dependence of $X^{\nu}$ on the control $\nu$. We may also write the optimizer in its integral form as $X_{t}^{\nu}=X_{0}+\int_{0}^{t}\nu_{u}\,du$, demonstrating that the optimizer is entirely characterized by a pair $(\nu,X_{0})$ consisting of a control process $\nu$ and an initial condition $X_{0}$. Using an explicit Euler discretization with step size $\epsilon>0$, the optimizer can be approximately represented through the update rule $X_{t+\epsilon}^{\nu}\approx X^{\nu}_{t}+\epsilon\,\nu_{t}$. This leads to the interpretation of $\nu_{t}$ as the (infinitesimal) step the algorithm takes at each point $t$ during the optimization process.
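The explicit Euler update $X_{t+\epsilon}^{\nu}\approx X_{t}^{\nu}+\epsilon\,\nu_{t}$ can be sketched as follows. The function name `euler_optimizer` and the feedback control $\nu_{t}=-X_{t}$ (plain gradient flow on the hypothetical loss $\tfrac{1}{2}\lVert x\rVert^{2}$) are assumptions chosen for illustration; the controls considered in the paper depend on noisy gradient observations rather than on the state directly.

```python
import numpy as np

def euler_optimizer(x0, control, eps, n_steps):
    """Explicit Euler discretization of dX_t = nu_t dt with step size eps.

    `control(t, x)` returns the step direction nu_t; it is a hypothetical
    stand-in for whatever rule generates the control.
    """
    x = np.asarray(x0, dtype=float)
    path = [x.copy()]
    for k in range(n_steps):
        x = x + eps * control(k * eps, x)
        path.append(x.copy())
    return np.array(path)

# Assumed example: nu_t = -X_t, i.e. gradient flow on 0.5 * ||x||^2.
path = euler_optimizer([1.0, -2.0], lambda t, x: -x, eps=0.1, n_steps=50)
```

With this control each step multiplies the state by $1-\epsilon$, so the discrete path contracts geometrically towards the origin.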

In order to capture the essence of stochastic optimization, we construct our model so that optimizers have restricted access to the gradients of the loss function $f$. Rather than being able to directly observe $\nabla f$ over the path of $X_{t}^{\nu}$, we assume that the algorithm may only use a noisy source of gradient samples, modeled by a càdlàg semi-martingale $g=\left(g_{t}\right)_{t\geq 0}$. (A càdlàg (continue à droite, limite à gauche) process is a continuous time process that is almost-surely right-continuous with finite left limit at each point $t$. A semi-martingale is the sum of a process of finite variation and a local martingale. For more information on continuous time stochastic processes and these definitions, see the canonical text Jacod and Shiryaev (2013).) As a simple motivating example, we can consider the model $g_{t}=\nabla f(X_{t}^{\nu})+\xi_{t}$, where $\xi_{t}$ is a white noise process. This particular model for the noisy gradient process can be interpreted as consisting of observing $\nabla f(X_{t}^{\nu})$ plus an independent source of noise. This concrete example will be useful to keep in mind to make sense of the results which we present over the course of the paper.
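A discrete-time analogue of the motivating model $g_{t}=\nabla f(X_{t}^{\nu})+\xi_{t}$ might look like the sketch below. The class name `NoisyGradientOracle`, the Gaussian noise with scale `sigma`, and the example loss $\tfrac{1}{2}\lVert x\rVert^{2}$ are all assumptions made for illustration; continuous-time white noise is only mimicked here by independent draws at each query.

```python
import numpy as np

class NoisyGradientOracle:
    """Returns g = grad f(x) + sigma * xi and keeps grad f hidden from the caller."""

    def __init__(self, grad_f, sigma, seed=0):
        self._grad_f = grad_f            # latent gradient, not exposed to the optimizer
        self._sigma = sigma
        self._rng = np.random.default_rng(seed)

    def sample(self, x):
        x = np.asarray(x, dtype=float)
        noise = self._sigma * self._rng.normal(size=x.shape)
        return self._grad_f(x) + noise

# Assumed latent loss f(x) = 0.5 * ||x||^2, so grad f(x) = x.
oracle = NoisyGradientOracle(grad_f=lambda x: x, sigma=0.5)
samples = np.array([oracle.sample([1.0, -1.0]) for _ in range(5000)])
```

An optimizer that only ever calls `oracle.sample` uses exactly the kind of restricted information that the model is designed to capture.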

To make the concept of information restriction mathematically rigorous, we restrict ourselves only to optimizers $X^{\nu}$ which are measurable with respect to the information generated by the noisy gradient process $g$. To do this, we first define the global filtration ${\mathcal{G}}$ by setting ${\mathcal{G}}_{t}=\sigma\left((g_{u})_{u\in[0,t]},f\right)$, the sigma algebra generated by the paths of $g$ as well as the realizations of the loss surface $f$. The filtration ${\mathcal{G}}_{t}$ is defined so that it contains the complete set of information generating the optimization problem until time $t$.

Next, we define the coarser filtration ${\mathcal{F}}_{t}=\sigma(g_{u})_{u\in[0,t]}\subset{\mathcal{G}}_{t}$ generated strictly by the paths of the noisy gradient process. This filtration represents the total set of information available to the optimizer up until time $t$. This allows us to formally restrict the flow of information to the algorithm by restricting ourselves to optimizers which are adapted to ${\mathcal{F}}_{t}$. More precisely, we say that the optimizer’s control $\nu$ is admissible if

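$$\nu\in\mathcal{A}:=\left\{\omega=(\omega_{t})_{t\geq 0}\;:\;\omega\text{ is }\mathcal{F}\text{-adapted},\;\;\mathbb{E}\int_{0}^{T}\lVert\omega_{t}\rVert^{2}+\lVert\nabla f(X_{t}^{\omega})\rVert^{2}\,dt<\infty\right\}\;. \tag{3}$$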

The set of optimizers generated by $\mathcal{A}$ can be interpreted as the set of optimizers which may only use the source of noisy gradients, which have bounded expected travel distance and have square-integrable gradients over their path.

3 The Optimizer’s Variational Problem

Having defined the set of admissible optimization algorithms, we set out to select those which are optimal in an appropriate sense. We proceed similarly to Wibisono et al. (2016), by proposing an objective functional which measures the performance of the optimizer over a finite time period.

The motivation for the optimizer’s performance metric comes from a physical interpretation of the optimization process. We can think of our optimization process as a particle traveling through a potential field defined by the target loss function $f$. As the particle travels through the potential field, it may either gain or lose momentum depending on its location and velocity, which will in turn affect the particle’s trajectory. Naturally, we may seek to find the path of a particle which reaches the optimum of the loss function while minimizing the total amount of kinetic and potential energy that is spent. We therefore turn to the Lagrangian interpretation of classical mechanics, which provides a framework for obtaining solutions to this problem. Over the remainder of this section, we lay out the Lagrangian formalism for the optimization problem we defined in Section 2.

To define a notion of energy in the optimization process, we provide a measure of distance in the parameter space. We use the Bregman Divergence as the measure of distance within our parameter space, which can embed additional information about the geometry of the optimization problem. The Bregman divergence, $D_{h}$ , is defined as

$$D_{h}(y,x)=h(y)-h(x)-\langle\nabla h(x),y-x\rangle \tag{4}$$

where $h:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is a strictly convex function satisfying $h\in C^{2}$ . We assume here that the gradients of $h$ are $L$ -Lipschitz smooth for a fixed constant $L>0$ . The choice of $h$ determines the way we measure distance, and is typically chosen so that it mimics features of the loss function $f$ . In particular, this quantity plays a central role in mirror descent and non-linear sub-gradient algorithms. For more information on this connection and on Bregman Divergence, see Nemirovsky and Yudin (1983) and Beck and Teboulle (2003).
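The Bregman divergence $D_{h}(y,x)=h(y)-h(x)-\langle\nabla h(x),y-x\rangle$ is straightforward to evaluate once $h$ and $\nabla h$ are available. The sketch below uses two familiar choices of $h$ purely as illustrations: the squared Euclidean norm, and the negative entropy. The latter is defined only on the positive orthant and does not have Lipschitz gradients, so it falls outside the assumptions stated here and is included only because it is a well-known example. The helper name `bregman` is hypothetical.

```python
import numpy as np

def bregman(h, grad_h, y, x):
    """D_h(y, x) = h(y) - h(x) - <grad h(x), y - x>."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    return h(y) - h(x) - grad_h(x) @ (y - x)

# Choice 1: h(x) = 0.5 * ||x||^2 gives D_h(y, x) = 0.5 * ||y - x||^2.
def sq_norm(x):
    return 0.5 * x @ x

def grad_sq_norm(x):
    return x

# Choice 2 (illustrative only): negative entropy, which gives the
# Kullback-Leibler divergence when x and y are probability vectors.
def neg_entropy(x):
    return np.sum(x * np.log(x))

def grad_neg_entropy(x):
    return np.log(x) + 1.0

y = np.array([0.2, 0.8])
x = np.array([0.5, 0.5])
d_euclid = bregman(sq_norm, grad_sq_norm, y, x)
d_kl = bregman(neg_entropy, grad_neg_entropy, y, x)
```

The two results differ for the same pair of points, which illustrates how the choice of $h$ changes the geometry used to measure distance.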

We define the total energy in our problem as the kinetic energy, accumulated through the movement of the optimizer, and the potential energy generated by the loss function $f$ . Under the assumption that $f$ almost surely admits a global minimum $x^{\star}=\arg\min_{x\in\mathbb{R}^{d}}f(x)$ , we may represent the total energy via the Bregman Lagrangian as

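$$\mathcal{L}(t,X,\nu)=e^{\gamma_{t}}\Big(\underbrace{e^{\alpha_{t}}D_{h}(X+e^{-\alpha_{t}}\nu,X)}_{\text{Kinetic Energy}}-\underbrace{e^{\beta_{t}}\left(f(X)-f(x^{\star})\right)}_{\text{Potential Energy}}\Big)\;, \tag{5}$$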

for fixed inputs $(t,X,\nu)$ , and where we assume that $\gamma,\alpha,\beta:\mathbb{R}^{+}\rightarrow\mathbb{R}$ are deterministic, and satisfy $\gamma,\alpha,\beta\in C^{1}$ . The functions $\gamma,\alpha,\beta$ can be interpreted as hyperparameters which tune the energy present at any state of the optimization process. An important property to note is that the Lagrangian is itself a random variable due to the randomness introduced by the latent loss function $f$ .
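As a worked special case, consider the Euclidean choice $h(x)=\tfrac{1}{2}\lVert x\rVert^{2}$. This choice is an assumption made here for illustration, not one singled out in this section. Substituting it into the Bregman Lagrangian $\mathcal{L}(t,X,\nu)=e^{\gamma_{t}}\big(e^{\alpha_{t}}D_{h}(X+e^{-\alpha_{t}}\nu,X)-e^{\beta_{t}}(f(X)-f(x^{\star}))\big)$ gives a kinetic term that is quadratic in the control:

```latex
% Assumption for illustration: h(x) = (1/2) ||x||^2
\begin{aligned}
D_{h}(y,x) &= \tfrac{1}{2}\lVert y-x\rVert^{2}
\;\;\Longrightarrow\;\;
D_{h}\!\left(X+e^{-\alpha_{t}}\nu,\,X\right) = \tfrac{1}{2}\,e^{-2\alpha_{t}}\lVert\nu\rVert^{2}, \\
\mathcal{L}(t,X,\nu) &= e^{\gamma_{t}}\left(\tfrac{1}{2}\,e^{-\alpha_{t}}\lVert\nu\rVert^{2}
 - e^{\beta_{t}}\bigl(f(X)-f(x^{\star})\bigr)\right).
\end{aligned}
```

In this case the kinetic term is the familiar $\tfrac{1}{2}\lVert\nu\rVert^{2}$ of classical mechanics, rescaled in time by $e^{\gamma_{t}-\alpha_{t}}$.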

The objective is then to find an optimizer within the admissible set $\mathcal{A}$ which can get close to the minimum $x^{\star}=\arg\min_{x\in\mathbb{R}^{d}}f(x)$, while simultaneously minimizing the energy cost over a finite time period $[0,T]$. The approach taken in classical mechanics and in Wibisono et al. (2016) fixes the endpoint of the optimizer at $x^{\star}$. Since we assume that the function $f$ is not directly visible to our optimizer, it is not possible to add a constraint of this type that will hold almost surely. Instead, we introduce a soft constraint which penalizes the algorithm’s endpoint in proportion to its optimality gap, $f(X_{T})-f(x^{\star})$. As such, we define the expected action functional $\mathcal{J}:\mathcal{A}\rightarrow\mathbb{R}$ as

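$$\mathcal{J}(\nu)=\mathbb{E}\Big[\;\underbrace{\int_{0}^{T}\mathcal{L}(t,X_{t}^{\nu},\nu_{t})\,dt}_{\text{Total Path Energy}}+\underbrace{e^{\delta_{T}}\left(f(X_{T}^{\nu})-f(x^{\star})\right)}_{\text{Soft End Point Constraint}}\Big]\;, \tag{6}$$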

where $\delta_{T}\in C^{1}$ is assumed to be an additional model hyperparameter, which controls the strength of the soft constraint.
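To make the structure of the action concrete, the sketch below approximates the quantity inside the expectation (path energy plus soft endpoint penalty) for a single, known realization of $f$ by a left Riemann sum. Several assumptions are made for illustration and are not taken from the paper: $h(x)=\tfrac{1}{2}\lVert x\rVert^{2}$, so that $e^{\alpha_{t}}D_{h}(X+e^{-\alpha_{t}}\nu,X)=\tfrac{1}{2}e^{-\alpha_{t}}\lVert\nu\rVert^{2}$; the control is recovered from the path by finite differences; and the name `action_riemann` is hypothetical. Computing $\mathcal{J}$ itself would additionally require averaging over draws of $f$ and of the gradient noise.

```python
import numpy as np

def action_riemann(path, eps, f, f_star, alpha, beta, gamma, delta_T):
    """Left Riemann-sum approximation of the action for one realization of f.

    Assumes h(x) = 0.5 * ||x||^2; alpha, beta, gamma are callables of time t.
    """
    total = 0.0
    for k in range(len(path) - 1):
        t = k * eps
        v = (path[k + 1] - path[k]) / eps                  # finite-difference control
        kinetic = 0.5 * np.exp(-alpha(t)) * (v @ v)
        potential = np.exp(beta(t)) * (f(path[k]) - f_star)
        total += np.exp(gamma(t)) * (kinetic - potential) * eps
    # Soft endpoint penalty e^{delta_T} * (f(X_T) - f(x*)).
    return total + np.exp(delta_T) * (f(path[-1]) - f_star)

def quad(x):
    return 0.5 * float(x @ x)        # assumed loss with minimum value 0 at x = 0

def zero(t):
    return 0.0                       # constant hyperparameter schedules

# A straight-line path from x = 1 to x = 0 over T = 1.
line = np.linspace(1.0, 0.0, 11).reshape(-1, 1)
action_line = action_riemann(line, 0.1, quad, 0.0, zero, zero, zero, 0.0)
```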

With this definition in place, the objective will be to select amongst admissible optimizers for those which minimize the expected action. Hence, we seek optimizers which solve the stochastic variational problem

$$\nu^{\ast}=\arg\min_{\nu\in\mathcal{A}}\mathcal{J}(\nu)\;. \tag{7}$$

Remark 1.

Note that the variational problem (7) is identical to the one with Lagrangian

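$$\tilde{\mathcal{L}}(t,X,\nu)=e^{\gamma_{t}}\left(e^{\alpha_{t}}D_{h}(X+e^{-\alpha_{t}}\nu,X)-e^{\beta_{t}}f(X)\right) \tag{8}$$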

and terminal penalty $e^{\delta_{T}}f(X_{T}^{\nu})$, since they differ by constants independent of $\nu$. Because of this, the results presented in Section 4 also hold in the case where $x^{\star}$ and $f(x^{\star})$ do not exist or are infinite.

4 Critical Points of the Expected Action Functional

In order to solve the variational problem (7), we make use of techniques from the calculus of variations and infinite dimensional convex analysis to provide optimality conditions. To address issues of information restriction, we rely on the stochastic control techniques developed by Casgrain and Jaimungal (2018a, b, c).

The approach we take relies on the fact that a necessary condition for the optimality of a Gâteaux differentiable functional $\mathcal{J}$ is that its Gâteaux derivative vanishes in all directions. Computing the Gâteaux derivative of $\mathcal{J}$ , we find an equivalence between the Gâteaux derivative vanishing and a system of Forward-Backward Stochastic Differential Equations (FBSDEs), yielding a generalization of the Euler-Lagrange equations to the context of our optimization problem. The precise result is stated in Theorem 4.1 below.

Theorem 4.1 (Stochastic Euler-Lagrange Equation).

A control $\nu^{\ast}\in\mathcal{A}$ is a critical point of $\mathcal{J}$ if and only if $((\frac{\partial\mathcal{L}}{\partial\nu}),{\mathcal{M}})$ is a solution to the system of FBSDEs,

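$$d\left(\frac{\partial\mathcal{L}}{\partial\nu}\right)_{t}=\mathbb{E}\left[\left(\frac{\partial\mathcal{L}}{\partial X}\right)_{t}\Big\lvert{\mathcal{F}}_{t}\right]\,dt+d{\mathcal{M}}_{t}\;\;\forall t<T\,,\;\;\left(\frac{\partial\mathcal{L}}{\partial\nu}\right)_{T}=-e^{\delta_{T}}\,\mathbb{E}\left[\nabla f(X_{T})\Big\lvert{\mathcal{F}}_{T}\right]\;, \tag{9}$$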

where we define the processes

[TABLE]

and where the process $\mathcal{M}=(\mathcal{M}_{t})_{t\in[0,T]}$ is an ${\mathcal{F}}$ -adapted martingale. As a consequence, if the solution to this FBSDE is unique, then it is the unique critical point of the functional $\mathcal{J}$ up to null sets.

Proof.

See Appendix C. ∎

Theorem 4.1 presents an analogue of the Euler-Lagrange equation with free terminal boundary. Rather than obtaining an ODE as in the classical result, we obtain an FBSDE, with backwards process $(\nicefrac{{\partial\mathcal{L}}}{{\partial\nu}})_{t}$, and forward state processes $\mathbb{E}[(\nicefrac{{\partial\mathcal{L}}}{{\partial X}})_{t}\lvert{\mathcal{F}}_{t}]$, $\int_{0}^{t}\left\|\nu_{u}\right\|\,du$ and $X_{t}^{\nu^{\ast}}$. (For a background on FBSDEs, we point readers to Pardoux and Tang (1999); Ma et al. (1999); Carmona (2016). At a high level, the solution to an FBSDE of the form (9) consists of a pair of processes $(\nicefrac{{\partial\mathcal{L}}}{{\partial\nu}},\mathcal{M})$, which simultaneously satisfy the dynamics and the boundary condition of (9). Intuitively, the martingale part of the solution can be interpreted as a random process which guides $(\nicefrac{{\partial\mathcal{L}}}{{\partial X}})_{t}$ towards the boundary condition at time $T$.) We can also interpret the dynamics of equation (9) as being the filtered optimal dynamics of (Wibisono et al., 2016, Equation 2.3), $\mathbb{E}[(\nicefrac{{\partial\mathcal{L}}}{{\partial X}})_{t}\lvert{\mathcal{F}}_{t}]$, plus the increments of the data-dependent martingale ${\mathcal{M}}_{t}$, with mechanics similar to that of the ‘innovations process’ of filtering theory. This martingale term should not be interpreted as a source of noise, but as an explicit function of the data, as is evident from its explicit form

[TABLE]

A feature of equation (9) is that optimality relies on the projection of $(\nicefrac{{\partial\mathcal{L}}}{{\partial X}})_{t}$ onto ${\mathcal{F}}_{t}$ . Thus, the optimization algorithm makes use of past noisy gradient observations in order to make local gradient predictions. Local gradient predictions are updated using a Bayesian mechanism, where the prior model for $\nabla f$ is conditioned with the noisy gradient information contained in ${\mathcal{F}}_{t}$ . This demonstrates that the solution depends only on the gradients of $f$ along the path of $X_{t}$ and no higher order properties.

4.1 Expected Rates of Convergence of the Continuous Algorithm

Using the dynamics (9) we obtain a bound on the rate of convergence of the continuous optimization algorithm that is analogous to Wibisono et al. (2016, Theorem 2.1). We introduce the Lyapunov energy functional

[TABLE]

where we define $x^{\star}$ to be a global minimum of $f$ . Under additional model assumptions, and by showing that this quantity is a super-martingale with respect to the filtration ${\mathcal{F}}$ , we obtain an upper bound for the expected rate of convergence from $X_{t}$ towards the minimum.

Theorem 4.2 (Convergence Rate).

Assume that the function $f$ is almost surely convex and that the scaling conditions $\dot{\gamma}_{t}=e^{\alpha_{t}}$ and $\dot{\beta}_{t}\leq e^{\alpha_{t}}$ hold. Moreover, assume that in addition to $h$ having $L$ -Lipschitz smooth gradients, $h$ is also $\mu$ -strongly-convex with $\mu>0$ . Define $x^{\star}=\arg\min_{x\in\mathbb{R}^{d}}f(x)$ to be a global minimum of $f$ . If $x^{\star}$ exists almost surely, the optimizer defined by FBSDE (9) satisfies

[TABLE]

where $\left[e^{-\gamma_{t}}\mathcal{M}\right]_{t}$ represents the quadratic variation of the process $e^{-\gamma_{t}}{\mathcal{M}}_{t}$ , where $\mathcal{M}$ is the martingale part of the solution defined in Theorem 4.1.

Proof.

See Appendix D. ∎

We may interpret the term $\mathbb{E}\left[\,[e^{-\gamma_{t}}{\mathcal{M}}]_{t}\right]$ as a penalty on the rate of convergence, which scales with the amount of noise present in our gradient observations. To see this, note that if there is no noise in our gradient observations, we obtain that ${\mathcal{F}}_{t}={\mathcal{G}}_{t}$ , and hence $\mathcal{M}_{t}\equiv 0$ , which recovers the exact deterministic dynamics of Wibisono et al. (2016) and the optimal convergence rate $O(e^{-\beta_{t}})$ . If the noise in our gradient estimates is large, we can expect $\mathbb{E}\left[\,[e^{-\gamma_{t}}{\mathcal{M}}]_{t}\right]$ to grow quickly and to counteract the shrinking effects of $e^{-\beta_{t}}$ . Thus, in the case of a convex objective function $f$ , any presence of gradient noise will proportionally hurt the rate of convergence to an optimum. We also point out that there will be a nontrivial dependence of $\mathbb{E}\left[\,[e^{-\gamma_{t}}{\mathcal{M}}]_{t}\right]$ on all model hyperparameters, the specific definition of the random variable $f$ , and the model for the noisy gradient stream, $(g_{t})_{t\geq 0}$ .

Remark 2.

We do not assume that the conditions of Theorem 4.2 carry throughout the remainder of the paper. In particular, Section 5 studies models which may not guarantee almost-sure convexity of the latent loss function.

5 Recovering Discrete Optimization Algorithms

In this section, we use the optimality equations of Theorem 4.1 to produce discrete stochastic optimization algorithms. The procedure we take is as follows. We first define a model for the processes $(\nabla f(X_{t}),g_{t})_{t\in[0,T]}$ . Second, we solve the optimality FBSDE (9) in closed form or approximate the solution via the first-order singular perturbation (FOSP) technique, as described in Appendix A. Lastly, we discretize the solutions with a simple Forward-Euler scheme in order to recover discrete algorithms.

Over the course of Sections 5.1 and 5.2, we show that various simple models for $(\nabla f(X_{t}),g_{t})_{t\in[0,T]}$ and different specifications of $h$ produce many well-known stochastic optimization algorithms. These establish the conditions, in the context of the variational problem of Section 2, under which each of these algorithms is optimal. As a consequence, this allows us to understand the prior assumptions which these algorithms make on the gradients of the objective function they are trying to minimize, and the way noise is introduced in the sampling of stochastic gradients, $(g_{t})_{t\geq 0}$ .

5.1 Stochastic Gradient Descent and Stochastic Mirror Descent

Here we propose a Gaussian model on gradients which loosely represents the behavior of mini-batch stochastic gradient descent with a training set of size $n$ and mini-batches of size $m$ . By specifying a martingale model for $\nabla f(X_{t})$ , we recover the stochastic gradient descent and stochastic mirror descent algorithms as solutions to the variational problem described in Section 2.

Let us assume that $\nabla f(X_{t})=\sigma W_{t}^{f}$ , where $\sigma>0$ and $(W^{f}_{t})_{t\geq 0}$ is a Brownian motion. Next, assume that the noisy gradient samples obtained from mini-batches over the course of the optimization evolve according to the model $\smash{g}_{t}=\sigma(W_{t}^{f}+\rho W_{t}^{e})$ , where $\rho=\sqrt{\nicefrac{{(n-m)}}{{m}}}$ and $W^{e}$ is an independent copy of $W^{f}$ . Here, we choose $\rho$ so that $\mathbb{V}[g_{t}]=(\nicefrac{{n}}{{m}})\mathbb{V}[\nabla f(X_{t})]=O(m^{-1})$ , which allows the variance to scale in $m$ and $n$ as it does with mini-batches.

Using symmetry, we obtain the trivial solution to the gradient filter, $\mathbb{E}[\nabla f(X_{t}){\lvert}{\mathcal{F}}_{t}]=(1+\rho^{2})^{-1}g_{t}$ , implying that the best estimate of the gradient at the point $X_{t}$ will be the most recent mini-batch sample observed, re-scaled by a constant depending on $n$ and $m$ . Using this expression for the filter, we obtain the following result.

Proposition 5.1.

The FOSP approximation to the solution of the optimality equations (9) can be expressed as

[TABLE]

where $h^{\ast}$ is the convex dual of $h$ and where $\tilde{\Phi}_{t}=e^{-\gamma_{t}}(\Phi_{0}+\int_{0}^{t}e^{\alpha_{u}+\beta_{u}+\gamma_{u}}\,du)$ is a deterministic learning rate with $\smash{\Phi_{0}=e^{\delta_{T}}-\int_{0}^{T}e^{\alpha_{u}+\beta_{u}+\gamma_{u}}\,du}$ . When $h$ has the form $h(x)=x^{\intercal}Mx$ for a symmetric positive-definite matrix $M$ , the FOSP approximation is exact, and (15) is the exact solution to the optimality FBSDE (9). The martingale portion of the solution to (9) can be expressed as ${\mathcal{M}}_{t}={\mathcal{M}}_{0}-(1+\rho^{2})^{-1}\int_{0}^{t}e^{\alpha_{u}+\beta_{u}+\gamma_{u}}\,dg_{u}$ .

Proof.

See Appendix E.1. ∎
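The learning rate $\tilde{\Phi}_{t}$ in Proposition 5.1 can be evaluated numerically once the schedules are fixed. The following is a minimal sketch, assuming hypothetical schedules $\alpha_{t}=0$ , $\beta_{t}=\gamma_{t}=t$ (which satisfy $\dot{\gamma}_{t}=e^{\alpha_{t}}$ and $\dot{\beta}_{t}\leq e^{\alpha_{t}}$ ) and hypothetical values $T=1$ , $\delta_{T}=3$ . The function name `learning_rate` and the trapezoidal quadrature are illustrative choices, not part of the paper.

```python
import numpy as np

def learning_rate(t, T, delta_T, alpha, beta, gamma, n_grid=20_001):
    """Evaluate Phi_tilde_t from the formula in Proposition 5.1 by trapezoidal quadrature."""
    def integral(upper):
        # Integral over [0, upper] of exp(alpha_u + beta_u + gamma_u).
        u = np.linspace(0.0, upper, n_grid)
        y = np.exp(alpha(u) + beta(u) + gamma(u))
        return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(u)))

    phi_0 = np.exp(delta_T) - integral(T)
    return float(np.exp(-gamma(t)) * (phi_0 + integral(t)))

# Hypothetical schedules (assumptions made for illustration only).
alpha = lambda u: 0.0 * u   # alpha_t = 0, so exp(alpha_t) = 1
beta = lambda u: 1.0 * u    # beta_t = t
gamma = lambda u: 1.0 * u   # gamma_t = t

T, delta_T = 1.0, 3.0
phi_start = learning_rate(0.0, T, delta_T, alpha, beta, gamma)
phi_end = learning_rate(T, T, delta_T, alpha, beta, gamma)
```

By construction of $\Phi_{0}$ , the two integrals cancel at $t=T$ , so the terminal learning rate reduces to $e^{\delta_{T}-\gamma_{T}}$ . Under these assumed schedules, $\delta_{T}$ must be large enough for $\Phi_{0}$ to stay positive.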

To obtain a discrete optimization algorithm from the result of Proposition 5.1, we employ a forward-Euler discretization of the ODE (15) on the finite mesh $\mathcal{T}=\{t_{0}=0\,,\;t_{k+1}=t_{k}+e^{-\alpha_{t_{k}}}:k\in\mathbb{N}\}$ . This discretization results in the update rule

[TABLE]

corresponding exactly to mirror descent (e.g. see Beck and Teboulle (2003)) using the noisy mini-batch gradients $g_{t}$ and a time-varying learning rate $\tilde{\Phi}_{t_{k}}$ . Moreover, setting $h(x)=\frac{1}{2}\|x\|^{2}$ , we recover the update rule $X_{t_{k+1}}-X_{t_{k}}=-\tilde{\Phi}_{t_{k}}\,g_{t_{k}}$ , which corresponds exactly to mini-batch SGD with a time-dependent learning rate.
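As an illustration of the discretized update, the sketch below implements the textbook mirror descent step $\nabla h(X_{t_{k+1}})=\nabla h(X_{t_{k}})-\tilde{\Phi}_{t_{k}}g_{t_{k}}$ on the mesh $\mathcal{T}$ . It is a sketch under stated assumptions rather than a transcription of the update rule above: the constant learning rate stands in for $\tilde{\Phi}_{t_{k}}$ , the constant $(1+\rho^{2})^{-1}$ is absorbed into it, and the least-squares problem and all function names are hypothetical.

```python
import numpy as np

def mirror_descent(grad_sample, x0, lr, grad_h, grad_h_star, alpha, n_steps):
    """Forward-Euler mirror descent on the mesh t_{k+1} = t_k + exp(-alpha(t_k))."""
    x, t = np.asarray(x0, dtype=float), 0.0
    for _ in range(n_steps):
        g = grad_sample(x)                       # noisy mini-batch gradient g_{t_k}
        x = grad_h_star(grad_h(x) - lr(t) * g)   # mirror step taken in the dual space
        t += np.exp(-alpha(t))                   # advance along the mesh
    return x

# Hypothetical least-squares problem with noiseless targets, so x_true minimises f.
rng = np.random.default_rng(0)
n, m, d = 512, 32, 3
A = rng.normal(size=(n, d))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true

def minibatch_grad(x):
    # Gradient of the average squared error over a random mini-batch of size m.
    idx = rng.choice(n, size=m, replace=False)
    return A[idx].T @ (A[idx] @ x - b[idx]) / m

# With h(x) = 0.5 * ||x||^2 both mirror maps are the identity, giving mini-batch SGD.
identity = lambda v: v
x_hat = mirror_descent(minibatch_grad, np.zeros(d), lr=lambda t: 0.05,
                       grad_h=identity, grad_h_star=identity,
                       alpha=lambda t: 0.0, n_steps=2000)
```

Swapping in another Legendre pair `grad_h`, `grad_h_star` changes the geometry of the step without touching the rest of the loop.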

This derivation demonstrates that the solution to the variational problem described in Section 2, under the assumption of a Gaussian model for the evolution of gradients, recovers mirror descent and SGD. In particular, the martingale gradient model proposed in this section can be roughly interpreted as assuming that gradients behave as random walks over the path of the optimizer. Moreover, the optimal gradient filter $\mathbb{E}[\nabla f(X_{t}){\lvert}{\mathcal{F}}_{t}]=(1+\rho^{2})^{-1}g_{t}$ shows that, for the algorithm to be optimal, mini-batch gradients should be re-scaled in proportion to $(1+\rho^{2})^{-1}=\nicefrac{{m}}{{n}}$ .
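The re-scaling factor $(1+\rho^{2})^{-1}=\nicefrac{{m}}{{n}}$ can be checked by simulation. The sketch below is a simplified illustration, not part of the original derivation: it samples the Gaussian model at a single time $t$ and estimates the least-squares coefficient of $\nabla f(X_{t})$ on $g_{t}$ , rather than conditioning on the whole observed path. The function names and the values $n=1000$ , $m=100$ are hypothetical.

```python
import numpy as np

def filter_coefficient(n, m):
    """Scaling (1 + rho^2)^{-1} applied to g_t, with rho^2 = (n - m) / m."""
    rho_sq = (n - m) / m
    return 1.0 / (1.0 + rho_sq)

def empirical_coefficient(n, m, sigma=1.0, t=1.0, n_samples=200_000, seed=0):
    """Least-squares slope of grad f(X_t) on g_t under the Gaussian model."""
    rng = np.random.default_rng(seed)
    rho = np.sqrt((n - m) / m)
    w_f = rng.normal(scale=np.sqrt(t), size=n_samples)  # samples of W^f_t
    w_e = rng.normal(scale=np.sqrt(t), size=n_samples)  # independent samples of W^e_t
    grad = sigma * w_f                                   # latent gradient
    g = sigma * (w_f + rho * w_e)                        # noisy mini-batch gradient
    return float(np.dot(grad, g) / np.dot(g, g))

n, m = 1000, 100
coef_theory = filter_coefficient(n, m)
coef_mc = empirical_coefficient(n, m)
```

The two coefficients should agree up to Monte Carlo error, which shrinks as `n_samples` grows.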

5.2 Kalman Gradient Descent and Momentum Methods

Using a linear state-space model for gradients, we can recover both the Kalman Gradient Descent algorithm of Vuckovic (2018) and momentum-based optimization methods of Polyak (1964). We assume that each component of $\smash{\nabla f(X_{t})=(\nabla_{i}f(X_{t}))_{i=1}^{d}}$ is modeled independently as a linear diffusive process. Specifically, we assume that there exist processes $\smash{y_{i}=(y_{i,t})_{t\geq 0}}$ so that for each $i$ , $\smash{\nabla_{i}f(X_{t})=b^{\intercal}y_{i,t}}$ , where $\smash{y_{i,t}\in\mathbb{R}^{\tilde{d}}}$ is the solution to the linear SDE $\smash{dy_{i,t}=-A\,y_{i,t}dt+L\,dW_{i,t}}$ . In particular, we use the notation $\hat{y}_{i,j,t}$ to refer to element $(i,j)$ of $\smash{\hat{y}\in\mathbb{R}^{d\times\tilde{d}}}$ , and use the notation $\smash{\hat{y}_{\cdot,j,t}=(\hat{y}_{i,j,t})_{i=1}^{d}}$ . We assume here that $\smash{A,L\in\mathbb{R}^{\tilde{d}\times\tilde{d}}}$ are positive definite matrices and each of the $\smash{W_{i}=(W_{i,t})_{t\geq 0}}$ are independent $\tilde{d}$ -dimensional Brownian motions.

Next, we assume that we may write each element of a noisy gradient process as ${g_{i,t}=b^{\intercal}y_{i,\cdot,t}+\sigma\xi_{i,t}}$ , where $\sigma>0$ and where $\xi_{i}=(\xi_{i,t})_{t\geq 0}$ are independent white noise processes. Noting that $\smash{\mathbb{E}[\,\nabla_{i}f(X_{t+h})\lvert{\mathcal{F}}_{t}]=b^{\intercal}e^{-Ah}y_{i,t}}$ , we find that this model implicitly assumes that gradients are expected to decrease exponentially in magnitude as a function of time, at a rate determined by the eigenvalues of the matrix $A$ . The parameters $\sigma$ and $L$ can be interpreted as controlling the scale of the noise within the observation and signal processes.

Using this model, we obtain that the filter can be expressed as $\smash{\mathbb{E}[\,\nabla_{i}f(X_{t}){\lvert}{\mathcal{F}}_{t}]=b^{\intercal}\hat{y}_{i,t}}$ , where $\hat{y}_{i,t}=\mathbb{E}[y_{i,t}\lvert{\mathcal{F}}_{t}]$ . The process $\hat{y}_{i,t}$ is expressed as the solution to the Kalman-Bucy filtering equations (for information on continuous-time filtering and the Kalman-Bucy filter, we refer the reader to the text of Bensoussan (2004) or the lecture notes of Van Handel (2007))

[TABLE]

with the initial conditions $\hat{y}_{i,0}=0$ and $\smash{\bar{P}_{0}=\mathbb{E}[y_{i,0}y_{i,0}^{\intercal}]}$ , and where we define the innovations process $d\hat{B}_{i,t}=\sigma^{-1}\left(g_{i,t}-b^{\intercal}\hat{y}_{i,t}\right)\,dt$ with the property that each $\hat{B}_{i}$ is an independent ${\mathcal{F}}$ -adapted Brownian motion.

Inserting the linear state space model and its filter into the optimality equations (9) we obtain the following result.

Proposition 5.2 (State-Space Model Solution to the FOSP).

Assume that the gradient state-space model described above holds. The FOSP approximation to the solution of the optimality equations (9) can be expressed as

[TABLE]

where $\tilde{\Phi}_{t}=e^{-\gamma_{t}}(b^{\intercal}e^{-At}\Phi_{0}+\int_{0}^{t}e^{\alpha_{u}+\beta_{u}+\gamma_{u}}b^{\intercal}e^{-A(t-u)}\,du)\in\mathbb{R}^{\tilde{d}}$ is a deterministic learning rate, where $e^{A}$ represents the matrix exponential, and where $\Phi_{0}=e^{\delta_{T}}e^{AT}-\int_{0}^{T}e^{\alpha_{u}+\beta_{u}+\gamma_{u}}e^{Au}\,du$ can be chosen to have arbitrarily large eigenvalues by scaling $\delta_{T}$ . The martingale portion of the solution of (9) can be expressed as ${\mathcal{M}}_{t}={\mathcal{M}}_{0}-\sigma^{-1}\int_{0}^{t}e^{\alpha_{u}+\beta_{u}+\gamma_{u}}b^{\intercal}e^{-A(t-u)}\bar{P}_{u}b\,d\hat{B}_{u}$ .

Proof.

See Appendix E.2 ∎

5.2.1 Kalman Gradient Descent

In order to recover Kalman Gradient Descent, we discretize the processes $X_{t}^{\nu^{\ast}}$ and $\hat{y}$ over the finite mesh $\mathcal{T}$ , defined in equation (18). Applying a Forward-Euler-Maruyama discretization of (18) and the filtering equations (17), we obtain the discrete dynamics

[TABLE]

where each of the $\xi_{i,k}$ and $w_{i,k}$ are standard Gaussian random variables of appropriate size. The filter $\smash{\hat{y}_{i,k}=\mathbb{E}[y_{t_{k}}{\lvert}\{g_{t_{k^{\prime}}}\}_{k^{\prime}=1}^{k}]}$ for the discrete equations can be written as the solution to the discrete Kalman filtering equations, provided in Appendix B. Discretizing the process $X^{\nu^{\ast}}$ over $\mathcal{T}$ with the Forward-Euler scheme, we obtain discrete dynamics for the optimizer in terms of the Kalman Filter $\hat{y}$ , as

[TABLE]

yielding a generalized version of Kalman gradient descent of Vuckovic (2018) with $\tilde{d}$ states for each gradient element. Setting $h(x)=\frac{1}{2}\|x\|^{2}$ , $\tilde{d}=1$ and $b=1$ recovers the original Kalman gradient descent algorithm with a time-varying learning rate.

Just as in Section 5.1, we interpret each $g_{t_{k}}$ as being a mini-batch gradient, as with equation (2). The algorithm (20) computes a Kalman filter from these noisy mini-batch observations and uses it to update the optimizer’s position.
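The special case $\tilde{d}=1$ , $b=1$ , $h(x)=\frac{1}{2}\|x\|^{2}$ can be sketched in a few lines. The code below is an illustration under assumptions: it uses the standard predict/update form of the scalar Kalman filter, applied coordinate-wise, rather than the exact equations of Appendix B, a constant learning rate in place of $\tilde{\Phi}_{t_{k}}$ , and a hypothetical noisy quadratic objective. All names and parameter values are placeholders.

```python
import numpy as np

def kalman_step(y_hat, P, g, A_tilde, L_tilde, sigma, b=1.0):
    """One predict/update step of a scalar Kalman filter for the gradient state."""
    y_pred = A_tilde * y_hat                       # predicted state
    P_pred = A_tilde * P * A_tilde + L_tilde**2    # predicted variance
    K = P_pred * b / (b * P_pred * b + sigma**2)   # Kalman gain
    y_new = y_pred + K * (g - b * y_pred)          # correct with the innovation
    P_new = (1.0 - K * b) * P_pred                 # posterior variance
    return y_new, P_new

def kalman_gradient_descent(grad_sample, x0, lr, n_steps, A_tilde, L_tilde, sigma):
    """Descend along the filtered gradient instead of the raw noisy gradient."""
    x = np.asarray(x0, dtype=float)
    y_hat, P = np.zeros_like(x), np.ones_like(x)
    for _ in range(n_steps):
        g = grad_sample(x)                                   # noisy gradient observation
        y_hat, P = kalman_step(y_hat, P, g, A_tilde, L_tilde, sigma)
        x = x - lr * y_hat                                   # step along the filtered estimate
    return x

# Hypothetical objective f(x) = 0.5 * ||x||^2 observed with additive Gaussian noise.
rng = np.random.default_rng(0)
noisy_grad = lambda x: x + 0.1 * rng.normal(size=x.shape)
x_final = kalman_gradient_descent(noisy_grad, [5.0, -3.0], lr=0.1, n_steps=500,
                                  A_tilde=0.9, L_tilde=0.5, sigma=1.0)
```

Here `A_tilde`, `L_tilde` and `sigma` play the roles of the discretized model parameters; they encode the prior belief about how quickly gradients decay and how noisy the observations are.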

5.2.2 Momentum and Generalized Momentum Methods

By considering the asymptotic behavior of the Kalman gradient descent method described in Section 5.2.1, we recover a generalized version of momentum gradient descent methods, which includes mirror descent behavior, as well as multiple momentum states. Let us assume that $\alpha_{t}=\alpha_{0}$ remains constant in time. Then, using the asymptotic update rule for the Kalman filter, as shown in Proposition B.2, and equation (20), we obtain the update rule

[TABLE]

where $\tilde{A}=I-e^{-\alpha_{0}}A$ and where $K_{\infty}\in\mathbb{R}^{\tilde{d}}$ is defined in the statement of Proposition B.2. This yields a generalized momentum update rule where we keep track of $\tilde{d}$ momentum states with $(\hat{y}_{i,j,k})_{j=1}^{\tilde{d}}$ , and update the optimizer's position using a linear update rule. This algorithm can be seen as being most similar to the Aggregated Momentum technique of Lucas et al. (2018), which also keeps track of multiple momentum states which decay at different rates.

Under the special case where $\tilde{d}=1$ , $b=1$ , and $h=\frac{1}{2}\|x\|^{2}$ we recover the exact momentum algorithm update rule of Polyak (1964) as

[TABLE]

where we have a scalar learning rate $\tilde{\Phi}_{t_{k}}$ , where $p_{1}=\tilde{A}-K_{\infty}b^{\intercal}\tilde{A}$ , $p_{2}=K_{\infty}$ are positive scalars, and where $g_{t_{k}}$ are mini-batch draws from the gradient as in equation (2).
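To make the dependence of $p_{1}$ and $p_{2}$ on the model parameters concrete, the sketch below iterates the scalar Riccati recursion to its fixed point and then applies the expressions $p_{1}=\tilde{A}-K_{\infty}b^{\intercal}\tilde{A}$ and $p_{2}=K_{\infty}$ . It assumes that $K_{\infty}$ denotes the steady-state Kalman gain in its standard textbook form; the function name and the numerical values are hypothetical.

```python
def steady_state_momentum(A_tilde, L_tilde, sigma, b=1.0, n_iter=10_000, tol=1e-14):
    """Iterate the scalar Riccati recursion to a fixed point and return (p1, p2)."""
    P = 1.0
    for _ in range(n_iter):
        P_pred = A_tilde * P * A_tilde + L_tilde**2
        K = P_pred * b / (b * P_pred * b + sigma**2)
        P_next = (1.0 - K * b) * P_pred
        converged = abs(P_next - P) < tol
        P = P_next
        if converged:
            break
    # Steady-state gain computed from the limiting variance.
    P_pred = A_tilde * P * A_tilde + L_tilde**2
    K_inf = P_pred * b / (b * P_pred * b + sigma**2)
    p1 = A_tilde - K_inf * b * A_tilde   # decay applied to the previous momentum state
    p2 = K_inf                            # weight on the newest mini-batch gradient
    return p1, p2

p1, p2 = steady_state_momentum(A_tilde=0.9, L_tilde=0.5, sigma=1.0)
p1_noisy, p2_noisy = steady_state_momentum(A_tilde=0.9, L_tilde=0.5, sigma=5.0)
```

In this sketch, a larger assumed observation noise $\sigma$ lowers the weight $p_{2}$ on new gradients and raises the decay coefficient $p_{1}$ , so the optimizer leans more heavily on its accumulated momentum.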

The recovery of the momentum algorithm of Polyak (1964) has some interesting consequences. Since $p_{1}$ and $p_{2}$ are functions of the model parameters $\sigma,A$ and $\alpha_{0}$ , we obtain a direct relationship between the optimal choice for the momentum model parameters, the assumed scale of gradient noise $\sigma,L>0$ and the assumed expected rate of decay of gradients, as given by $e^{-At}$ . This result gives insight as to how momentum parameters should be chosen in terms of one's prior beliefs on the optimization problem.

6 Discussion and Future Research Directions

Over the course of the paper we present a variational framework on optimizers, which interprets the task of stochastic optimization as an inference problem on a latent surface that we wish to optimize. By solving a variational problem over continuous optimizers with asymmetric information, we find that optimal algorithms should satisfy a system of FBSDEs projected onto the filtration ${\mathcal{F}}$ generated by the noisy observations of the latent process.

By solving these FBSDEs and obtaining continuous-time optimizers, we find a direct relationship between the measure assigned to the latent surface and its relationship to how data is observed. In particular, assigning simple prior models to the pair of processes $(\nabla f(X_{t}),g_{t})_{t\in[0,T]}$ recovers a number of well-known and widely used optimization algorithms. The fact that this framework can naturally recover these algorithms invites further study. In particular, it is still an open question whether it is possible to recover other stochastic algorithms via this framework, particularly those with second-order scaling adjustments such as ADAM or AdaGrad.

From a more technical perspective, the intent is to further explore properties of the optimization model presented here and the form of the algorithms it suggests. In particular, the optimality FBSDE (9) is nonlinear, high-dimensional and intractable in general, making it difficult to use existing FBSDE approximation techniques, so new tools may need to be developed to understand the full extent of its behavior.

Lastly, numerical work on the algorithms generated by this framework can provide some insights as to which prior gradient models work well when discretized. The extension of symplectic and quasi-symplectic stochastic integrators applied to the BSDEs and SDEs that appear in this paper also has the potential for interesting future work.

Appendix A Obtaining Solutions to the Optimality FBSDE

A.1 A Momentum-Based Representation of the Optimizer Dynamics

Using a simple change of variables we may represent the dynamics of the FBSDE (9) in a simpler fashion, which will aid us in obtaining solutions to this system of equations. Let us define the momentum process $p=(p_{t})_{t\in[0,T]}$ as

[TABLE]

Noting that, since $h$ is convex, we have the property that $\nabla h^{\ast}(x)=(\nabla h)^{-1}(x)$ , we may use equation (23) to write $\nu^{\ast}$ in terms of the momentum process as

[TABLE]

The introduction of this process allows us to represent the solution to the optimality FBSDE (9), and by extension the optimizer, in a much more tractable way. Re-writing (9) in terms of $p_{t}$ , we find that

[TABLE]

where the dynamics of the forward process $X^{\nu^{\ast}}$ can be expressed as

[TABLE]

This particular change of variables corresponds exactly to the Hamiltonian representation of the optimizer’s dynamics, which we show in Appendix A.3.

Writing out the explicit solution to the FBSDE (25), we obtain a representation for the optimizer’s dynamics as

[TABLE]

showing that the optimizer’s momentum can be represented as a time-weighted average of the expected future gradients over the remainder of the optimization and the term $e^{\gamma_{t}}\nabla^{2}h(X_{t})\,\nu^{\ast}_{t}-e^{\alpha_{t}}p_{t}$ , where the weights are determined by the choice of hyperparameters $\alpha,\beta$ and $\gamma$ . Noting that

[TABLE]

we find that the additional correction term in (27) can be interpreted as the remainder in the first-order Taylor expansion of the term $\nabla h(X_{t}+e^{-\alpha_{t}}\nu^{\ast})$ .

The representation (27) demonstrates that the optimizer does not depend only on the instantaneous value of gradients at the point $X_{t}^{\nu^{\ast}}$ . Rather, we find that the algorithm’s behaviour depends on the expected value of all future gradients that will be encountered over the remainder of the optimization process, projected onto the set of accumulated gradient information, ${\mathcal{F}}_{t}$ . This is in stark contrast to most known stochastic optimization algorithms, which only make explicit use of local gradient information in order to bring the optimizer towards an optimum.

A.2 First-Order Singular Perturbation Approximation

When $h$ does not take the quadratic form $h(x)=\frac{1}{2}x^{\intercal}Mx$ for some positive-definite matrix $M$ , the nonlinear dynamics of the FBSDE (9) or of the equivalent momentum form (25) make it difficult to provide a solution for general $h$ . More precisely, the Taylor expansion term (28) constitutes the main obstacle in obtaining solutions in general.

In cases where the scaling parameter $\alpha_{t}$ is sufficiently large, we can assume that the Taylor expansion remainder term of equation (28) will become negligibly small. Hence, we may approximate the optimality dynamics of the FBSDE (25) by setting this term to zero. This can be interpreted as the first-order term in a singular perturbation expansion of the solution to the momentum FBSDE (25).
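The behaviour of the Taylor remainder can be illustrated numerically. The sketch below evaluates only the unscaled remainder $\nabla h(x+e^{-\alpha}\nu)-\nabla h(x)-e^{-\alpha}\nabla^{2}h(x)\,\nu$ , without the exponential weights that accompany it in the FBSDE, for a hypothetical non-quadratic $h$ and for a quadratic $h$ . The test functions, the points `x` and `nu`, and the values of $\alpha$ are assumptions made for illustration.

```python
import numpy as np

def taylor_remainder(grad_h, hess_h, x, nu, alpha):
    """Remainder of the first-order expansion of grad_h(x + exp(-alpha) * nu) around x."""
    eps = np.exp(-alpha)
    return grad_h(x + eps * nu) - grad_h(x) - eps * hess_h(x) @ nu

# Non-quadratic convex example: h(x) = sum(x^4 / 4 + x^2 / 2).
grad_quartic = lambda x: x**3 + x
hess_quartic = lambda x: np.diag(3.0 * x**2 + 1.0)

# Quadratic example: h(x) = 0.5 * x^T M x with M symmetric positive-definite.
M = np.array([[2.0, 0.5], [0.5, 1.0]])
grad_quad = lambda x: M @ x
hess_quad = lambda x: M

x = np.array([1.0, -0.5])
nu = np.array([0.3, 0.8])

# Increasing alpha by log(2) halves exp(-alpha).
r_alpha = np.linalg.norm(taylor_remainder(grad_quartic, hess_quartic, x, nu, alpha=4.0))
r_alpha_plus = np.linalg.norm(
    taylor_remainder(grad_quartic, hess_quartic, x, nu, alpha=4.0 + np.log(2.0)))
r_quad = np.linalg.norm(taylor_remainder(grad_quad, hess_quad, x, nu, alpha=0.0))
```

Halving $e^{-\alpha}$ shrinks the remainder by a factor close to four, consistent with a second-order term, while for quadratic $h$ the remainder vanishes up to floating-point error, which is why the approximation is exact in that case.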

Under the assumption that the Taylor remainder term vanishes, we obtain the approximation $\tilde{p}^{(0)}=(\tilde{p}^{(0)}_{t})_{t\in[0,T]}$ for the momentum, which we present in the following proposition.

Proposition A.1 (First-Order Singular Perturbation (FOSP)).

The linear FBSDE

[TABLE]

admits a solution that can be expressed as

[TABLE]

provided that $\mathbb{E}\left[\int_{0}^{T}e^{\gamma_{u}+\alpha_{u}+\beta_{u}}\,\lVert\nabla f\left(X_{u}\right)\rVert\,du\right]<\infty$ .

Proof.

Noting that the remainder term in the expression (28) vanishes, we get that

[TABLE]

Under the assumption that $\alpha,\beta,\delta,\gamma$ are continuous over $[0,T]$ and that $\mathbb{E}\|f(x)\|^{2}<\infty$ , the right part of (31) is bounded. Now note that the integral on the left side of (31) is upper bounded for all $T$ by the integral provided in the integrability condition of Proposition A.1, and therefore this condition is a sufficient condition for the expression (31) to be finite and well-defined.

∎

Although a general, model-independent bound for the accuracy of such approximations is beyond the scope of this paper, the FOSP approximation can still serve as a reasonable and computationally cheap alternative to solving the original problem dynamics directly with a BSDE numerical scheme. For more information on singular perturbation methods in the context of FBSDEs, see Janković et al. (2012).

A.3 Hamiltonian Representation of the Optimizer Dynamics

Just as in Hamiltonian classical mechanics, it is possible to express the optimality FBSDE of Theorem 4.1 with Hamiltonian equations of motion. We define the Hamiltonian $\mathcal{H}$ as the Legendre dual of $\mathcal{L}$ , which can be written as

[TABLE]

where $p=\frac{\partial\mathcal{L}}{\partial X}$ . Using the identity $D_{h}(x,y)=D_{h^{\ast}}(\nabla h(y),\nabla h(x))$ , where $h^{\ast}$ is the Legendre dual of $h$ , and inverting the expression for $\frac{\partial\mathcal{L}}{\partial X}$ in terms of $p$ , we may compute equation (32) as (see Wibisono et al. (2016, Appendix B.4) for the full details of the computation)

[TABLE]
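The duality identity for Bregman divergences used in this computation, $D_{h}(x,y)=D_{h^{\ast}}(\nabla h(y),\nabla h(x))$ with the arguments swapped on the dual side, can be verified numerically. The sketch below uses a hypothetical Legendre pair, $h(x)=\sum_{i}e^{x_{i}}$ and $h^{\ast}(u)=\sum_{i}(u_{i}\log u_{i}-u_{i})$ , chosen only because both gradients are available in closed form; it is not the $h$ of the paper.

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """Bregman divergence D_phi(x, y) = phi(x) - phi(y) - <grad_phi(y), x - y>."""
    return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

# Hypothetical Legendre pair: h(x) = sum(exp(x)), h*(u) = sum(u * log(u) - u) for u > 0.
h = lambda x: np.sum(np.exp(x))
grad_h = lambda x: np.exp(x)
h_star = lambda u: np.sum(u * np.log(u) - u)
grad_h_star = lambda u: np.log(u)

x = np.array([0.2, -1.0, 0.7])
y = np.array([-0.4, 0.3, 1.1])

primal = bregman(h, grad_h, x, y)
dual_swapped = bregman(h_star, grad_h_star, grad_h(y), grad_h(x))
dual_unswapped = bregman(h_star, grad_h_star, grad_h(x), grad_h(y))
```

For a non-quadratic $h$ such as this one the Bregman divergence is not symmetric, so `dual_unswapped` equals $D_{h}(y,x)$ and differs from `primal`; the order of the arguments matters.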

Using this definition of $\mathcal{H}$ , and using the FBSDE (9), we obtain the following equivalent representation for the dynamics of the optimizer.

Using the simple substitution $p_{t}=\left(\frac{\partial\mathcal{L}}{\partial X}\right)_{t}$ and noting from equations (10) and (11) that

[TABLE]

a straightforward computation applied to the definition of $\mathcal{H}$ shows that the dynamics of the optimality FBSDE (9) admit the alternate Hamiltonian representation of the optimizer dynamics

[TABLE]

along with the boundary condition $p_{T}=0$ .

Appendix B The Discrete Kalman Filter

Here we present the Kalman filtering equations used in Section 5.2. Consider the model presented in equations (19),

[TABLE]

where we use the notation $\tilde{A}_{k}=(I-e^{-\alpha_{t_{k}}}A)$ and $\tilde{L}_{k}=Le^{-\alpha_{t_{k}}}$ , and where $w_{i,k}$ and $\xi_{i,k}$ are all independent standard Gaussian random variables. We provide the Kalman filtering equations for this model in the following proposition.

Proposition B.1 (Walrand and Dimakis (2006, Theorem 10.2)).

Let $\hat{y}_{i,k}=\mathbb{E}[y_{t_{k}}{\lvert}\sigma(g_{t_{k^{\prime}}})_{k^{\prime}=1}^{k}]$ . Then $\hat{y}_{i,k}$ satisfies the recursive equation

[TABLE]

where the matrices $K_{k}$ are obtained via the independent recursive equations

[TABLE]

For more information on the discrete Kalman filter, its derivation and for asymptotic properties, we refer the reader to the lecture notes Walrand and Dimakis (2006).

Next, we provide a result on the asymptotic properties of the Kalman filter in the proposition that follows.

Proposition B.2 (Walrand and Dimakis (2006, Theorem 11.2)).

Assume that $\alpha_{t_{k}}=\alpha_{t_{0}}$ is constant, so that $\tilde{A}_{k}=\tilde{A}$ and $\tilde{L}_{k}=\tilde{L}$ become constant, and assume that there exists a positive-definite solution $K_{\infty}\in\mathbb{R}^{\tilde{d}\times\tilde{d}}$ to the algebraic matrix equation

[TABLE]

Then, we may write the asymptotic dynamics of the filter $\hat{y}_{i}$ as

[TABLE]

where $K_{\infty}$ is the solution to the system of algebraic matrix equations

[TABLE]

For more information on the Kalman Filter, its derivation and theoretical properties, see Walrand and Dimakis (2006).
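For $\tilde{d}>1$ , a steady-state gain can be approximated by iterating the filter's covariance recursion until it stops changing. The sketch below uses the standard textbook predict/update form with a scalar observation $g=b^{\intercal}y+\sigma\xi$ ; it is an assumption that this matches the algebraic equations above term by term, and the matrices, the vector `b` and the function name are hypothetical.

```python
import numpy as np

def steady_state_gain(A_tilde, L_tilde, b, sigma, n_iter=5000):
    """Iterate the discrete Riccati recursion and return the limiting gain and covariance."""
    d = A_tilde.shape[0]
    P = np.eye(d)
    K = np.zeros(d)
    for _ in range(n_iter):
        P_pred = A_tilde @ P @ A_tilde.T + L_tilde @ L_tilde.T   # predicted covariance
        K = P_pred @ b / (b @ P_pred @ b + sigma**2)             # gain for a scalar observation
        P = (np.eye(d) - np.outer(K, b)) @ P_pred                # posterior covariance
    return K, P

# Hypothetical model with two momentum states and alpha_0 = 0.
A = np.array([[0.2, 0.05], [0.05, 0.4]])
A_tilde = np.eye(2) - A
L_tilde = 0.3 * np.eye(2)
b = np.array([1.0, 1.0])

K_inf, P_inf = steady_state_gain(A_tilde, L_tilde, b, sigma=1.0)

# Applying one more step of the recursion should leave the gain unchanged.
P_pred = A_tilde @ P_inf @ A_tilde.T + L_tilde @ L_tilde.T
K_check = P_pred @ b / (b @ P_pred @ b + 1.0**2)
```

Each component of `K_inf` sets how strongly the corresponding momentum state reacts to a new mini-batch gradient.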

Appendix C Proofs Relating to Theorem 4.1

Before going forward with the main part of the proof, we first present a lemma for the computation of the Gâteaux derivative of $\mathcal{J}$ .

Lemma C.1.

The functional $\mathcal{J}$ is everywhere Gâteaux differentiable in $\mathcal{A}$ . The Gâteaux derivative at a point $\nu\in\mathcal{A}$ in the direction $\tilde{\omega}=\omega-\nu$ for $\omega\in\mathcal{A}$ takes the form

[TABLE]

Proof.

If we assume that the conditions of Leibniz’ rule hold, we may compute the Gâteaux derivative as

[TABLE]

where we have

[TABLE]

Note here that the derivative in $f$ is path-wise for every fixed realization of the function $f$ . Since $f\in C^{1}$ , we have that $\nabla f$ is also well-defined for every realization of $f$ .

To ensure that this computation is valid, and that the conditions of the Leibniz rule are met, due to the continuity of (46) in $\tilde{\omega}$ , it is sufficient for us to show that the integrals in equation (46) are bounded for any $\tilde{\omega}$ and $\nu$ . First, note that by the Young and Jensen inequalities,

[TABLE]

where the boundedness holds from the fact that $\tilde{\omega}\in\mathcal{A}$ and that $\mathbb{E}\|f(x)\|^{2}<\infty$ for all $x\in\mathbb{R}^{d}$ .

Next, we focus on the left part of equation (46). By the Cauchy-Schwarz and Young inequalities, we have

[TABLE]

Using the $L$ -Lipschitz property of the gradients of $h$ , we can also bound the partial derivatives of the Lagrangian with the triangle inequality as

[TABLE]

where $C_{0}=\sup_{t\in[0,T]}\{e^{\alpha_{t}+\gamma_{t}}+e^{\gamma_{t}}+e^{\alpha_{t}+\gamma_{t}+\beta_{t}}\}$ is bounded by the assumption that $\alpha,\beta,\gamma$ are continuous in $[0,T]$ .

Using the above result, and applying Young’s inequality to the previous result, we can upper bound equation (51) as

[TABLE]

where the number 32 is chosen to be much larger than what is strictly necessary by Young’s inequality. Notice here that by the definition of $\mathcal{A}$ , this forms an integrable upper bound to the left integral of equation (46), validating our use of Leibniz’s rule, and showing that $\mathcal{J}$ is indeed Gâteaux differentiable.

Now that integrability concerns have been dealt with, we can proceed with the computation of the Gâteaux derivative. By applying integration by parts to the left side of equation (54) and moving the right hand side into the integral, we obtain

[TABLE]

Using the tower property and Fubini’s theorem on the right, we get

[TABLE]

as desired. ∎

C.1 Proof of Theorem 4.1

Using the representation of the Gâteaux derivative of $\mathcal{J}$ brought forth by Lemma C.1, we may proceed with the proof of Theorem 4.1.

Proof of Theorem 4.1.

The goal is to show that the FBSDE (9) is a necessary and sufficient condition for $\nu^{\ast}$ to be a critical point of $\mathcal{J}$ . For any Gâteaux differentiable function $\mathcal{J}$ , a necessary and sufficient condition for a point $\nu^{\ast}\in\mathcal{A}$ to be a critical point is that its Gâteaux derivative vanishes in any valid direction. Lemma C.1 shows that the Gâteaux derivative takes the form of equation (45). Therefore, all that remains is to show that the FBSDE (9) is a necessary and sufficient condition for equation (45) to vanish.

**Sufficiency.** We will show that equation (45) vanishes when the FBSDE (9) holds. Assume that there exists a solution to the FBSDE (9) satisfying $\nu^{\ast}\in\mathcal{A}$ . We may then express the solution to the FBSDE explicitly as

[TABLE]

Inserting this into the right side of (45), we find that $\left\langle D\mathcal{J}(\nu),\omega\right\rangle$ vanishes for all $\omega\in\mathcal{A}$ , demonstrating sufficiency.

**Necessity.** Conversely, let us assume that $\left\langle D\mathcal{J}(\nu),\omega-\nu\right\rangle=0$ for all $\omega\in\mathcal{A}$ and for some $\nu\in\mathcal{A}$ for which the FBSDE (9) is not satisfied. We will show by contradiction that this statement cannot hold by choosing a direction in which the Gâteaux derivative does not vanish. Consider the choice

[TABLE]

for some sufficiently small $\rho>0$ . We will first show that $\omega^{\rho}\in\mathcal{A}$ for some $\rho>0$ .

First, note that $\omega^{\rho}$ is clearly ${\mathcal{F}}_{t}$ -adapted, and that $\omega^{0}=\nu$ . Moreover, since $\nu\in\mathcal{A}$ , we have that $\mathbb{E}\int_{0}^{T}\,\lVert\nu_{t}\rVert^{2}+\lVert\nabla f(X_{t}^{\nu})\rVert^{2}\,dt<\infty$ . Notice that by the continuity of $\nabla f$ and the definition of $X$ , the expression

[TABLE]

is continuous in $\rho$ . Since (56) is bounded for $\rho=0$ , by continuity there exists some $\rho>0$ for which (56) is bounded and by extension where $\omega^{\rho}\in\mathcal{A}$ for this same value of $\rho$ .

Inserting (55) into the Gâteaux derivative (45), we get that

[TABLE]

which is strictly positive unless the FBSDE (9) is satisfied, thus forming a contradiction and demonstrating that the condition is necessary. ∎

Appendix D Proof of Theorem 4.2

Proof.

The proof of this theorem is broken up into multiple parts. The idea will be to first show that the energy functional $\mathcal{E}$ is a super-martingale with respect to ${\mathcal{F}}_{t}$ , and then to use this property to bound the expected distance to the optimum. Lastly, we bound a quadratic co-variation term which appears within these equations to obtain the final result.

Before delving into the proof, we introduce standard notation for semi-martingale calculus. We use the notation $dY_{t}=dY_{t}^{c}+\Delta Y_{t}$ to indicate the increments of the continuous part $Y^{c}$ of a process $Y$ and its discontinuities $\Delta Y_{t}=Y_{t}-Y_{t-}$ , where we use the notation $t-$ to indicate the left limit of the process. We use the notation $[Y,Z]_{t}$ to represent the quadratic co-variation of two processes $Y$ and $Z$ . This quadratic co-variation term can be decomposed into $d[Y,Z]_{t}=d[Y,Z]_{t}^{c}+\langle\Delta Y_{t},\Delta Z_{t}\rangle$ , where $[Y,Z]_{t}^{c}$ represents the quadratic co-variation between $Y^{c}$ and $Z^{c}$ , and where $\langle\Delta Y_{t},\Delta Z_{t}\rangle$ represents the inner product of their discontinuities at $t$ . For more information on semi-martingale calculus and the associated notation, see Jacod and Shiryaev (2013, Sections 3-5).

Dynamics of the Bregman Divergence. The idea will now be to show that the energy functional $\mathcal{E}$ , defined in equation (13), is a super-martingale with respect to the visible filtration ${\mathcal{F}}_{t}$ .

Using Itô’s formula and Itô’s product rule for càdlàg semi-martingales (Jacod and Shiryaev, 2013, Theorem 4.57), as well as the short-hand notation $Y_{t}=X_{t}+e^{-\alpha_{t}}\nu^{\ast}_{t}$ , we obtain

[TABLE]

where from line 1 to 2, we use the identity $d[\nabla g(Y),Y]_{t}=\sum_{i,j}\frac{\partial^{2}g(Y_{t})}{\partial x_{i}\partial x_{j}}d[Y_{i},Y_{j}]_{t}^{c}+\langle\Delta(\nabla g(Y_{t})),\Delta Y_{t}\rangle$ for any $C^{2}$ function $g$ .

Note that since $h$ is convex, $\nabla^{2}h$ must have non-negative eigenvalues, and hence $\frac{1}{2}\sum_{i,j=1}^{d}\frac{\partial^{2}h(Y_{t})}{\partial x_{i}\partial x_{j}}d\left[Y_{i},Y_{j}\right]_{t}^{c}\geq 0$ . The convexity of $h$ also implies that $\langle\nabla h(x)-\nabla h(y),x-y\rangle\geq 0$ , and therefore we get $\langle\Delta\left(\nabla h(Y_{t})\right),\Delta Y_{t}\rangle\geq 0$ . Finally, the convexity of $h$ implies that $\Delta h(Y_{t})-\langle\nabla h(Y_{t}),\Delta Y_{t}\rangle\geq 0$ . Combining these observations, we find that

[TABLE]
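The convexity facts used in this step, namely monotonicity of the gradient and non-negativity of the Bregman divergence, can be checked numerically on a concrete example. The sketch below assumes a hypothetical test function $h(x)=\log\sum_{i}e^{x_{i}}+\tfrac{\mu}{2}\|x\|^{2}$ with $\mu=0.1$ ; this choice and all function names are illustrative and are not taken from the paper.

```python
import math
import random


def h(x, mu=0.1):
    """Convex test function: log-sum-exp plus a small quadratic term."""
    m = max(x)
    lse = m + math.log(sum(math.exp(xi - m) for xi in x))
    return lse + 0.5 * mu * sum(xi * xi for xi in x)


def grad_h(x, mu=0.1):
    """Gradient of h: softmax(x) + mu * x."""
    m = max(x)
    w = [math.exp(xi - m) for xi in x]
    s = sum(w)
    return [wi / s + mu * xi for wi, xi in zip(w, x)]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def monotonicity_gap(x, y):
    """<grad h(x) - grad h(y), x - y>, which is non-negative for convex h."""
    gx, gy = grad_h(x), grad_h(y)
    return dot([a - b for a, b in zip(gx, gy)], [a - b for a, b in zip(x, y)])


def bregman(y, x):
    """D_h(y, x) = h(y) - h(x) - <grad h(x), y - x>, non-negative for convex h."""
    return h(y) - h(x) - dot(grad_h(x), [a - b for a, b in zip(y, x)])


rng = random.Random(0)
pairs = [([rng.gauss(0, 2) for _ in range(4)], [rng.gauss(0, 2) for _ in range(4)])
         for _ in range(1000)]
min_gap = min(monotonicity_gap(x, y) for x, y in pairs)
min_bregman = min(bregman(y, x) for x, y in pairs)
print(min_gap >= 0.0, min_bregman >= 0.0)
```

A random check of this kind is only a sanity test for one function; the inequalities themselves follow from convexity alone.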

**Super-martingale property of $\mathcal{E}$.** Applying the scaling conditions to the optimality FBSDE (9), we obtain the dynamics

[TABLE]

Inserting this into the dynamics of the energy functional, and applying the upper bound (59), we find that

[TABLE]

where we use the notation $\mathcal{M}^{\prime}_{t}$ to represent the ${\mathcal{F}}_{t}$ -martingale defined as

[TABLE]

Now note that, by the assumed convexity of $f$ , $D_{f}(x^{\star},Y_{t})$ is almost surely non-negative. Moreover, by the scaling conditions, $e^{\alpha_{t}}-\dot{\beta}_{t}$ is non-negative. Hence, the drift in equation (63) is almost surely non-positive, and $\mathcal{E}_{t}$ is a super-martingale.

Using the super-martingale property, we find that $\mathbb{E}\left[\mathcal{E}_{t}\right]\leq\mathbb{E}\left[\mathcal{E}_{0}\right]=\mathbb{E}\left[D_{h}(x^{\star},X_{0}+e^{-\alpha_{0}}\nu_{0})+e^{\beta_{0}}\left(f(X_{0})-f(x^{\star})\right)\right]=C_{0}\;,$ where $C_{0}\geq 0$ . Using the definition of $\mathcal{E}$ , and using the fact that $D_{h}\geq 0$ if $h$ is convex, we obtain

[TABLE]

**Upper bound on the Quadratic Co-variation.** We now upper bound the quadratic co-variation term appearing on the right-hand side of (65). Using the further change of variable $Z_{t}=\nabla h(Y_{t})$ , and noting that the assumed convexity of $h$ gives $\nabla h^{\ast}(x)=(\nabla h)^{-1}(x)$ , we get $[\nabla h(Y),Y]_{t}=[Z,\nabla h^{\ast}(Z)]_{t}$ .

Assuming that $h$ is $\mu$ -strongly convex, its convex conjugate $h^{\ast}$ must have $\mu^{-1}$ -Lipschitz gradients. This implies that (i) the eigenvalues of $\nabla^{2}h^{\ast}$ are bounded above by $\mu^{-1}$ , and (ii) by the Cauchy-Schwarz inequality, $\langle\nabla h^{\ast}(x)-\nabla h^{\ast}(y),x-y\rangle\leq\mu^{-1}\|x-y\|^{2}$ . Using these two observations and writing out the expression for $[Z,\nabla h^{\ast}(Z)]_{t}$ , we get

[TABLE]

Moreover, note that since $Z_{t}=\nabla h(X_{t}^{\nu^{\ast}}+e^{-\alpha_{t}}\nu^{\ast}_{t})$ and since $\nabla h(X_{t}^{\nu^{\ast}})$ is a process of finite variation, the optimality dynamics (9) imply that $[Z]_{t}=[e^{-\gamma_{t}}\mathcal{M}]_{t}=e^{-\gamma_{t}}[\mathcal{M}]_{t}$ .
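The duality facts used in this bound can be illustrated with a simple quadratic example. The sketch below assumes a hypothetical diagonal quadratic $h(x)=\tfrac{1}{2}\sum_{i}a_{i}x_{i}^{2}$ , which is $\mu$ -strongly convex with $\mu=\min_{i}a_{i}$ and has conjugate $h^{\ast}(z)=\tfrac{1}{2}\sum_{i}z_{i}^{2}/a_{i}$ . It checks that $\nabla h^{\ast}$ inverts $\nabla h$ and that $\langle\nabla h^{\ast}(x)-\nabla h^{\ast}(y),x-y\rangle\leq\mu^{-1}\|x-y\|^{2}$ on random pairs. The curvatures `A` and all function names are illustrative choices, not part of the paper.

```python
import random

A = [0.5, 1.0, 3.0]  # hypothetical curvatures: h(x) = 0.5 * sum(a_i * x_i**2)
MU = min(A)          # strong-convexity constant of h


def grad_h(x):
    """Gradient of h."""
    return [a * xi for a, xi in zip(A, x)]


def grad_h_star(z):
    """Gradient of the conjugate h*(z) = 0.5 * sum(z_i**2 / a_i)."""
    return [zi / a for a, zi in zip(A, z)]


def lipschitz_slack(x, y):
    """mu^{-1} * ||x - y||^2 - <grad h*(x) - grad h*(y), x - y>; expected >= 0."""
    d = [xi - yi for xi, yi in zip(x, y)]
    gd = [gx - gy for gx, gy in zip(grad_h_star(x), grad_h_star(y))]
    inner = sum(a * b for a, b in zip(gd, d))
    return sum(di * di for di in d) / MU - inner


rng = random.Random(1)
points = [[rng.gauss(0, 1) for _ in A] for _ in range(500)]

# grad h* should undo grad h (up to floating-point rounding).
max_inverse_error = max(
    abs(xi - ri) for x in points for xi, ri in zip(x, grad_h_star(grad_h(x)))
)
# The smoothness inequality for h* should hold for every pair of points.
min_slack = min(lipschitz_slack(x, y) for x, y in zip(points[::2], points[1::2]))
print(max_inverse_error, min_slack)
```

For a general strongly convex $h$ the conjugate has no closed form, so this quadratic case should be read only as a consistency check of the two properties, not as a proof.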

Inserting the quadratic co-variation bound into equation (65) and using the super-martingale property, we obtain the final result

[TABLE]

as desired.

∎

Appendix E Proofs of Propositions 5.1 and 5.2

Both proofs in this section are applications of the momentum representation of the optimizer dynamics and of the FOSP approximation to the solution of the optimality FBSDE (9).

E.1 Proof of Proposition 5.1

Proof.

Using Proposition A.1, we find that the solution to the FOSP takes the form

[TABLE]

Applying Fubini’s theorem, and the martingale property of $\mathbb{E}\left[\nabla f\left(X_{u}\right)\lvert{\mathcal{F}}_{u}\right]=\nicefrac{{g_{u}}}{{(1+\rho^{2})}}$ , we find that

[TABLE]

Inserting the expression above into equation (24), and re-arranging terms, we obtain the desired result. ∎

E.2 Proof of Proposition 5.2

Proof.

Using Proposition A.1, we find that the solution to the FOSP takes the form

[TABLE]

Applying Fubini’s theorem, and noting that $\mathbb{E}[\,\nabla_{i}f(X_{t+h})\lvert y_{i,t}]=\sum_{j=1}^{\tilde{d}}(b^{\intercal}e^{-Ah})_{j}\,y_{\cdot,j,t}$ , we obtain

[TABLE]

Inserting the expression above into equation (24), and re-arranging terms, we obtain the desired result. ∎

Bibliography

Laurence Aitchison. A unified theory of adaptive stochastic gradient descent as Bayesian filtering. arXiv preprint arXiv:1807.07540, 2018.
Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
Alain Bensoussan. Stochastic control of partially observable systems. Cambridge University Press, 2004.
René Carmona. Lectures on BSDEs, stochastic control, and stochastic differential games with financial applications, volume 1. SIAM, 2016.
Philippe Casgrain and Sebastian Jaimungal. Mean field games with partial information for algorithmic trading. arXiv preprint arXiv:1803.04094, 2018a.
Philippe Casgrain and Sebastian Jaimungal. Mean-field games with differing beliefs for algorithmic trading. arXiv preprint arXiv:1810.06101, 2018b.
Philippe Casgrain and Sebastian Jaimungal. Trading algorithms with learning in latent alpha models. arXiv preprint arXiv:1806.04472, 2018c.
Nicolo Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.
