A Stochastic Interpretation of Stochastic Mirror Descent: Risk-Sensitive Optimality

Navid Azizan; Babak Hassibi

arXiv:1904.01855 · math.OC · April 4, 2019

TL;DR

This paper presents a new interpretation of stochastic mirror descent as a risk-sensitive optimal estimator within exponential family distributions, and proposes a modified symmetric version of SMD.

Contribution

It introduces a risk-sensitive interpretation of SMD and proposes a symmetric variant, extending theoretical understanding of these algorithms in non-Gaussian settings.

Findings

01. SMD can be viewed as a risk-sensitive estimator for exponential family distributions.

02. A modified symmetric SMD (SSMD) is proposed based on this interpretation.

03. The analysis extends SMD properties beyond Gaussian assumptions using Bregman divergence.

Equations

$$L(w)=\sum_{i=1}^{n}L_{i}(w),$$

$$\nabla\psi(w_{i})=\nabla\psi(w_{i-1})-\eta\nabla L(w_{i-1}),\qquad w_{0}$$

$$\nabla\psi(w_{i})=\nabla\psi(w_{i-1})-\eta\nabla L_{i}(w_{i-1}),\qquad w_{0}$$

$$\nabla\psi(w_{i})=\nabla\psi(w_{i-1})-\eta_{i}\nabla L_{i}(w_{i-1}),\qquad w_{0}$$

$$D_{\psi}(w,w^{\prime})=\psi(w)-\psi(w^{\prime})-\nabla\psi(w^{\prime})^{T}(w-w^{\prime}).$$

$$\|w-w^{\prime}\|^{2}=\|w-w^{\prime\prime}\|^{2}+\|w^{\prime\prime}-w^{\prime}\|^{2}-2(w^{\prime}-w^{\prime\prime})^{T}(w-w^{\prime\prime})$$

$$D_{\psi}(w,w^{\prime})=D_{\psi}(w,w^{\prime\prime})+D_{\psi}(w^{\prime\prime},w^{\prime})-(\nabla\psi(w^{\prime})-\nabla\psi(w^{\prime\prime}))^{T}(w-w^{\prime\prime}).$$

$$D_{\psi_{1}}(w,w_{1})+D_{\psi_{2}}(w,w_{2})=D_{\psi_{1}}(w_{*},w_{1})+D_{\psi_{2}}(w_{*},w_{2})+D_{\psi_{1}+\psi_{2}}(w,w_{*}),$$

$$\nabla(\psi_{1}+\psi_{2})(w_{*})=\nabla\psi_{1}(w_{1})+\nabla\psi_{2}(w_{2}).$$

$$\mathbb{E}\,\nabla\psi(w)=\nabla\psi(w_{0}).$$

$$\{(x_{i},y_{i}),\ i=1,\dots,n\}$$

$$y_{i}=f(x_{i},w)+v_{i},\qquad i=1,\dots,n$$

$$L(w)=\sum_{i=1}^{n}L_{i}(w),\qquad L_{i}(w)=\ell(y_{i},f(x_{i},w)),$$

$$L(w)=\sum_{i=1}^{n}\ell(y_{i}-f(x_{i},w)).$$

$$\nabla\psi(w_{i})=\nabla\psi(w_{i-1})+\eta\frac{\partial f(x_{i},w_{i-1})}{\partial w}\ell^{\prime}(y_{i}-f(x_{i},w_{i-1})),\qquad w_{0}.$$

$$y_{i}=x_{i}^{T}w+v_{i},\qquad i=1,\dots,n$$

$$\nabla\psi(w_{i})=\nabla\psi(w_{i-1})+\eta x_{i}\ell^{\prime}(y_{i}-x_{i}^{T}w_{i-1}),\qquad w_{0}.$$

$$w_{i}=\operatorname*{arg\,min}_{w}\ D_{\psi}(w,w_{i-1})+\eta w^{T}\nabla L_{i}(w_{i-1}),$$

$$E_{i}(w_{i},w_{i-1}):=D_{\psi-\eta L_{i}}(w_{i},w_{i-1})+\eta L_{i}(w_{i}).$$

$$D_{\psi}(w,w_{i-1})+\eta\ell(v_{i})=D_{\psi}(w,w_{i})+\eta D_{L_{i}}(w,w_{i-1})+E_{i}(w_{i},w_{i-1}).$$

$$D_{\psi}(w,w_{0})+\eta\sum_{i=1}^{T}\ell(v_{i})=D_{\psi}(w,w_{T})+\eta\sum_{i=1}^{T}D_{L_{i}}(w,w_{i-1})+\sum_{i=1}^{T}E_{i}(w_{i},w_{i-1})$$

$$\min_{\{w_{i}\}}\max_{w,\{v_{i}\}}\frac{D_{\psi}(w,w_{T})+\eta\sum_{i=1}^{T}D_{L_{i}}(w,w_{i-1})}{D_{\psi}(w,w_{0})+\eta\sum_{i=1}^{T}\ell(v_{i})}=1$$

$$D_{L_{i}}(w,w_{i-1})=(x_{i}^{T}(w-w_{i-1}))^{2},$$

$$\min_{\{w_{i}\}}\max_{w,\{v_{i}\}}\frac{\|w-w_{T}\|^{2}+\eta\sum_{i=1}^{T}(x_{i}^{T}(w-w_{i-1}))^{2}}{\|w-w_{0}\|^{2}+\eta\sum_{i=1}^{T}v_{i}^{2}}$$

$$\mathcal{W}=\{w\in\mathbb{R}^{m}\mid y_{i}=x_{i}^{T}w,\ i=1,\dots,n\}.$$

$$w_{\infty}=\operatorname*{arg\,min}_{w\in\mathcal{W}}D_{\psi}(w,w_{0}).$$

$$w_{\infty}=\operatorname*{arg\,min}_{w\in\mathcal{W}}\psi(w).$$

$$\min_{\{z_{i}\}}\mathbb{E}_{\mid\{y_{i}\}}\left[\frac{1}{2}\sum_{i=1}^{T}(x_{i}^{T}w-z_{i})^{2}\right],$$

$$\min_{\{z_{i}\}}\mathbb{E}_{\mid\{y_{i}\}}\left[\sum_{i=1}^{T}D_{\ell}(y_{i}-x_{i}^{T}w,y_{i}-z_{i})\right].$$

$$\min_{\{z_{i}\}}\mathbb{E}_{\mid\{y_{i}\}}\exp\left(\frac{1}{2}\sum_{i=1}^{T}(x_{i}^{T}w-z_{i})^{2}\right),$$

Taxonomy

Methods: Stochastic Gradient Descent

Full text

A Stochastic Interpretation of Stochastic Mirror Descent:

Risk-Sensitive Optimality

Navid Azizan and Babak Hassibi

This work was supported in part by the National Science Foundation under grants CCF-1423663, CCF-1409204 and ECCS-1509977, by a grant from Qualcomm Inc., by NASA’s Jet Propulsion Laboratory through the President and Director’s Fund, and by an Amazon (AWS) AI Fellowship. N. Azizan is with the Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA 91125, USA. B. Hassibi is with the Department of Electrical Engineering, California Institute of Technology, Pasadena, CA 91125, USA.

Abstract

Stochastic mirror descent (SMD) is a fairly new family of algorithms that has recently found a wide range of applications in optimization, machine learning, and control. It can be considered a generalization of the classical stochastic gradient algorithm (SGD), where instead of updating the weight vector along the negative direction of the stochastic gradient, the update is performed in a “mirror domain” defined by the gradient of a (strictly convex) potential function. This potential function, and the mirror domain it yields, provides considerable flexibility in the algorithm compared to SGD. While many properties of SMD have already been obtained in the literature, in this paper we exhibit a new interpretation of SMD, namely that it is a risk-sensitive optimal estimator when the unknown weight vector and additive noise are non-Gaussian and belong to the exponential family of distributions. The analysis also suggests a modified version of SMD, which we refer to as symmetric SMD (SSMD). The proofs rely on some simple properties of Bregman divergence, which allow us to extend results from quadratics and Gaussians to certain convex functions and exponential families in a rather seamless way.

I Introduction

Stochastic mirror descent (SMD), which includes the popular stochastic gradient descent (SGD) as a special case, has become one of the most widely used families of algorithms for optimization, machine learning, and beyond [1, 2, 3, 4, 5, 6, 7]. The convergence behavior of such algorithms has been extensively studied in the literature [8, 9] under various assumptions. Several other properties and interpretations of SMD have recently been proven in the literature [10, 11]. In earlier work, we demonstrated a fundamental conservation law for SMD and used it to establish properties such as minimax optimality, deterministic convergence, and implicit regularization [12, 6]. The main contribution of this paper is to provide a new stochastic interpretation of SMD, namely that it is risk-sensitive optimal. This generalizes a similar result about SGD in the literature [13, 14]. We also propose a new “more symmetric” version of SMD, called symmetric SMD (SSMD), which is suggested by our analysis.

The paper is organized as follows. We review the main properties of SMD and the notion of Bregman divergence in Section II. The risk-sensitive optimality result and its proof, as well as the new SSMD algorithm, are provided in Section III. Finally, we mention another stochastic result about SMD in Section IV and conclude in Section V.

II Background

Consider a separable loss function of some unknown parameter (or weight) vector $w\in\mathbb{R}^{p}$ :

$$L(w)=\sum_{i=1}^{n}L_{i}(w),$$

where the $L_{i}(\cdot)$ are called the instantaneous (or local) loss functions, and where our goal is to minimize $L(\cdot)$ over $w$ . For example, the conventional gradient descent (GD) algorithm can be used as an attempt to perform such minimization. A generalization of GD, called the mirror descent (MD) algorithm, was first introduced by Nemirovski and Yudin [1] and can be described as follows. Consider a strictly convex differentiable function $\psi(\cdot)$ , called the potential function. Then MD is given by the following recursion

$$\nabla\psi(w_{i})=\nabla\psi(w_{i-1})-\eta\nabla L(w_{i-1}),\qquad w_{0}$$

where $\eta>0$ is known as the step size or learning rate. Note that, due to the strict convexity of $\psi(\cdot)$ , the gradient $\nabla\psi(\cdot)$ defines an invertible map so that the recursion in (1) yields a unique $w_{i}$ at each iteration. Compared to classical GD, rather than update the weight vector along the direction of the negative gradient, the update is done in the “mirrored” domain determined by the invertible transformation $\nabla\psi(\cdot)$ . Mirror descent was originally conceived to exploit the geometrical structure of the problem by choosing an appropriate potential. Note that MD reduces to GD when $\psi(w)=\frac{1}{2}\|w\|^{2}$ , since the gradient is simply the identity map. Other examples include the exponentiated gradient descent (aka the exponential weights) and the $p$ -norms algorithm [15, 16]. As with GD, it is straightforward to show that MD converges to a local minimum of $L(\cdot)$ , provided the step size $\eta$ is small enough.
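As a rough illustration of the recursion, the sketch below implements a single mirror descent step for two potentials: the quadratic potential, for which the step is plain gradient descent, and an unnormalized negative entropy, for which it becomes a multiplicative (exponentiated-gradient style) update. The function name `mirror_step`, the choice of the unnormalized entropy, and the numerical values are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def mirror_step(w, grad, eta, grad_psi, grad_psi_inv):
    """One mirror descent step: grad_psi(w_new) = grad_psi(w) - eta * grad."""
    return grad_psi_inv(grad_psi(w) - eta * grad)

# Quadratic potential psi(w) = 0.5 * ||w||^2: the mirror map is the identity.
identity = lambda z: z

# Assumed potential for illustration: psi(w) = sum(w * log(w) - w) on w > 0,
# whose gradient is log(w) and whose inverse mirror map is exp(z).
w = np.array([0.2, 0.5, 0.3])
g = np.array([1.0, -2.0, 0.5])  # stand-in gradient with hypothetical values
eta = 0.1

w_gd = mirror_step(w, g, eta, identity, identity)  # plain gradient step
w_eg = mirror_step(w, g, eta, np.log, np.exp)      # multiplicative update
```

With the quadratic potential the result equals `w - eta * g`, and with the entropic potential it equals `w * np.exp(-eta * g)`, which stays positive without any projection.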

When $n$ is large, computation of the entire gradient may be cumbersome. Alternatively, in online scenarios, the entire loss function $L(\cdot)$ may not be available and only the local loss functions may be provided at each iteration. In such settings, a stochastic version of MD has been introduced, aptly called stochastic mirror descent (SMD), and which can be considered the straightforward generalization of stochastic gradient descent (SGD):

$$\nabla\psi(w_{i})=\nabla\psi(w_{i-1})-\eta\nabla L_{i}(w_{i-1}),\qquad w_{0}$$

In the offline setting, the various instantaneous loss functions $L_{i}(\cdot)$ can either be drawn at random or cycled through periodically. In the online setting, they are provided at each iteration. Unlike MD (and GD), for a fixed step size $\eta$, SMD does not generally converge, unless there exists a $w$ that simultaneously minimizes every local loss function $L_{i}(\cdot)$. (Footnote 1: If this is not the case, then even if the current estimate were at a local minimum $w_{*}$ of the global loss function $L(\cdot)$, any of the local gradients $\nabla L_{i}(w_{*})$ could be nonzero, which would move us away from $w_{*}$.) For this reason, SMD with vanishing learning rate has also been considered

$$\nabla\psi(w_{i})=\nabla\psi(w_{i-1})-\eta_{i}\nabla L_{i}(w_{i-1}),\qquad w_{0}$$

where the learning rate is chosen such that $\eta_{i}\rightarrow 0$ . With a vanishing learning rate it is not surprising that one can attain convergence (since after a while the algorithm is barely updating the weight vector)—what is more interesting is the fact that under suitably decaying rates one can obtain convergence to a local minimum of $L(\cdot)$ (more on this below).

II-A Bregman Divergence

For any given strictly convex differentiable potential function $\psi(\cdot)$ , the Bregman divergence is defined as

$$D_{\psi}(w,w^{\prime})=\psi(w)-\psi(w^{\prime})-\nabla\psi(w^{\prime})^{T}(w-w^{\prime}).$$

In other words, the Bregman divergence is the difference between the value of the function $\psi(\cdot)$ at a point $w$ and the value of its linear (or first order) approximation around another point $w^{\prime}$ (see Fig. 1). Since a defining property of a convex function is that its linear approximations always lie below it, we have that $D_{\psi}(w,w^{\prime})\geq 0$. Furthermore, since $\psi(\cdot)$ is strictly convex, we have that $D_{\psi}(w,w^{\prime})=0$ iff $w=w^{\prime}$. Finally, it can be observed that $D_{\psi}(\cdot,\cdot)$ is convex in its first argument (but not necessarily in the second).

Since the Bregman divergence retains the quadratic (and higher order) terms in the error of the linear approximation of $\psi(w)$ around $w^{\prime}$ , it inherits many of the properties of quadratics. For example, the classical “law of cosines”

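$$\|w-w^{\prime}\|^{2}=\|w-w^{\prime\prime}\|^{2}+\|w^{\prime\prime}-w^{\prime}\|^{2}-2(w^{\prime}-w^{\prime\prime})^{T}(w-w^{\prime\prime})$$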

generalizes to

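$$D_{\psi}(w,w^{\prime})=D_{\psi}(w,w^{\prime\prime})+D_{\psi}(w^{\prime\prime},w^{\prime})-(\nabla\psi(w^{\prime})-\nabla\psi(w^{\prime\prime}))^{T}(w-w^{\prime\prime}).$$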

More important for our developments is the following generalization of “completion-of-squares”, which we formalize as a lemma.

Lemma 1.

Let $\psi_{1}(\cdot)$ and $\psi_{2}(\cdot)$ be strictly convex differentiable functions. Then it holds that

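$$D_{\psi_{1}}(w,w_{1})+D_{\psi_{2}}(w,w_{2})=D_{\psi_{1}}(w_{*},w_{1})+D_{\psi_{2}}(w_{*},w_{2})+D_{\psi_{1}+\psi_{2}}(w,w_{*}),$$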

where $w_{*}$ is the unique solution to the equation

$$\nabla(\psi_{1}+\psi_{2})(w_{*})=\nabla\psi_{1}(w_{1})+\nabla\psi_{2}(w_{2}).$$

Proof.

The identities can be verified by straightforward calculation. The uniqueness of $w_{*}$ follows from the fact that $\psi_{1}(\cdot)+\psi_{2}(\cdot)$ is strictly convex since it is the sum of two such functions.
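The calculation is short enough to spell out. As a sketch, expand the three divergences $D_{\psi_{1}}(w_{*},w_{1})$, $D_{\psi_{2}}(w_{*},w_{2})$ and $D_{\psi_{1}+\psi_{2}}(w,w_{*})$ with the definition of the Bregman divergence, and substitute $\nabla(\psi_{1}+\psi_{2})(w_{*})=\nabla\psi_{1}(w_{1})+\nabla\psi_{2}(w_{2})$ in the last one:

```latex
\begin{aligned}
&D_{\psi_1}(w_*,w_1)+D_{\psi_2}(w_*,w_2)+D_{\psi_1+\psi_2}(w,w_*)\\
&\quad=\psi_1(w_*)-\psi_1(w_1)-\nabla\psi_1(w_1)^T(w_*-w_1)
      +\psi_2(w_*)-\psi_2(w_2)-\nabla\psi_2(w_2)^T(w_*-w_2)\\
&\qquad+(\psi_1+\psi_2)(w)-(\psi_1+\psi_2)(w_*)
      -\big(\nabla\psi_1(w_1)+\nabla\psi_2(w_2)\big)^T(w-w_*)\\
&\quad=\psi_1(w)-\psi_1(w_1)-\nabla\psi_1(w_1)^T(w-w_1)
      +\psi_2(w)-\psi_2(w_2)-\nabla\psi_2(w_2)^T(w-w_2)\\
&\quad=D_{\psi_1}(w,w_1)+D_{\psi_2}(w,w_2).
\end{aligned}
```

The terms $\psi_{1}(w_{*})$ and $\psi_{2}(w_{*})$ cancel, and the inner products combine because $(w_{*}-w_{k})+(w-w_{*})=w-w_{k}$ for $k=1,2$.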

For example, if $\psi(w)=\|w\|^{2}$ then $D(w,w^{\prime})=\|w-w^{\prime}\|^{2}$ , and if $\psi(p)=-H(p)$ , where $p$ is a probability vector, then we get that $D_{-H}(p,p^{\prime})=\sum_{i}p_{i}\log\frac{p_{i}}{p^{\prime}_{i}}$ is the KL divergence (or relative entropy).
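These two examples can be checked numerically. The sketch below evaluates the Bregman divergence directly from its definition; the helper name `bregman` and the test vectors are illustrative assumptions.

```python
import numpy as np

def bregman(psi, grad_psi, w, w_prime):
    """D_psi(w, w') = psi(w) - psi(w') - grad_psi(w')^T (w - w')."""
    return psi(w) - psi(w_prime) - grad_psi(w_prime) @ (w - w_prime)

# psi(w) = ||w||^2, so grad_psi(w) = 2w.
sq = lambda w: w @ w
sq_grad = lambda w: 2.0 * w

# psi(p) = -H(p) = sum_i p_i log p_i, so grad_psi(p) = log(p) + 1.
neg_entropy = lambda p: np.sum(p * np.log(p))
neg_entropy_grad = lambda p: np.log(p) + 1.0

w, w_prime = np.array([1.0, -2.0, 0.5]), np.array([0.0, 1.0, 2.0])
p, p_prime = np.array([0.2, 0.5, 0.3]), np.array([0.4, 0.4, 0.2])

d_sq = bregman(sq, sq_grad, w, w_prime)                    # squared distance
d_kl = bregman(neg_entropy, neg_entropy_grad, p, p_prime)  # KL divergence
kl = np.sum(p * np.log(p / p_prime))                       # KL computed directly
```

The KL identity relies on both `p` and `p_prime` summing to one, which is why the extra linear term in the gradient of the negative entropy drops out.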

The last fact about the Bregman divergence that we would like to mention is that a random variable $w$ that has a distribution $w\sim e^{-D_{\psi}(\cdot,w_{0})}$ (i.e. $p(w)=ce^{-D_{\psi}(w,w_{0})}$ for a suitable normalization constant $c$ ) is a member of the exponential family of distributions, and satisfies the property

$$\mathbb{E}\,\nabla\psi(w)=\nabla\psi(w_{0}).$$

In other words, $w_{0}$ is the point whose mirror is the mean of the mirror map.
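A one-dimensional numerical check may make this property concrete. The sketch below assumes the potential $\psi(w)=\frac{1}{2}w^{2}+\frac{1}{4}w^{4}$ and the value $w_{0}=0.7$ (both chosen only for illustration), builds the density $p(w)\propto e^{-D_{\psi}(w,w_{0})}$ on a grid, and compares the mean of $\psi^{\prime}(w)$ with $\psi^{\prime}(w_{0})$ using a simple Riemann sum.

```python
import numpy as np

# Assumed strictly convex potential and its derivative (illustrative choice).
psi = lambda w: 0.5 * w**2 + 0.25 * w**4
dpsi = lambda w: w + w**3
w0 = 0.7

# Fine grid; the density is negligible far outside this interval.
grid = np.linspace(-8.0, 8.0, 160001)
dx = grid[1] - grid[0]

# Unnormalized density exp(-D_psi(w, w0)), then normalize numerically.
breg = psi(grid) - psi(w0) - dpsi(w0) * (grid - w0)
density = np.exp(-breg)
density /= density.sum() * dx

# Expected value of the mirror map under this density.
mean_mirror = np.sum(dpsi(grid) * density) * dx
```

The property follows from integrating the derivative of $e^{-D_{\psi}(w,w_{0})}$ over the real line, so `mean_mirror` should agree with `dpsi(w0)` up to quadrature error.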

II-B Parametric Models

It will now be useful to introduce some parametric models and make our loss functions more explicit. To this end, assume we have a collection of data points

$$\{(x_{i},y_{i}),\ i=1,\dots,n\}$$

where $x_{i}\in\mathbb{R}^{m}$ is the input and $y_{i}\in\mathbb{R}$ is the output. We will assume that the pairs $(x_{i},y_{i})$ are related through some parametric model

$$y_{i}=f(x_{i},w)+v_{i},\qquad i=1,\dots,n$$

where $f(\cdot,\cdot)$ is a given function and represents the modeling class we are considering, $w\in\mathbb{R}^{p}$ is the unknown weight vector (or parameter), and $v_{i}$ represents both measurement noise and modeling errors. In this setting, the global loss function can be written as

$$L(w)=\sum_{i=1}^{n}L_{i}(w),\qquad L_{i}(w)=\ell(y_{i},f(x_{i},w)),$$

where $\ell(\cdot,\cdot)$ is a (differentiable) local loss function, with the property that $\ell(y_{i},f(x_{i},w))=0$ iff $y_{i}=f(x_{i},w)$ . Often $\ell(y_{i},f(x_{i},w))=\ell(y_{i}-f(x_{i},w))$ , with $\ell(\cdot)$ convex and having a global minimum at zero. In this case,

$$L(w)=\sum_{i=1}^{n}\ell(y_{i}-f(x_{i},w)).$$

For example, for quadratic loss we obtain $L(w)=\sum_{i=1}^{n}\frac{1}{2}(y_{i}-f(x_{i},w))^{2}$ . For (11), SMD takes the explicit form

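$$\nabla\psi(w_{i})=\nabla\psi(w_{i-1})+\eta\frac{\partial f(x_{i},w_{i-1})}{\partial w}\ell^{\prime}(y_{i}-f(x_{i},w_{i-1})),\qquad w_{0}.$$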

An important special case is that of linear models

$$y_{i}=x_{i}^{T}w+v_{i},\qquad i=1,\dots,n$$

where SMD takes the form

$$\nabla\psi(w_{i})=\nabla\psi(w_{i-1})+\eta x_{i}\ell^{\prime}(y_{i}-x_{i}^{T}w_{i-1}),\qquad w_{0}.$$
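A minimal sketch of SMD for linear models follows. It assumes the square loss $\ell(v)=\frac{1}{2}v^{2}$ (so $\ell^{\prime}(v)=v$) and the potential $\psi(w)=\frac{1}{2}\|w\|^{2}+\frac{1}{3}\|w\|_{3}^{3}$, picked here only because its mirror map $\nabla\psi(w)=w+w|w|$ has a closed-form inverse. The function names, the small data set, and the step size are hypothetical.

```python
import numpy as np

def grad_psi(w):
    """Mirror map of the assumed potential 0.5*||w||^2 + (1/3)*||w||_3^3."""
    return w + w * np.abs(w)

def grad_psi_inv(z):
    """Inverse mirror map: solves |w| + |w|^2 = |z| coordinate-wise."""
    return np.sign(z) * (np.sqrt(1.0 + 4.0 * np.abs(z)) - 1.0) / 2.0

def smd_linear(X, y, w0, eta, epochs, loss_grad=lambda r: r):
    """Cycle through the data, updating in the mirror domain at each step."""
    w = w0.copy()
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            residual = y_i - x_i @ w
            z = grad_psi(w) + eta * x_i * loss_grad(residual)
            w = grad_psi_inv(z)
    return w

# Small hypothetical data set with more unknowns than measurements.
X = np.array([[1.0, 2.0, 0.0, 1.0], [0.0, 1.0, 1.0, -1.0]])
y = np.array([3.0, 1.0])
w0 = np.zeros(4)
w = smd_linear(X, y, w0, eta=0.1, epochs=200)
```

Because every update adds a multiple of some $x_{i}$ in the mirror domain, `grad_psi(w) - grad_psi(w0)` always stays in the span of the inputs, whatever the number of epochs.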

II-C Local and Global Interpretations of SMD

It is straightforward to show that at each iteration, SMD solves the following optimization problem:

$$w_{i}=\operatorname*{arg\,min}_{w}\ D_{\psi}(w,w_{i-1})+\eta w^{T}\nabla L_{i}(w_{i-1}),$$

which can be verified by setting the gradient of the right hand side of (15) to zero. What the above relation shows is that the SMD iterates try to align themselves with the direction of the instantaneous gradient, while also trying to stay close to the previous iterate in Bregman divergence. (The learning rate relatively weights these two objectives.) We refer to (15) as the local interpretation of SMD.
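For completeness, here is the stationarity computation written out, as a sketch that uses only the definition of $D_{\psi}$ and treats $w_{i-1}$ as fixed:

```latex
\nabla_w\Big[D_\psi(w,w_{i-1})+\eta\,w^T\nabla L_i(w_{i-1})\Big]
  =\nabla\psi(w)-\nabla\psi(w_{i-1})+\eta\nabla L_i(w_{i-1})=0
  \;\Longleftrightarrow\;
  \nabla\psi(w)=\nabla\psi(w_{i-1})-\eta\nabla L_i(w_{i-1}).
```

Since the objective is strictly convex in $w$, this stationary point is its unique minimizer, and it coincides with the SMD update.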

We have recently shown that SMD satisfies the following local conservation law [12, 6].

Lemma 2 (Local Conservation Law [12]).

Even though the loss function $L_{i}(w)=\ell(y_{i}-f(x_{i},w))$ may not be convex, define the Bregman divergence $D_{L_{i}}(w,w^{\prime})$ in the usual way. Further define the quantity

$$E_{i}(w_{i},w_{i-1}):=D_{\psi-\eta L_{i}}(w_{i},w_{i-1})+\eta L_{i}(w_{i}).$$

Then for each iteration of the SMD updates (12), it holds that

$$D_{\psi}(w,w_{i-1})+\eta\ell(v_{i})=D_{\psi}(w,w_{i})+\eta D_{L_{i}}(w,w_{i-1})+E_{i}(w_{i},w_{i-1}).$$

Summing the local identities in (17) from time 1 to time $T$ leads to the following global conservation law

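$$D_{\psi}(w,w_{0})+\eta\sum_{i=1}^{T}\ell(v_{i})=D_{\psi}(w,w_{T})+\eta\sum_{i=1}^{T}D_{L_{i}}(w,w_{i-1})+\sum_{i=1}^{T}E_{i}(w_{i},w_{i-1})$$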

Note that (18) holds for any horizon $T$ . We refer to it as the global interpretation of SMD. It can be used to show several remarkable deterministic properties of the SMD algorithm. We now mention a couple.
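The local conservation law of Lemma 2 can be checked numerically for a single step. The sketch below assumes a linear model with square loss $\ell(v)=\frac{1}{2}v^{2}$ and the potential $\psi(w)=\frac{1}{2}\|w\|^{2}+\frac{1}{3}\|w\|_{3}^{3}$; the vectors, the noise value, and the step size are hypothetical. It uses the fact that the Bregman divergence is linear in its potential, so $D_{\psi-\eta L_{i}}=D_{\psi}-\eta D_{L_{i}}$.

```python
import numpy as np

# Assumed potential, its mirror map, and the inverse mirror map.
psi = lambda w: 0.5 * w @ w + np.sum(np.abs(w) ** 3) / 3.0
grad_psi = lambda w: w + w * np.abs(w)
grad_psi_inv = lambda z: np.sign(z) * (np.sqrt(1.0 + 4.0 * np.abs(z)) - 1.0) / 2.0

def bregman(f, grad_f, a, b):
    """D_f(a, b) = f(a) - f(b) - grad_f(b)^T (a - b)."""
    return f(a) - f(b) - grad_f(b) @ (a - b)

# One data point generated from a reference vector w and a noise value v_i.
x_i = np.array([1.0, -0.5, 2.0])
w_ref = np.array([0.3, -1.2, 0.8])
v_i = 0.4
y_i = x_i @ w_ref + v_i
w_prev = np.array([0.1, 0.2, -0.3])
eta = 0.05

# Local loss L_i(w) = 0.5 * (y_i - x_i^T w)^2 and its gradient.
loss = lambda w: 0.5 * (y_i - x_i @ w) ** 2
loss_grad = lambda w: -(y_i - x_i @ w) * x_i

# One SMD step.
w_next = grad_psi_inv(grad_psi(w_prev) - eta * loss_grad(w_prev))

# E_i = D_{psi - eta L_i}(w_i, w_{i-1}) + eta * L_i(w_i).
e_i = (bregman(psi, grad_psi, w_next, w_prev)
       - eta * bregman(loss, loss_grad, w_next, w_prev)
       + eta * loss(w_next))

lhs = bregman(psi, grad_psi, w_ref, w_prev) + eta * 0.5 * v_i ** 2
rhs = (bregman(psi, grad_psi, w_ref, w_next)
       + eta * bregman(loss, loss_grad, w_ref, w_prev)
       + e_i)
```

The two sides agree to floating-point precision for any choice of `w_ref`, since the identity is purely algebraic; summing it over consecutive steps gives the global law.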

II-D Minimax Optimality of SMD

Using the aforementioned global identity, in [12, 6], the following has been shown.

Theorem 3 (Minimax Optimality [12]).

For any $T$ , provided $\eta$ is small enough so that $\psi(w)-\eta L_{i}(w)$ is convex for all $i$ , then

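$$\min_{\{w_{i}\}}\max_{w,\{v_{i}\}}\frac{D_{\psi}(w,w_{T})+\eta\sum_{i=1}^{T}D_{L_{i}}(w,w_{i-1})}{D_{\psi}(w,w_{0})+\eta\sum_{i=1}^{T}\ell(v_{i})}=1$$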

and SMD with learning rate $\eta$ is a minimax optimal algorithm achieving the above.

Theorem 3 is a generalization of the $H^{\infty}$ -optimality of the SGD algorithm for linear models and quadratic loss, where it is referred to as LMS [13, 14, 17], to SMD and general models and general losses. When the potential and loss are quadratic, we have $D_{\psi}(w,w_{0})=\|w-w_{0}\|^{2}$ and $\ell(v_{i})=v_{i}^{2}$ . The quantity $D_{L_{i}}(w,w_{i-1})=(y_{i}-x_{i}^{T}w)^{2}-(y_{i}-x_{i}^{T}w_{i-1})^{2}+2x_{i}^{T}(w-w_{i-1})(y_{i}-x_{i}^{T}w_{i-1})$ , after some simplification, takes on the form

$$D_{L_{i}}(w,w_{i-1})=(x_{i}^{T}(w-w_{i-1}))^{2},$$

which is the square of the so-called prediction error. In this case, we recover the $H^{\infty}$ -optimality of LMS, namely that it solves

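$$\min_{\{w_{i}\}}\max_{w,\{v_{i}\}}\frac{\|w-w_{T}\|^{2}+\eta\sum_{i=1}^{T}(x_{i}^{T}(w-w_{i-1}))^{2}}{\|w-w_{0}\|^{2}+\eta\sum_{i=1}^{T}v_{i}^{2}}$$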

and the optimal value is $1$ . As mentioned above, Theorem 3 generalizes $H^{\infty}$ -optimality in three ways: it holds for general potential, general loss function, and general nonlinear model.

II-E Convergence and Implicit Regularization

Another interesting property of SMD, which again can be proven using the global conservation law (18), is what is referred to as implicit regularization. In over-parameterized (underdetermined) models, which are common in compressed sensing and modern deep learning problems, there are (typically a lot) more parameters (unknowns) than data points (measurements). That means there are many parameter vectors (in fact infinitely many) that are consistent with the observations:

$$\mathcal{W}=\{w\in\mathbb{R}^{m}\mid y_{i}=x_{i}^{T}w,\ i=1,\dots,n\}.$$

The questions of interest in this regime are (1) does SMD converge to a solution? and (2) if it does so, which solution does it converge to? The following result answers these questions.

Theorem 4 (Convergence to the “Closest” Point [12]).

Suppose $\ell(\cdot)$ is differentiable and convex and has a unique root at $0$, $\psi(\cdot)$ is strictly convex, and $\eta>0$ is such that $\psi-\eta L_{i}$ is convex for all $i$. Then for any $w_{0}$, the SMD iterates converge to

$$w_{\infty}=\operatorname*{arg\,min}_{w\in\mathcal{W}}D_{\psi}(w,w_{0}).$$

Corollary 5 (Implicit Regularization [12]).

In particular, for the initialization $w_{0}=\operatorname*{arg\,min}_{w\in\mathbb{R}^{m}}\psi(w)$ , under the conditions of Theorem 4, the SMD iterates converge to

$$w_{\infty}=\operatorname*{arg\,min}_{w\in\mathcal{W}}\psi(w).$$

This means that running SMD, without any (explicit) regularization, results in a solution that has the smallest potential $\psi(\cdot)$ among all solutions, i.e., SMD implicitly regularizes the solution with $\psi(\cdot)$ . In principle, one can choose the potential function for any desired convex regularization. For example, we can find the maximum entropy solution by taking the potential to be the negative entropy, or do compressed sensing with $\psi(w)=\|w\|_{1+\epsilon}$ [12, 6].
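The simplest instance of this statement is easy to reproduce. In the sketch below, the potential is quadratic (so SMD is plain SGD), the model is linear with square loss, and the iterates start from $w_{0}=0$, the minimizer of $\psi$. Under these assumptions the limit should be the minimum-norm interpolating solution, which NumPy can compute independently through the pseudoinverse. The data and the step size are hypothetical.

```python
import numpy as np

# Underdetermined system: 2 measurements, 4 unknowns (hypothetical data).
X = np.array([[1.0, 2.0, 0.0, 1.0], [0.0, 1.0, 1.0, -1.0]])
y = np.array([3.0, 1.0])

eta = 0.1        # satisfies eta * ||x_i||^2 < 1 for both rows
w = np.zeros(4)  # w0 = argmin of psi(w) = 0.5 * ||w||^2

# SGD on the square loss, cycling through the two data points.
for _ in range(500):
    for x_i, y_i in zip(X, y):
        w = w + eta * x_i * (y_i - x_i @ w)

# Minimum Euclidean norm solution of X w = y, computed independently.
w_min_norm = np.linalg.pinv(X) @ y
```

No regularizer appears anywhere in the loop; the selection of the minimum-norm solution comes only from the initialization and the form of the update.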

We should remark that the result extends to quasi-convex losses $\ell(\cdot)$ , and it holds locally (in an approximate sense) even for nonlinear models (non-convex cost).

III Main Results

The results about SMD discussed in the previous section were deterministic. In this section, we give a stochastic interpretation of SMD, and show that it is risk-sensitive optimal.

III-A Risk-Sensitive Optimality of SMD

Consider a stochastic model $y_{i}=x_{i}^{T}w+v_{i},i\geq 1$, where $w$ and $\{v_{i}\}$ are independent random variables with distributions $w\sim e^{-\frac{1}{\eta}D_{\psi}(\cdot,w_{0})}$ and $v_{i}\sim e^{-\ell(\cdot)}$, which are members of the exponential family (note that when the potential function $\psi(\cdot)$ and the loss $\ell(\cdot)$ are quadratic, both of these are Gaussian). A conventional quadratic estimator is one that minimizes the expected sum of squared prediction errors, i.e.,

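$$\min_{\{z_{i}\}}\mathbb{E}_{\mid\{y_{i}\}}\left[\frac{1}{2}\sum_{i=1}^{T}(x_{i}^{T}w-z_{i})^{2}\right],$$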

where the expectation is taken over $w$ and $\{v_{i}\}$ conditioned on the observations, and each $z_{i}$ in the minimization can only be a function of observations until time $i-1$ . For various problems, one may be interested in cost functions more general than quadratic, i.e.,

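$$\min_{\{z_{i}\}}\mathbb{E}_{\mid\{y_{i}\}}\left[\sum_{i=1}^{T}D_{\ell}(y_{i}-x_{i}^{T}w,y_{i}-z_{i})\right].$$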

The estimators that solve problems (23) and (24) are referred to as “risk-neutral” estimators.

An alternative criterion is the “risk-sensitive” (or exponential cost) criterion, which was first introduced in [18] and studied in [19, 20, 21]. In particular, an estimator that solves the problem

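$$\min_{\{z_{i}\}}\mathbb{E}_{\mid\{y_{i}\}}\exp\left(\frac{1}{2}\sum_{i=1}^{T}(x_{i}^{T}w-z_{i})^{2}\right),$$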

is called a “risk-averse” estimator. The reason is that such a criterion places very large weights on large errors, so the estimator is more concerned with large errors, even when they occur rarely, than with moderate ones.
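A toy comparison shows how the two criteria can disagree. The sketch below uses two hypothetical discrete error distributions with the same mean-square error: one always makes an error of size 1, the other is usually exact but occasionally makes an error of size 2. The function names and the distributions are illustrative assumptions, and a single time step stands in for the sum over $i$.

```python
import numpy as np

def risk_neutral(errors, probs):
    """Expected quadratic cost E[0.5 * e^2]."""
    return np.sum(probs * 0.5 * errors**2)

def risk_sensitive(errors, probs):
    """Expected exponential cost E[exp(0.5 * e^2)]."""
    return np.sum(probs * np.exp(0.5 * errors**2))

# "Steady": error of size 1 every time.
steady_errors, steady_probs = np.array([1.0]), np.array([1.0])
# "Spiky": exact 75% of the time, error of size 2 the remaining 25%.
spiky_errors, spiky_probs = np.array([0.0, 2.0]), np.array([0.75, 0.25])

neutral_steady = risk_neutral(steady_errors, steady_probs)
neutral_spiky = risk_neutral(spiky_errors, spiky_probs)
sensitive_steady = risk_sensitive(steady_errors, steady_probs)
sensitive_spiky = risk_sensitive(spiky_errors, spiky_probs)
```

The risk-neutral cost is 0.5 for both distributions, while the exponential cost is larger for the spiky one, so a risk-sensitive estimator would prefer the steady errors.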

As in (24), one can consider an exponential cost of errors measured with a more general distance than quadratic, i.e.,

[TABLE]

It has been shown in [14, 13] that SGD for square loss (aka LMS) solves the problem (25). In other words, LMS is risk-sensitive optimal. Formally, the result is as follows.

Theorem 6 (Hassibi et al. [13]).

Consider the model $y_{i}=x_{i}^{T}w+v_{i},i\geq 1$, where $w$ and $\{v_{i}\}$ are independent Gaussian random variables with means $w_{0}$ and $0$ and variances $\eta I$ and $I$, respectively. Further, suppose that $\{x_{i}\}$ are persistently exciting and $0<\eta<\frac{1}{\|x_{i}\|^{2}},\forall i$. Then the solution to the following optimization problem

[TABLE]

where the expectation is taken over $w$ conditioned on the observations, and $z_{i}$ is only allowed to depend on observations up to time $i-1$ , is given by $z_{i}=x_{i}^{T}w_{i-1}$ , where $\{w_{i}\}$ are the SGD iterates.

We should further remark that no larger exponent than $1/2$ is possible (no algorithm can attain a finite cost if the exponent is larger than $1/2$ ).

The following result generalizes the risk-sensitive optimality of SGD for quadratic errors, to that of SMD for general Bregman-divergence errors.

Theorem 7.

Consider the model $y_{i}=x_{i}^{T}w+v_{i},i\geq 1$, where $w$ and $\{v_{i}\}$ are independent random variables with distributions $w\sim e^{-\frac{1}{\eta}D_{\psi}(\cdot,w_{0})}$ and $v_{i}\sim e^{-\ell(\cdot)}$. Further, suppose that $\{x_{i}\}$ are persistently exciting, and $\psi-\eta L_{i}$ is strictly convex for all $i$. Then the solution to the following optimization problem

[TABLE]

where the expectation is taken over $w$ conditioned on the observations, and $z_{i}$ is only allowed to depend on observations up to time $i-1$ , is given by $z_{i}=x_{i}^{T}w_{i-1}$ , where $\{w_{i}\}$ are the SMD iterates.

III-B Proof of Theorem 7

The expected exponential cost that needs to be minimized in Theorem 7 is given by

[TABLE]

where $C$ is a normalization constant that guarantees we are integrating the cost against a conditional distribution. The challenge in evaluating the above integral over $w$ is that $w$ appears in all three terms of the exponent. In order to facilitate the computation of this integral, it will be useful to use the completion-of-squares formula of Lemma 1 to gather $w$ into a single term. The following lemma provides precisely what we need.

Lemma 8.

It holds that

[TABLE]

where the $w_{i}$ , $i=1,\ldots,T$ are given by the recursion

[TABLE]

Proof.

The proof is based on telescopically summing the local identity

[TABLE]

from $i=1$ to $i=T$, where the $w_{i}$ are given through the recursion (27). This local identity can be either verified directly or obtained through two successive uses of Lemma 1.

As promised, Lemma 8 gathers $w$ into a single term so that the integral over $w$ can be performed. Once this integral is performed, we are left with the following cost function

[TABLE]

where $C^{\prime}$ is a constant obtained after integrating out $w$ . The above cost function must be recursively minimized over the $z_{i}$ , which are only allowed to be functions of $\{y_{j},j<i\}$ , respectively. It is not clear how to do so from the above expression. The next lemma provides an identity that makes this recursive minimization straightforward.

Lemma 9.

It holds that

[TABLE]

Proof.

This can be verified by perhaps tedious, but straightforward, calculations.

In view of Lemma 9, the cost function to recursively minimize is

[TABLE]

Note that, at any time $i$ , the only term that $z_{i}$ has control over (in the sense that it is a term that depends only on past $y_{j}$ ) is the term

[TABLE]

(The other terms that are influenced by $z_{i}$ , such as $w_{i}$ , are influenced also by $y_{i}$ —see (27)—so that $z_{i}$ cannot knowledgeably minimize them.) The term $D_{\ell}(y_{i}-x_{i}^{T}w_{i-1},y_{i}-z_{i})$ can be minimized, and in fact set to zero, by taking

[TABLE]

which, when plugged into (27), yields SMD. This completes the proof. (The attentive reader will have noticed that we needed Lemma 9 since it was not clear how to minimize $D_{\ell}(y_{i}-x_{i}^{T}w_{i},y_{i}-z_{i})$ over $z_{i}$: we could not have taken $z_{i}=x_{i}^{T}w_{i}$, because $w_{i}$ depends on $y_{i}$ and $z_{i}$ is not allowed to.)

III-C Symmetric SMD (SSMD)

Our proof of the risk-sensitive optimality of SMD has led us to an alternative, and more symmetric version, of the algorithm that we refer to as symmetric SMD (or SSMD) and which may be of independent interest. The SSMD iterations are given by

[TABLE]

SSMD satisfies the following risk-sensitive optimality.

Theorem 10.

Consider the model $y_{i}=x_{i}^{T}w+v_{i},i\geq 1$ , where $w$ and $\{v_{i}\}$ are independent random variables with $w|\{y_{i}\}\sim e^{-\frac{1}{\eta}D_{\psi}(\cdot,w_{0})-D_{\ell}(x_{i}^{T}\cdot,y_{i})}$ . Further, suppose that $\{x_{i}\}$ are persistently exciting, and $\psi-\eta L_{i}$ is strictly convex for all $i$ . Then the solution to the following optimization problem

[TABLE]

where the expectation is taken over $w$ conditioned on the observations, and $z_{i}$ is only allowed to depend on observations up to time $i-1$ , is given by $z_{i}=x_{i}^{T}w_{i-1}$ , where $\{w_{i}\}$ are the SSMD iterates.

Proof.

The proof is similar to that of Theorem 7 and is omitted for brevity.

We note that the difference between SMD and SSMD is that the noise is now distributed according to $v_{i}\sim e^{-D_{\ell}(x_{i}^{T}w,y_{i})}$, rather than $v_{i}\sim e^{-\ell(y_{i}-x_{i}^{T}w)}$, and that the exponent of the cost function is $D_{\ell}(x_{i}^{T}w,z_{i})$, rather than $D_{\ell}(y_{i}-x_{i}^{T}w,y_{i}-z_{i})$. The distributions and costs for SSMD appear to be more natural.

IV Other Stochastic Results

In the previous sections, we showed several fundamental deterministic and stochastic properties of SMD. One may ask how these results relate to conventional mean-square convergence results, such as [8]. It turns out that the fundamental identity of SMD (the conservation law (18)) allows proving such stochastic convergence results in a direct way, which avoids appealing to stochastic differential equations and ergodic averaging [6].

As mentioned before, for vanishing step size, convergence of any algorithm is not surprising, and is in fact trivial (because the weights are eventually no longer updated). However, the more interesting question is whether the algorithm converges to anything interesting. It turns out that when the data points are generated according to a stochastic model with white noise, SMD converges to the “true” parameter. More specifically, consider a model $y_{i}=x_{i}^{T}w+v_{i},i\geq 1,$ where $v_{i}$ are iid with ${\mathbb{E}}\left[v_{i}\right]=0$ and ${\mathbb{E}}\left[v_{i}^{2}\right]=\sigma^{2}$, and the inputs $x_{i}$ are “persistently exciting,” i.e., for any $\delta>0$, there exists $T>0$ s.t. $\sum_{i=1}^{T}x_{i}x_{i}^{T}\succeq\delta I$. Note that this is different from the setting of Theorem 7, in that the noises $v_{i}$ need not be Gaussian or from the exponential family (the only assumption is whiteness), and the parameter $w$ is deterministic. One can show that SMD with decaying step size indeed converges to $w$, under suitable conditions on the step size sequence.

Theorem 11.

Consider the model $y_{i}=x_{i}^{T}w+v_{i},i\geq 1,$ where ${\mathbb{E}}\left[v_{i}\right]=0$, ${\mathbb{E}}\left[v_{i}v_{j}\right]=\sigma^{2}\delta_{ij}$, and the $x_{i}$ are persistently exciting. The stochastic mirror descent iterates for any strongly convex potential $\psi(\cdot)$, and a convex loss $\ell(\cdot)$ with a unique root at $0$, converge to $w$ in a mean-square sense, if the step size sequence $\{\eta_{i}\}$ satisfies $\sum_{i=1}^{\infty}\eta_{i}=\infty,\sum_{i=1}^{\infty}\eta_{i}^{2}<\infty$.

The step size conditions $\sum_{i=1}^{\infty}\eta_{i}=\infty,\sum_{i=1}^{\infty}\eta_{i}^{2}<\infty$ are known as Robbins–Monro [22] conditions.
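As a rough simulation of this setting, the sketch below runs SMD with the decaying step size $\eta_{i}=5/(i+50)$, which satisfies both conditions. Everything specific here is an assumption made for illustration: the strongly convex potential $\psi(w)=\frac{1}{2}\|w\|^{2}+\frac{1}{3}\|w\|_{3}^{3}$, the square loss, Gaussian inputs and noise, the true parameter, the seed, and the number of iterations.

```python
import numpy as np

# Mirror map of the assumed potential and its closed-form inverse.
grad_psi = lambda w: w + w * np.abs(w)
grad_psi_inv = lambda z: np.sign(z) * (np.sqrt(1.0 + 4.0 * np.abs(z)) - 1.0) / 2.0

rng = np.random.default_rng(0)
w_true = np.array([0.5, -0.3, 0.8])  # deterministic "true" parameter
sigma = 0.1                          # noise standard deviation

w = np.zeros(3)
for i in range(1, 20001):
    # Fresh input and noisy observation at every iteration (online setting).
    x_i = rng.standard_normal(3)
    y_i = x_i @ w_true + sigma * rng.standard_normal()
    # Step sizes with divergent sum and finite sum of squares.
    eta_i = 5.0 / (i + 50.0)
    # SMD update with square loss: move in the mirror domain, then map back.
    w = grad_psi_inv(grad_psi(w) + eta_i * x_i * (y_i - x_i @ w))

error = np.linalg.norm(w - w_true)
```

With a constant step size the iterates would keep fluctuating at a noise-dependent level; the decaying schedule is what lets the error continue to shrink as more data arrives.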

V Conclusion

In this paper, we reviewed several fundamental properties of the stochastic mirror descent (SMD) family of algorithms, and provided a new stochastic interpretation of them, namely, that they are risk-sensitive optimal. The result generalizes a known result in the literature about the special case of SGD (aka LMS). Our analysis inspired a new algorithm, which is a “more symmetric” variant of SMD. Future work may concern studying this new algorithm and its convergence properties in more detail.

Bibliography (22)

[1] A. Nemirovski and D. B. Yudin, “Problem complexity and method efficiency in optimization,” 1983.
[2] A. Beck and M. Teboulle, “Mirror descent and nonlinear projected subgradient methods for convex optimization,” Operations Research Letters, vol. 31, no. 3, pp. 167–175, 2003.
[3] N. Cesa-Bianchi, P. Gaillard, G. Lugosi, and G. Stoltz, “Mirror descent meets fixed share (and feels no regret),” in Advances in Neural Information Processing Systems, 2012, pp. 980–988.
[4] Z. Zhou, P. Mertikopoulos, N. Bambos, S. Boyd, and P. W. Glynn, “Stochastic mirror descent in variationally coherent optimization problems,” in Advances in Neural Information Processing Systems, 2017, pp. 7043–7052.
[5] A. Nedic and S. Lee, “On stochastic subgradient mirror-descent algorithm with weighted averaging,” SIAM Journal on Optimization, vol. 24, no. 1, pp. 84–107, 2014.
[6] N. Azizan and B. Hassibi, “A characterization of stochastic mirror descent algorithms and their convergence properties,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
[7] M. Raginsky and J. Bouvrie, “Continuous-time stochastic mirror descent on a network: Variance reduction, consensus, convergence,” in 2012 IEEE 51st IEEE Conference on Decision and Control (CDC). IEEE, 2012, pp. 6793–6800.
[8] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, “Robust stochastic approximation approach to stochastic programming,” SIAM Journal on Optimization, vol. 19, no. 4, pp. 1574–1609, 2009.
