A Stochastic Interpretation of Stochastic Mirror Descent: Risk-Sensitive Optimality
Navid Azizan, Babak Hassibi

TL;DR
This paper presents a new interpretation of stochastic mirror descent as a risk-sensitive optimal estimator within exponential family distributions, and proposes a modified symmetric version of SMD.
Contribution
It introduces a risk-sensitive interpretation of SMD and proposes a symmetric variant, extending theoretical understanding of these algorithms in non-Gaussian settings.
Findings
SMD can be viewed as a risk-sensitive estimator for exponential family distributions.
A modified symmetric SMD (SSMD) is proposed based on this interpretation.
The analysis extends SMD properties beyond Gaussian assumptions using Bregman divergence.
Navid Azizan and Babak Hassibi
This work was supported in part by the National Science Foundation under grants CCF-1423663, CCF-1409204 and ECCS-1509977, by a grant from Qualcomm Inc., by NASA’s Jet Propulsion Laboratory through the President and Director’s Fund, and by an Amazon (AWS) AI Fellowship. N. Azizan is with the Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA 91125, USA. B. Hassibi is with the Department of Electrical Engineering, California Institute of Technology, Pasadena, CA 91125, USA.
Abstract
Stochastic mirror descent (SMD) is a fairly new family of algorithms that has recently found a wide range of applications in optimization, machine learning, and control. It can be considered a generalization of the classical stochastic gradient algorithm (SGD), where instead of updating the weight vector along the negative direction of the stochastic gradient, the update is performed in a “mirror domain” defined by the gradient of a (strictly convex) potential function. This potential function, and the mirror domain it yields, provides considerable flexibility in the algorithm compared to SGD. While many properties of SMD have already been obtained in the literature, in this paper we exhibit a new interpretation of SMD, namely that it is a risk-sensitive optimal estimator when the unknown weight vector and additive noise are non-Gaussian and belong to the exponential family of distributions. The analysis also suggests a modified version of SMD, which we refer to as symmetric SMD (SSMD). The proofs rely on some simple properties of Bregman divergence, which allow us to extend results from quadratics and Gaussians to certain convex functions and exponential families in a rather seamless way.
I Introduction
Stochastic mirror descent (SMD), which includes the popular stochastic gradient descent (SGD) as a special case, has become one of the most widely used families of algorithms for optimization, machine learning, and beyond [1, 2, 3, 4, 5, 6, 7]. The convergence behavior of such algorithms has been extensively studied in the literature [8, 9] under various assumptions. Several other properties and interpretations of SMD have recently been proven in the literature [10, 11]. In earlier work, we have demonstrated a fundamental conservation law for SMD and have used it to establish properties such as minimax optimality, deterministic convergence, and implicit regularization [12, 6]. The main contribution of this paper is to provide a new stochastic interpretation of SMD, i.e., that it is risk-sensitive optimal. This generalizes a similar result about SGD in the literature [13, 14]. We also propose a new “more symmetric” version of SMD, called symmetric SMD (SSMD), which is suggested by our analysis.
The paper is organized as follows. We review the main properties of SMD and the notion of Bregman divergence in Section II. The risk-sensitive optimality result and its proof, as well as the new SSMD algorithm, are provided in Section III. We finally mention another stochastic result about SMD in Section IV, and conclude in Section V.
II Background
Consider a separable loss function of some unknown parameter (or weight) vector $w$:
$$L(w) = \sum_{i=1}^{n} L_i(w),$$
where the $L_i(\cdot)$ are called the instantaneous (or local) loss functions, and where our goal is to minimize $L(\cdot)$ over $w$. For example, the conventional gradient descent (GD) algorithm can be used as an attempt to perform such minimization. A generalization of GD, called the mirror descent (MD) algorithm, was first introduced by Nemirovski and Yudin [1] and can be described as follows. Consider a strictly convex differentiable function $\psi(\cdot)$, called the potential function. Then MD is given by the following recursion
$$\nabla\psi(w_i) = \nabla\psi(w_{i-1}) - \eta \nabla L(w_{i-1}), \tag{1}$$
where $\eta > 0$ is known as the step size or learning rate. Note that, due to the strict convexity of $\psi(\cdot)$, the gradient $\nabla\psi(\cdot)$ defines an invertible map so that the recursion in (1) yields a unique $w_i$ at each iteration. Compared to classical GD, rather than updating the weight vector along the direction of the negative gradient, the update is done in the “mirrored” domain determined by the invertible transformation $\nabla\psi(\cdot)$. Mirror descent was originally conceived to exploit the geometrical structure of the problem by choosing an appropriate potential. Note that MD reduces to GD when $\psi(w) = \frac{1}{2}\|w\|^2$, since the gradient is simply the identity map. Other examples include the exponentiated gradient descent (aka the exponential weights) and the $p$-norms algorithm [15, 16]. As with GD, it is straightforward to show that MD converges to a local minimum of $L(\cdot)$, provided the step size is small enough.
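As a minimal sketch of the recursion (1), the snippet below runs mirror descent with two potentials: the quadratic potential (identity mirror map, i.e., plain GD) and the unnormalized negative entropy $\psi(w) = \sum_j (w_j \log w_j - w_j)$, whose mirror map is $\log w$ (an exponentiated-gradient-style update). The toy loss, the target vector, the step size, and all function names are assumptions made for illustration only.

```python
import math

def md_step(w, grad, eta, mirror, mirror_inv):
    """One mirror descent step: grad_psi(w_new) = grad_psi(w) - eta * grad."""
    z = [m - eta * g for m, g in zip(mirror(w), grad)]
    return mirror_inv(z)

def identity(v):
    # Mirror map of psi(w) = 0.5 * ||w||^2 (and its inverse): plain GD.
    return list(v)

def log_map(v):
    # Mirror map of psi(w) = sum_j (w_j log w_j - w_j), defined for w_j > 0.
    return [math.log(x) for x in v]

def exp_map(v):
    # Inverse of log_map.
    return [math.exp(x) for x in v]

TARGET = [0.2, 0.5, 0.3]  # hypothetical minimizer of the toy loss

def grad_loss(w):
    # Gradient of the toy loss L(w) = 0.5 * ||w - TARGET||^2.
    return [wj - tj for wj, tj in zip(w, TARGET)]

w_gd = [1.0, 1.0, 1.0]
w_eg = [1.0, 1.0, 1.0]
for _ in range(500):
    w_gd = md_step(w_gd, grad_loss(w_gd), 0.1, identity, identity)
    w_eg = md_step(w_eg, grad_loss(w_eg), 0.1, log_map, exp_map)

print(w_gd, w_eg)
```

Both runs approach the same minimizer here; only the path taken through weight space depends on the potential.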
When $n$ is large, computation of the entire gradient $\nabla L(\cdot)$ may be cumbersome. Alternatively, in online scenarios, the entire loss function $L(\cdot)$ may not be available and only the local loss functions $L_i(\cdot)$ may be provided at each iteration. In such settings, a stochastic version of MD has been introduced, aptly called stochastic mirror descent (SMD), which can be considered the straightforward generalization of stochastic gradient descent (SGD):
$$\nabla\psi(w_i) = \nabla\psi(w_{i-1}) - \eta \nabla L_i(w_{i-1}). \tag{2}$$
In the offline setting, the various instantaneous loss functions can either be drawn at random, or cycled through periodically. In the online setting, they are provided at each iteration. Unlike MD (and GD), for a fixed step size $\eta$, SMD does not generally converge, unless there exists a $w$ that simultaneously minimizes every local loss function $L_i(\cdot)$. (Footnote 1: If this is not the case, then even if the current estimate were at a local minimum $w^*$ of the global loss function $L(\cdot)$, any of the local gradients $\nabla L_i(w^*)$ could be nonzero, which would move us away from $w^*$.) For this reason, SMD with vanishing learning rate has also been considered:
$$\nabla\psi(w_i) = \nabla\psi(w_{i-1}) - \eta_i \nabla L_i(w_{i-1}), \tag{3}$$
where the learning rate is chosen such that $\eta_i \to 0$. With a vanishing learning rate it is not surprising that one can attain convergence (since after a while the algorithm is barely updating the weight vector); what is more interesting is that under suitably decaying rates one can obtain convergence to a local minimum of $L(\cdot)$ (more on this below).
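The difference between a fixed and a vanishing step size can be seen in a small sketch. It assumes the quadratic potential (so SMD is SGD), a scalar weight, and two hypothetical local losses $L_i(w) = \frac{1}{2}(w - c_i)^2$ with $c_i$ alternating between $+1$ and $-1$, so that no single $w$ minimizes both while the global loss is minimized at $w = 0$.

```python
def sgd_two_losses(step_size, n_steps=1000, w0=0.0):
    """SMD with quadratic potential on L_i(w) = 0.5 * (w - c_i)^2,
    where c_i cycles through +1, -1, +1, ..."""
    w = w0
    for i in range(1, n_steps + 1):
        c = 1.0 if i % 2 == 1 else -1.0
        grad = w - c                    # gradient of the local loss L_i
        w = w - step_size(i) * grad     # SGD step with step size eta_i
    return w

# Fixed step size: the iterate keeps oscillating between +1/3 and -1/3.
w_const = sgd_two_losses(lambda i: 0.5)

# Vanishing step size eta_i = 1/i: the iterate is the running mean of the c_i
# and settles at the minimizer of the global loss, w = 0.
w_decay = sgd_two_losses(lambda i: 1.0 / i)

print(w_const, w_decay)
```

With the fixed step size the last iterate is pulled toward whichever local loss was seen last, which is exactly the failure mode described in the footnote.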
II-A Bregman Divergence
For any given strictly convex differentiable potential function $\psi(\cdot)$, the Bregman divergence is defined as
$$D_\psi(w, w') = \psi(w) - \psi(w') - \nabla\psi(w')^T (w - w').$$
In other words, the Bregman divergence is the difference between the value of the function at a point and the value of its linear (or first order) approximation around another point (see Fig. 1). Since a defining property of a convex function is that its linear approximations always lie below it, we have that $D_\psi(w, w') \ge 0$. Furthermore, since $\psi(\cdot)$ is strictly convex, we have that $D_\psi(w, w') = 0$ iff $w = w'$. Finally, it can be observed that $D_\psi(\cdot, \cdot)$ is convex in its first argument (but not necessarily in the second).
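The definition can be sketched directly in code. The snippet below assumes two standard potentials: the quadratic $\psi(w) = \frac{1}{2}\|w\|^2$, for which the divergence is half the squared Euclidean distance, and the negative entropy $\psi(w) = \sum_j w_j \log w_j$, for which the divergence between probability vectors is the KL divergence. The helper names and the vectors `p` and `q` are illustrative choices.

```python
import math

def bregman(psi, grad_psi, w, wp):
    """D_psi(w, wp) = psi(w) - psi(wp) - <grad_psi(wp), w - wp>."""
    inner = sum(g * (a - b) for g, a, b in zip(grad_psi(wp), w, wp))
    return psi(w) - psi(wp) - inner

def sq(w):
    return 0.5 * sum(x * x for x in w)

def sq_grad(w):
    return list(w)

def negent(w):
    return sum(x * math.log(x) for x in w)

def negent_grad(w):
    return [math.log(x) + 1.0 for x in w]

p = [0.2, 0.5, 0.3]   # two probability vectors
q = [0.4, 0.4, 0.2]

d_sq = bregman(sq, sq_grad, p, q)               # half squared distance
d_ent = bregman(negent, negent_grad, p, q)      # Bregman divergence of negative entropy
kl_pq = sum(a * math.log(a / b) for a, b in zip(p, q))  # KL(p || q) computed directly
d_ent_rev = bregman(negent, negent_grad, q, p)  # arguments swapped

print(d_sq, d_ent, kl_pq, d_ent_rev)
```

The last value differs from `d_ent`, a reminder that the Bregman divergence is not symmetric in general.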
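Since the Bregman divergence retains the quadratic (and higher order) terms in the error of the linear approximation of $\psi(\cdot)$ around a point, it inherits many of the properties of quadratics. For example, the classical “law of cosines”
$$\|a - b\|^2 = \|a - c\|^2 + \|c - b\|^2 - 2 (a - c)^T (b - c)$$
generalizes to
$$D_\psi(a, b) = D_\psi(a, c) + D_\psi(c, b) - (a - c)^T \big(\nabla\psi(b) - \nabla\psi(c)\big).$$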
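A quick numerical check of the generalized law of cosines, written here in its standard three-point form $D_\psi(a,b) = D_\psi(a,c) + D_\psi(c,b) - (a-c)^T(\nabla\psi(b) - \nabla\psi(c))$. The notation is our reading of the identity and may differ from the authors' exact statement; the negative-entropy potential and the three test vectors are arbitrary illustrative choices.

```python
import math

def psi(w):
    # Negative entropy potential, defined for positive entries.
    return sum(x * math.log(x) for x in w)

def grad_psi(w):
    return [math.log(x) + 1.0 for x in w]

def bregman(w, wp):
    inner = sum(g * (a - b) for g, a, b in zip(grad_psi(wp), w, wp))
    return psi(w) - psi(wp) - inner

a = [0.2, 0.5, 0.3]
b = [0.4, 0.4, 0.2]
c = [0.1, 0.3, 0.6]

lhs = bregman(a, b)
cross = sum((ai - ci) * (gb - gc)
            for ai, ci, gb, gc in zip(a, c, grad_psi(b), grad_psi(c)))
rhs = bregman(a, c) + bregman(c, b) - cross

print(lhs, rhs)
```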
More important for our developments is the following generalization of “completion-of-squares”, which we formalize as a lemma.
Lemma 1.
Let and be strictly convex differentiable functions. Then it holds that
[TABLE]
where is the unique solution to the equation
[TABLE]
Proof.
The identities can be verified by straightforward calculation. The uniqueness of follows from the fact that is strictly convex since it is the sum of two such functions.
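For example, if $\psi(w) = \frac{1}{2}\|w\|^2$ then $D_\psi(w, w') = \frac{1}{2}\|w - w'\|^2$, and if $\psi(w) = \sum_j w_j \log w_j$, where $w$ is a probability vector, then we get that $D_\psi(w, w') = \sum_j w_j \log (w_j / w'_j)$ is the KL divergence (or relative entropy).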
The last fact about the Bregman divergence that we would like to mention is that a random variable $w$ that has a distribution $p(w) \propto \exp(-D_\psi(w, w_0))$ (i.e., $p(w) = c \exp(-D_\psi(w, w_0))$ for a suitable normalization constant $c$) is a member of the exponential family of distributions, and satisfies the property
$$\mathbb{E}\, \nabla\psi(w) = \nabla\psi(w_0).$$
In other words, $w_0$ is the point whose mirror is the mean of the mirror map.
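The mean-of-the-mirror property can be checked numerically in one dimension. The sketch below assumes our reading of the statement, namely a density proportional to $\exp(-D_\psi(w, w_0))$ and the identity $\mathbb{E}\,\nabla\psi(w) = \nabla\psi(w_0)$, together with a hypothetical potential $\psi(w) = \frac{1}{2}w^2 + \frac{1}{4}w^4$ on the real line, chosen so that the density vanishes at infinity. The integral is approximated by a Riemann sum on a fine grid.

```python
import math

def psi(w):
    return 0.5 * w * w + 0.25 * w ** 4

def dpsi(w):
    # Mirror map: derivative of the potential.
    return w + w ** 3

def bregman(w, w0):
    return psi(w) - psi(w0) - dpsi(w0) * (w - w0)

w0 = 0.7
h = 1e-3
grid = [-8.0 + h * k for k in range(16001)]            # covers [-8, 8]
weights = [math.exp(-bregman(w, w0)) for w in grid]    # unnormalized density

normalizer = sum(weights) * h
mean_mirror = sum(dpsi(w) * p for w, p in zip(grid, weights)) * h / normalizer

print(mean_mirror, dpsi(w0))
```

The two printed numbers agree to high accuracy, while the plain mean of $w$ under this density is in general not $w_0$.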
II-B Parametric Models
It will now be useful to introduce some parametric models and make our loss functions more explicit. To this end, assume we have a collection of data points
$$\{(x_i, y_i) : i = 1, \dots, n\},$$
where $x_i$ is the input and $y_i$ is the output. We will assume that the pairs $(x_i, y_i)$ are related through some parametric model
$$y_i = f(x_i, w) + v_i, \qquad i = 1, \dots, n,$$
where $f(\cdot, \cdot)$ is a given function that represents the modeling class we are considering, $w$ is the unknown weight vector (or parameter), and $v_i$ represents both measurement noise and modeling errors. In this setting, the global loss function can be written as
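$$L(w) = \sum_{i=1}^{n} L_i(w),$$
where $L_i(\cdot)$ is a (differentiable) local loss function, with the property that $L_i(w) = 0$ iff $y_i = f(x_i, w)$. Often $L_i(w) = \ell(y_i - f(x_i, w))$, with $\ell(\cdot)$ convex and having a global minimum at zero. In this case,
$$L(w) = \sum_{i=1}^{n} \ell\big(y_i - f(x_i, w)\big). \tag{11}$$
For example, for quadratic loss $\ell(z) = \frac{1}{2} z^2$ we obtain $L(w) = \frac{1}{2} \sum_{i=1}^{n} (y_i - f(x_i, w))^2$. For (11), SMD takes the explicit form
$$\nabla\psi(w_i) = \nabla\psi(w_{i-1}) + \eta\, \ell'\big(y_i - f(x_i, w_{i-1})\big) \nabla f(x_i, w_{i-1}). \tag{12}$$
An important special case is that of linear models
$$y_i = x_i^T w + v_i,$$
where SMD takes the form
$$\nabla\psi(w_i) = \nabla\psi(w_{i-1}) + \eta\, \ell'\big(y_i - x_i^T w_{i-1}\big) x_i.$$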
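A minimal sketch of SMD for a linear model with quadratic loss, where the update reads $\nabla\psi(w_i) = \nabla\psi(w_{i-1}) + \eta\,(y_i - x_i^T w_{i-1})\,x_i$. The potential $\psi(w) = \sum_j \cosh(w_j)$, whose mirror map is $\sinh$ with inverse $\operatorname{asinh}$, is a hypothetical choice made only because both maps are available in the standard library; the data, step size, and function names are likewise illustrative.

```python
import math

def smd_linear(xs, ys, eta, mirror, mirror_inv, w0, epochs):
    """Cycle through the data, applying the SMD update in the mirror domain."""
    w = list(w0)
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            err = y - sum(xj * wj for xj, wj in zip(x, w))   # y_i - x_i^T w_{i-1}
            # Quadratic loss l(e) = e^2 / 2, so l'(e) = e.
            z = [m + eta * err * xj for m, xj in zip(mirror(w), x)]
            w = mirror_inv(z)
    return w

def sinh_map(v):
    # Mirror map of the illustrative potential psi(w) = sum_j cosh(w_j).
    return [math.sinh(t) for t in v]

def asinh_map(v):
    return [math.asinh(t) for t in v]

# Two noiseless measurements of a three-dimensional weight vector.
xs = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
ys = [1.0, 2.0]

w = smd_linear(xs, ys, 0.25, sinh_map, asinh_map, [0.0, 0.0, 0.0], 2000)
residuals = [y - sum(xj * wj for xj, wj in zip(x, w)) for x, y in zip(xs, ys)]

print(w, residuals)
```

The step size is chosen so that $\eta \|x_i\|^2 \le 1$, which keeps $\psi - \eta L_i$ convex for this potential since its Hessian is bounded below by the identity.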
II-C Local and Global Interpretations of SMD
It is straightforward to show that at each iteration, SMD solves the following optimization problem:
$$w_i = \arg\min_{w} \; \eta\, w^T \nabla L_i(w_{i-1}) + D_\psi(w, w_{i-1}), \tag{15}$$
which can be verified by setting the gradient of the right hand side of (15) to zero. What the above relation shows is that the SMD iterates try to align themselves with the direction of the negative instantaneous gradient, while also trying to stay close to the previous iterate in Bregman divergence. (The learning rate $\eta$ sets the relative weight of these two objectives.) We refer to (15) as the local interpretation of SMD.
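The local interpretation can be checked numerically. The sketch below assumes the objective has the standard form $\eta\, w^T \nabla L_i(w_{i-1}) + D_\psi(w, w_{i-1})$ and uses the unnormalized negative entropy $\psi(w) = \sum_j (w_j \log w_j - w_j)$, for which the SMD update is $w_i = w_{i-1} \odot \exp(-\eta \nabla L_i(w_{i-1}))$. The gradient vector and the previous iterate are arbitrary illustrative values.

```python
import math

def bregman_entropy(w, wp):
    """D_psi(w, wp) for psi(w) = sum_j (w_j log w_j - w_j), positive entries."""
    return sum(a * math.log(a / b) - a + b for a, b in zip(w, wp))

def objective(w, w_prev, grad, eta):
    # eta * <w, grad> + D_psi(w, w_prev)
    linear = eta * sum(wj * gj for wj, gj in zip(w, grad))
    return linear + bregman_entropy(w, w_prev)

w_prev = [0.5, 1.0, 2.0]
grad = [0.3, -0.2, 0.1]      # stands in for the instantaneous gradient
eta = 0.5

# SMD update in closed form for this potential.
w_smd = [wj * math.exp(-eta * gj) for wj, gj in zip(w_prev, grad)]
best = objective(w_smd, w_prev, grad, eta)

# Objective values at small coordinate-wise perturbations of the SMD iterate.
perturbed = []
for j in range(len(w_smd)):
    for delta in (-0.05, 0.05):
        w_try = list(w_smd)
        w_try[j] += delta
        perturbed.append(objective(w_try, w_prev, grad, eta))

print(best, min(perturbed))
```

Because the objective is strictly convex in $w$, every perturbed value is strictly larger than the value at the SMD iterate.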
We have recently shown that SMD satisfies the following local conservation law [12, 6].
Lemma 2 (Local Conservation Law [12]).
Even though the loss function may not be convex, define the Bregman divergence in the usual way. Further define the quantity
[TABLE]
Then for each iteration of the SMD updates (12), it holds that
[TABLE]
Summing the local identities in (17) from time 1 to time leads to the following global conservation law
[TABLE]
Note that (18) holds for any horizon . We refer to it as the global interpretation of SMD. It can be used to show several remarkable deterministic properties of the SMD algorithm. We now mention a couple.
II-D Minimax Optimality of SMD
Using the aforementioned global identity, in [12, 6], the following has been shown.
Theorem 3 (Minimax Optimality [12]).
For any , provided is small enough so that is convex for all , then
[TABLE]
and SMD with learning rate is a minimax optimal algorithm achieving the above.
Theorem 3 is a generalization of the -optimality of the SGD algorithm for linear models and quadratic loss, where it is referred to as LMS [13, 14, 17], to SMD and general models and general losses. When the potential and loss are quadratic, we have and . The quantity , after some simplification, takes on the form
[TABLE]
which is the square of the so-called prediction error. In this case, we recover the $H^\infty$-optimality of LMS, namely that it solves
[TABLE]
and the optimal value is $1$. As mentioned above, Theorem 3 generalizes $H^\infty$-optimality in three ways: it holds for general potential, general loss function, and general nonlinear model.
II-E Convergence and Implicit Regularization
Another interesting property of SMD, which again can be proven using the global conservation law (18), is what is referred to as implicit regularization. In over-parameterized (underdetermined) models, which are common in compressed sensing and modern deep learning problems, there are (typically a lot) more parameters (unknowns) than data points (measurements). That means there are many parameter vectors (in fact infinitely many) that are consistent with the observations:
[TABLE]
The questions of interest in this regime are (1) does SMD converge to a solution? and (2) if it does so, which solution does it converge to? The following result answers these questions.
Theorem 4 (Convergence to the “Closest” Point [12]).
Suppose $\ell(\cdot)$ is differentiable and convex and has a unique root at $0$, $\psi(\cdot)$ is strictly convex, and $\eta > 0$ is such that $\psi - \eta L_i$ is convex for all $i$. Then for any $w_0$, the SMD iterates converge to
$$w_\infty = \arg\min_{w \in \mathcal{W}} D_\psi(w, w_0),$$
where $\mathcal{W}$ denotes the set of parameter vectors that are consistent with the observations.
Corollary 5 (Implicit Regularization [12]).
In particular, for the initialization $w_0 = \arg\min_{w} \psi(w)$, under the conditions of Theorem 4, the SMD iterates converge to
$$w_\infty = \arg\min_{w \in \mathcal{W}} \psi(w).$$
This means that running SMD, without any (explicit) regularization, results in a solution that has the smallest potential among all solutions, i.e., SMD implicitly regularizes the solution with $\psi(\cdot)$. In principle, one can choose the potential function for any desired convex regularization. For example, we can find the maximum entropy solution by taking the potential to be the negative entropy, or do compressed sensing with a suitable sparsity-promoting potential [12, 6].
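The closest-point and implicit-regularization statements can be illustrated in the simplest case, which we assume here: a linear model, quadratic loss, and quadratic potential, so that SMD is SGD and the Bregman divergence is half the squared Euclidean distance. The underdetermined system below (two equations, three unknowns) and both initializations are hypothetical; the expected limits were worked out by hand as the Euclidean projections of each initialization onto the solution set.

```python
def sgd_linear(xs, ys, eta, w0, epochs=2000):
    """SMD with quadratic potential (SGD) on the squared loss, cycling through the data."""
    w = list(w0)
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            err = y - sum(xj * wj for xj, wj in zip(x, w))
            w = [wj + eta * err * xj for wj, xj in zip(w, x)]
    return w

xs = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
ys = [1.0, 2.0]

# Initialization at argmin of the potential (the origin): minimum-norm solution.
w_from_zero = sgd_linear(xs, ys, 0.25, [0.0, 0.0, 0.0])
min_norm = [0.0, 1.0, 1.0]

# A different initialization: the interpolating solution closest to it.
w_from_other = sgd_linear(xs, ys, 0.25, [1.0, 0.0, 0.0])
closest_to_other = [1.0 / 3.0, 4.0 / 3.0, 2.0 / 3.0]

print(w_from_zero, w_from_other)
```

Both limits fit the data exactly; which one is reached depends only on the initialization, in line with the closest-point characterization.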
We should remark that the result extends to quasi-convex losses , and it holds locally (in an approximate sense) even for nonlinear models (non-convex cost).
III Main Results
The results about SMD discussed in the previous section were deterministic. In this section, we give a stochastic interpretation of SMD, and show that it is risk-sensitive optimal.
III-A Risk-Sensitive Optimality of SMD
Consider a stochastic model , where and are independent random variables with distributions and , which are members of the exponential family (note that when the potential function and the loss are square, both of these are Gaussian). A conventional quadratic estimator is one that minimizes the expected sum of squared prediction errors, i.e.,
[TABLE]
where the expectation is taken over and conditioned on the observations, and each in the minimization can only be a function of observations until time . For various problems, one may be interested in cost functions more general than quadratic, i.e.,
[TABLE]
The estimators that solve problems (23) and (24) are referred to as “risk-neutral” estimators.
An alternative criterion is the “risk-sensitive” (or exponential cost) criterion, which was first introduced in [18] and studied in [19, 20, 21]. In particular, an estimator that solves the problem
[TABLE]
is called a “risk-averse” estimator. The reason is that such a criterion places very large weights on large errors, and hence the estimator is more concerned with large values of the error (even if they occur rarely) than with moderate ones.
As in (24), one can consider an exponential cost of errors measured with a distance more general than quadratic, i.e.,
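The effect of the exponential criterion can be illustrated with a toy calculation that is not taken from the paper. Two hypothetical error distributions have the same mean-square cost, yet very different exponential costs $\mathbb{E}\exp(e^2/2)$. For a scalar Gaussian error $e \sim \mathcal{N}(0, \sigma^2)$, the cost $\mathbb{E}\exp(\theta e^2/2)$ equals $(1 - \theta\sigma^2)^{-1/2}$ and is finite only when $\theta\sigma^2 < 1$, which gives a scalar feel for why the exponent in such criteria cannot be made arbitrarily large.

```python
import math

def risk_neutral(errors, probs):
    """Expected quadratic cost E[e^2 / 2] of a discrete error distribution."""
    return sum(p * 0.5 * e * e for e, p in zip(errors, probs))

def risk_sensitive(errors, probs):
    """Expected exponential cost E[exp(e^2 / 2)] of a discrete error distribution."""
    return sum(p * math.exp(0.5 * e * e) for e, p in zip(errors, probs))

# Same mean-square error, very different tails.
steady = ([1.0], [1.0])                 # error of size 1 every time
spiky = ([0.0, 10.0], [0.99, 0.01])     # usually exact, rarely very wrong

rn_steady, rn_spiky = risk_neutral(*steady), risk_neutral(*spiky)
rs_steady, rs_spiky = risk_sensitive(*steady), risk_sensitive(*spiky)

def gaussian_exp_cost(theta, sigma2, half_width=30.0, h=1e-3):
    """Riemann-sum approximation of E[exp(theta * e^2 / 2)] for e ~ N(0, sigma2)."""
    n = int(round(2 * half_width / h))
    total = 0.0
    for k in range(n + 1):
        e = -half_width + k * h
        total += math.exp(0.5 * theta * e * e - 0.5 * e * e / sigma2)
    return total * h / math.sqrt(2 * math.pi * sigma2)

numeric = gaussian_exp_cost(0.5, 1.0)
closed_form = 1.0 / math.sqrt(1.0 - 0.5 * 1.0)

print(rn_steady, rn_spiky, rs_steady, rs_spiky, numeric, closed_form)
```

A risk-neutral criterion cannot tell the two discrete distributions apart, whereas the exponential criterion penalizes the rare large error by many orders of magnitude.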
[TABLE]
It has been shown in [14, 13] that SGD for square loss (aka LMS) solves the problem (25). In other words, LMS is risk-sensitive optimal. Formally, the result is as follows.
Theorem 6 (Hassibi et al. [13]).
Consider the model , where and are independent Gaussian random variables with means and [math] and variances and , respectively. Further, suppose that are persistently exciting and . Then the solution to the following optimization problem
[TABLE]
where the expectation is taken over conditioned on the observations, and is only allowed to depend on observations up to time , is given by , where are the SGD iterates.
We should further remark that no larger exponent than is possible (no algorithm can attain a finite cost if the exponent is larger than ).
The following result generalizes the risk-sensitive optimality of SGD for quadratic errors, to that of SMD for general Bregman-divergence errors.
Theorem 7.
Consider the model , where and are independent random variables with distributions and . Further, suppose that are persistently exciting, and is strictly convex for all . Then the solution to the following optimization problem
[TABLE]
where the expectation is taken over conditioned on the observations, and is only allowed to depend on observations up to time , is given by , where are the SMD iterates.
III-B Proof of Theorem 7
The expected exponential cost that needs to be minimized in Theorem 7 is given by
[TABLE]
where is a normalization constant that guarantees we are integrating the cost against a conditional distribution. The challenge in evaluating the above integral over $w$ is that $w$ appears in all three terms of the exponent. In order to facilitate the computation of this integral, it will be useful to use the completion-of-squares formula of Lemma 1 to gather $w$ into a single term. The following lemma provides precisely what we need.
Lemma 8.
It holds that
[TABLE]
where the , are given by the recursion
[TABLE]
Proof.
The proof is based on telescopically summing the local identity
[TABLE]
from to , where the are given through the recursion (27). This local identity can be either verified directly or obtained through two successive uses of Lemma 1.
As promised, Lemma 8 gathers $w$ into a single term so that the integral over $w$ can be performed. Once this integral is performed, we are left with the following cost function
[TABLE]
where is a constant obtained after integrating out . The above cost function must be recursively minimized over the , which are only allowed to be functions of , respectively. It is not clear how to do so from the above expression. The next lemma provides an identity that makes this recursive minimization straightforward.
Lemma 9.
It holds that
[TABLE]
Proof.
This can be verified by perhaps tedious, but straightforward, calculations.
In view of Lemma 9, the cost function to recursively minimize is
[TABLE]
Note that, at any time , the only term that has control over (in the sense that it is a term that depends only on past ) is the term
[TABLE]
(The other terms that are influenced by , such as , are influenced also by —see (27)—so that cannot knowledgeably minimize them.) The term can be minimized, and in fact set to zero, by taking
[TABLE]
which when plugging into (27) yields SMD. This completes the proof. (The attentive reader will have noticed that we needed Lemma 9 since it was not clear how to minimize over , since we could not have taken as depends on and is not allowed to.)
III-C Symmetric SMD (SSMD)
Our proof of the risk-sensitive optimality of SMD has led us to an alternative, and more symmetric version, of the algorithm that we refer to as symmetric SMD (or SSMD) and which may be of independent interest. The SSMD iterations are given by
[TABLE]
SSMD satisfies the following risk-sensitive optimality.
Theorem 10.
Consider the model , where and are independent random variables with . Further, suppose that are persistently exciting, and is strictly convex for all . Then the solution to the following optimization problem
[TABLE]
where the expectation is taken over conditioned on the observations, and is only allowed to depend on observations up to time , is given by , where are the SSMD iterates.
Proof.
The proof is similar to that of Theorem 7 and is omitted for brevity.
We note that the difference between SMD and SSMD is that the noise is now distributed according to , rather than , and that the exponent of the cost function is , rather than . The distributions and costs for SSMD appear to be more natural.
IV Other Stochastic Results
In the previous sections, we showed several fundamental deterministic and stochastic properties of SMD. One may ask how these results relate to the conventional mean-square convergence results, such as [8]. It turns out that the fundamental identity (conservation law (18)) of SMD allows proving such stochastic convergence results in a direct way (which avoids appealing to stochastic differential equations and ergodic averaging) [6].
As mentioned before, for vanishing step size, convergence of any algorithm is not surprising, and is in fact trivial (because you are not updating anymore). However, the more interesting question is whether the algorithm converges to anything interesting. It turns out that when the data points are generated according to a stochastic model with white noise, SMD converges to the “true” parameter. More specifically, consider a model where are iid with and , and the inputs are “persistently exciting,” i.e., for any , there exists s.t. . Note that this is different from the setting of Theorem 7, in that the noises need not be Gaussian or from the the exponential family (the only assumption is whiteness), and the parameter is deterministic. One can show that SMD with decaying step size indeed converges to , under suitable conditions on the step size sequence.
Theorem 11.
Consider the model where , , and the are persistently exciting. The stochastic mirror descent iterates for any strongly convex potential $\psi(\cdot)$, and a convex loss with a unique root at $0$, converge to $w$ in a mean-square sense, if the step size sequence satisfies $\sum_{i} \eta_i = \infty$ and $\sum_{i} \eta_i^2 < \infty$.
The step size conditions are known as Robbins–Monro [22] conditions.
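A small simulation sketch of convergence under a Robbins–Monro step size sequence. It assumes a linear model with white Gaussian noise, inputs that alternate between the two coordinate directions (a simple persistently exciting sequence), the quadratic potential and quadratic loss (so SMD reduces to SGD), and the step size $\eta_i = 1/i$, which satisfies $\sum_i \eta_i = \infty$ and $\sum_i \eta_i^2 < \infty$. The true parameter, noise level, seed, and horizon are arbitrary illustrative values.

```python
import random

def smd_decaying(w_true, n_steps, noise_std, seed=0):
    """SGD (SMD with quadratic potential) with step size eta_i = 1 / i on noisy linear data."""
    rng = random.Random(seed)               # fixed seed for a reproducible run
    inputs = [[1.0, 0.0], [0.0, 1.0]]       # alternating, persistently exciting inputs
    w = [0.0, 0.0]
    for i in range(1, n_steps + 1):
        x = inputs[(i - 1) % 2]
        noise = rng.gauss(0.0, noise_std)   # zero-mean white noise
        y = sum(xj * tj for xj, tj in zip(x, w_true)) + noise
        err = y - sum(xj * wj for xj, wj in zip(x, w))
        eta_i = 1.0 / i                     # Robbins-Monro step size
        w = [wj + eta_i * err * xj for wj, xj in zip(w, x)]
    return w

w_true = [1.0, -2.0]
w_hat = smd_decaying(w_true, 40000, 0.5)
max_err = max(abs(a - b) for a, b in zip(w_hat, w_true))

print(w_hat, max_err)
```

With a constant step size the same recursion would keep fluctuating at a noise floor instead of settling on the true parameter.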
V Conclusion
In this paper, we reviewed several fundamental properties of the stochastic mirror descent (SMD) family of algorithms, and provided a new stochastic interpretation of them, namely, that they are risk-sensitive optimal. The result generalizes a known result in the literature about the special case of SGD (aka LMS). Our analysis inspired a new algorithm, which is a “more symmetric” variant of SMD. Future work may concern studying this new algorithm and its convergence properties in more detail.
References
[1] A. Nemirovski and D. B. Yudin, “Problem complexity and method efficiency in optimization,” 1983.
[2] A. Beck and M. Teboulle, “Mirror descent and nonlinear projected subgradient methods for convex optimization,” Operations Research Letters, vol. 31, no. 3, pp. 167–175, 2003.
[3] N. Cesa-Bianchi, P. Gaillard, G. Lugosi, and G. Stoltz, “Mirror descent meets fixed share (and feels no regret),” in Advances in Neural Information Processing Systems, 2012, pp. 980–988.
[4] Z. Zhou, P. Mertikopoulos, N. Bambos, S. Boyd, and P. W. Glynn, “Stochastic mirror descent in variationally coherent optimization problems,” in Advances in Neural Information Processing Systems, 2017, pp. 7043–7052.
[5] A. Nedic and S. Lee, “On stochastic subgradient mirror-descent algorithm with weighted averaging,” SIAM Journal on Optimization, vol. 24, no. 1, pp. 84–107, 2014.
[6] N. Azizan and B. Hassibi, “A characterization of stochastic mirror descent algorithms and their convergence properties,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
[7] M. Raginsky and J. Bouvrie, “Continuous-time stochastic mirror descent on a network: Variance reduction, consensus, convergence,” in 2012 IEEE 51st IEEE Conference on Decision and Control (CDC). IEEE, 2012, pp. 6793–6800.
[8] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, “Robust stochastic approximation approach to stochastic programming,” SIAM Journal on Optimization, vol. 19, no. 4, pp. 1574–1609, 2009.
