Generalized Deterministic Perturbations For Stochastic Gradient Search
K. Chandramouli, K. J. Prabuchandran, D. Sai Koti Reddy, and Shalabh Bhatnagar

TL;DR
This paper characterizes and constructs optimal deterministic perturbation sequences for stochastic gradient search, improving the RDKW algorithm's performance and convergence over random and previously known deterministic methods.
Contribution
It introduces a generalized class of deterministic perturbations for RDKW, including a sequence with minimal cycle length, and proves convergence.
Findings
Proposed deterministic sequence outperforms Hadamard and random perturbations in simulations.
Established convergence of RDKW with the new class of deterministic perturbations.
Expanded the set of deterministic perturbations available for stochastic gradient search.
Chandramouli K.¹, Prabuchandran K.J.¹˒², D. Sai Koti Reddy³, and Shalabh Bhatnagar¹˒⁴
¹ Department of Computer Science and Automation, Indian Institute of Science (IISc)
² Supported by Amazon-IISc Postdoctoral fellowship
³ IBM Research, Bangalore
⁴ Robert Bosch Centre for Cyber-Physical Systems, IISc
Abstract
Stochastic optimization (SO) considers the problem of optimizing an objective function in the presence of noise. Most of the solution techniques in SO estimate gradients from the noise-corrupted observations of the objective and adjust parameters of the objective along the direction of the estimated gradients to obtain locally optimal solutions. Two prominent algorithms in SO, namely Random Direction Kiefer-Wolfowitz (RDKW) and Simultaneous Perturbation Stochastic Approximation (SPSA), obtain noisy gradient estimates by randomly perturbing all the parameters simultaneously. This forces the search direction to be random in these algorithms and causes them to suffer additional noise on top of the noise incurred from the samples of the objective. Owing to this additional noise, the idea of using deterministic perturbations instead of random perturbations for gradient estimation has also been studied. Two specific constructions of the deterministic perturbation sequence, using lexicographical ordering and Hadamard matrices, have been explored and encouraging results have been reported in the literature. In this paper, we characterize the class of deterministic perturbation sequences that can be utilized in the RDKW algorithm. This class expands the set of known deterministic perturbation sequences available in the literature. Using our characterization, we propose a construction of a deterministic perturbation sequence that has the least cycle length among all deterministic perturbations. Through simulations we illustrate the performance gain of the proposed deterministic perturbation sequence in the RDKW algorithm over the Hadamard and the random perturbation counterparts. We also establish the convergence of the RDKW algorithm for the generalized class of deterministic perturbations.
I Introduction
Stochastic optimization (SO) problems frequently arise in engineering disciplines such as transportation systems, machine learning, service systems and manufacturing. Practical limitations, lack of model information and the large dimensionality of these problems prohibit analytic solutions. Simulation is often employed to evaluate the performance of the current parameters of the system. Simulating and evaluating the system’s performance is generally expensive and one is typically constrained by a simulation budget. In such scenarios, one aims to drive the system to optimal parameter settings using as few simulations as possible.
Under the SO framework, we have a system that gives noise-corrupted feedback of the performance for the currently set parameters, i.e., given the system parameter vector $\theta \in \mathbb{R}^p$, the feedback that is available is the noisy evaluation $h(\theta, \xi)$ of the performance $J(\theta) = E_{\xi}[h(\theta, \xi)]$, where $\xi$ is the noise term inherent in the system and $J(\theta)$ denotes the expected performance of the system for the parameter $\theta$. The pictorial description of such a system is shown in Figure 1. The objective in the SO problem then is to determine a parameter $\theta^*$ that gives the optimal expected performance of the system, i.e.,
$$\theta^* = \arg\min_{\theta \in \mathbb{R}^p} J(\theta). \tag{1}$$
Analogous to solutions for deterministic optimization problems, where the explicit analytic gradient of the objective function is used to adjust the parameters along the negative gradient direction, many of the solution approaches in SO mimic the familiar gradient descent algorithm. However, unlike the deterministic setting, the SO setting only has access to noise-corrupted samples of the objective. Thus, in the SO setting, one essentially aims at estimating the gradient of the objective function using noisy cost samples. In the pioneering work by Kiefer and Wolfowitz [1], the gradient is estimated by approximating each of the partial derivatives using either a two-sided or a one-sided finite difference; the resulting scheme is known as the finite difference stochastic approximation (FDSA) algorithm. For a $p$-dimensional parameter problem, this algorithm requires $2p$ objective function evaluations (or simulations) per iteration for the two-sided gradient approximation scheme and $p+1$ simulations per iteration for the one-sided scheme (see [2]). As the number of simulations per iteration required for gradient estimation scales linearly with the dimension of the problem, the FDSA algorithm is expensive to deploy under high-dimensional parameter settings.
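To make the simulation count concrete, the following is a minimal sketch of a two-sided finite-difference gradient estimate. The function name `fdsa_gradient`, the test objective and the perturbation size are illustrative assumptions, not taken from the paper; the sketch only shows that one coordinate at a time is perturbed, so the cost grows as $2p$ evaluations per gradient.

```python
import numpy as np

def fdsa_gradient(J, theta, delta):
    """Two-sided finite-difference gradient estimate (illustrative sketch).

    Perturbs one coordinate at a time, so it needs 2*p evaluations of J.
    """
    theta = np.asarray(theta, dtype=float)
    p = theta.size
    grad = np.zeros(p)
    num_evals = 0
    for i in range(p):
        e_i = np.zeros(p)
        e_i[i] = 1.0  # i-th canonical basis vector
        grad[i] = (J(theta + delta * e_i) - J(theta - delta * e_i)) / (2.0 * delta)
        num_evals += 2
    return grad, num_evals

# Hypothetical noise-free objective used only for illustration.
J = lambda th: float(np.sum(th ** 2))
grad, num_evals = fdsa_gradient(J, np.array([1.0, -2.0, 0.5]), delta=0.1)
```

For this quadratic the central difference is exact, so `grad` equals $2\theta$, while `num_evals` is $2p = 6$.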
The Random Direction Kiefer-Wolfowitz (RDKW) algorithm, which uses only two simulations per iteration to obtain gradient estimates, was proposed in [3]. In the RDKW algorithm, all the parameters are randomly perturbed simultaneously using two parallel simulations, and the function evaluations at those perturbed parameters are used to obtain the gradient estimate. The random perturbation vector and the random direction vector involved in estimating the gradient are kept the same. For the choice of random direction (or perturbation), various distributions such as the spherical uniform distribution [3], the uniform distribution [4], the normal and Cauchy distributions [5] and the asymmetric Bernoulli distribution [6] have been explored. The number of simulations required for estimating the gradients in the RDKW algorithm is significantly smaller than in the FDSA algorithm, and the algorithm is seen to perform empirically better than FDSA.
In a seminal work [7], the Simultaneous Perturbation Stochastic Approximation (SPSA) algorithm, which also uses two simulations per iteration, was proposed. Unlike the RDKW algorithm, SPSA employs different choices for the parameter perturbations and the random direction of movement; in particular, the random perturbation direction and the random direction of movement are chosen to be componentwise inverses of each other. In [7], the symmetric Bernoulli distribution has been shown to be the best choice for random perturbations among all the distributions, and the proposed SPSA scheme has been proven to perform asymptotically better than FDSA. In [8], a comprehensive comparative study of the stochastic optimization algorithms FDSA, RDKW and SPSA has been provided. Further, under a general third order cross derivative assumption on the loss function, RDKW with the symmetric Bernoulli distribution has been shown to be the best choice for random directions. In [9], an example of a loss function that does not satisfy the third order cross derivative condition in [8] has been constructed. For such a loss function, it has been shown that the optimal distribution choice for random directions need not be symmetric Bernoulli.
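The difference between the two estimators can be illustrated with a small sketch. The function names, the noise-free objective and the perturbation vectors below are hypothetical choices made for illustration only. RDKW multiplies the finite difference by the perturbation itself, whereas SPSA divides by it componentwise, so the two coincide exactly when every component of the perturbation is $\pm 1$.

```python
import numpy as np

def rdkw_estimate(J, theta, d, delta):
    """RDKW-style estimate: the perturbation d is also the direction of movement."""
    diff = (J(theta + delta * d) - J(theta - delta * d)) / (2.0 * delta)
    return d * diff

def spsa_estimate(J, theta, d, delta):
    """SPSA-style estimate: the direction is the componentwise inverse of d."""
    diff = (J(theta + delta * d) - J(theta - delta * d)) / (2.0 * delta)
    return diff / d

# Hypothetical noise-free objective and perturbations, for illustration only.
J = lambda th: float(np.sum(th ** 2))
theta = np.array([1.0, -2.0, 0.5])
bernoulli_d = np.array([1.0, -1.0, 1.0])   # components in {-1, +1}
general_d = np.array([2.0, -0.5, 1.0])     # components not in {-1, +1}

same = np.allclose(rdkw_estimate(J, theta, bernoulli_d, 0.1),
                   spsa_estimate(J, theta, bernoulli_d, 0.1))
different = not np.allclose(rdkw_estimate(J, theta, general_d, 0.1),
                            spsa_estimate(J, theta, general_d, 0.1))
```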
In [3] and [10], to further reduce the simulation cost per iteration, extensions of the RDKW and SPSA algorithms that estimate the gradient with only one simulation or measurement of the objective have been considered. However, it is observed that the one-simulation gradient estimate has higher bias than the two-simulation gradient estimate. In [11] and [12], deterministic conditions on the perturbation and noise sequences required to obtain almost sure convergence of the iterates have been discussed. In [13], to enhance the performance of the one-sided SPSA scheme, deterministic perturbations based on lexicographical ordering and Hadamard matrices have been proposed. Further, the numerical results in [13] illustrate the benefit of Hadamard matrix based perturbation sequences, which improve the performance of SPSA empirically for the case of one-sided measurements. In [14], a unified view of both RDKW and SPSA is presented and a binary deterministic perturbation sequence using orthogonal arrays [15] for obtaining gradient estimates in both algorithms has been discussed.
In this paper, we generalize the class of deterministic perturbation sequences that can be utilized in the RDKW algorithm. Based on this characterization, we provide a construction of a deterministic perturbation sequence using a specially chosen circulant matrix. We empirically study the performance of the constructed sequence against the aforementioned Hadamard matrix based deterministic perturbations and the randomized perturbations. We expect that our generalization will make it possible to study the rate of convergence of the RDKW algorithm based on deterministic perturbation sequences. We now summarize our contributions:
- We generalize the class of deterministic perturbation sequences that can be applied in the RDKW algorithm.
- We provide a special construction of a deterministic perturbation sequence with smaller cycle length compared to the Hadamard perturbation sequence.
- We illustrate the performance gain of the proposed deterministic perturbations over the Hadamard matrix based perturbations as well as random perturbations.
- We prove the convergence of the RDKW algorithm for the class of deterministic perturbations.
II Conditions on Deterministic Perturbations
In this section, we describe the classical RDKW algorithm and motivate the necessary conditions that a deterministic perturbation sequence should satisfy for almost sure convergence of the iterates in the deterministic perturbation version of RDKW algorithm.
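The standard RDKW algorithm iteratively updates the parameter vector along the direction of the negative estimated gradient, i.e.,
$$\theta_{n+1} = \theta_n - a_n \widehat{\nabla} J(\theta_n), \tag{2}$$
where $a_n$ is the step-size that satisfies standard stochastic approximation conditions (see Assumption A2 in Section IV) and $\widehat{\nabla} J(\theta_n)$ is the estimate of the gradient of the objective function $J$ at the current parameter $\theta_n$.
In the case of the two-simulation RDKW algorithm, the gradient estimate at $\theta_n$ is obtained as
$$\widehat{\nabla} J(\theta_n) = d_n \left[\frac{J(\theta_n + \delta_n d_n) - J(\theta_n - \delta_n d_n)}{2\delta_n}\right], \tag{3}$$
where $\delta_n > 0$ is the perturbation size and $d_n$ is the random perturbation direction chosen according to a specific probability distribution. The properties that the distribution of $d_n$ should satisfy can be obtained as explained below. The Taylor series expansion of $J(\theta_n \pm \delta_n d_n)$ around $\theta_n$ is given by
$$J(\theta_n \pm \delta_n d_n) = J(\theta_n) \pm \delta_n d_n^T \nabla J(\theta_n) + O(\delta_n^2). \tag{4}$$
From (4), the error between the estimate and the true gradient at $\theta_n$ can be obtained as
$$\widehat{\nabla} J(\theta_n) - \nabla J(\theta_n) = (d_n d_n^T - I)\nabla J(\theta_n) + O(\delta_n). \tag{5}$$
Note that the term $(d_n d_n^T - I)\nabla J(\theta_n)$ constitutes the bias in the gradient estimate. For the error in (5) to be negligible, we require
$$E[d_n d_n^T] = I. \tag{6}$$
Here, the expectation is taken over the random perturbation distribution.
In the one-simulation version of the RDKW algorithm, the gradient estimate at $\theta_n$ is obtained as
$$\widehat{\nabla} J(\theta_n) = d_n \left[\frac{J(\theta_n + \delta_n d_n)}{\delta_n}\right]. \tag{7}$$
By an analogous Taylor series argument, we obtain the error between the estimate and the true gradient as
$$\widehat{\nabla} J(\theta_n) - \nabla J(\theta_n) = \frac{J(\theta_n)}{\delta_n} d_n + (d_n d_n^T - I)\nabla J(\theta_n) + O(\delta_n). \tag{8}$$
From (8), we require the following to hold in addition to (6) in the case of random perturbations for the one-simulation version of the RDKW algorithm, i.e.,
$$E[d_n] = 0. \tag{9}$$
For random perturbations $d_n$ drawn from any distribution that satisfies (6) and (9), the noise in the gradient estimates gets averaged out asymptotically. An example distribution for $d_n$ is symmetric Bernoulli, where each component of the perturbation vector is $\pm 1$ with equal probability.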
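As a quick sanity check of the symmetric Bernoulli example, the sketch below computes the exact first and second moments by enumerating all $2^p$ equally likely sign vectors. The function name and the dimension are illustrative assumptions; the moment conditions checked are a zero mean for the perturbation and an identity second-moment matrix.

```python
import itertools
import numpy as np

def bernoulli_moments(p):
    """Exact E[d] and E[d d^T] for symmetric Bernoulli d in {-1, +1}^p.

    Every sign vector has probability 2^{-p}, so expectations are plain averages.
    """
    vectors = np.array(list(itertools.product([-1.0, 1.0], repeat=p)))  # shape (2^p, p)
    mean_d = vectors.mean(axis=0)
    second_moment = vectors.T @ vectors / len(vectors)  # average of d d^T
    return mean_d, second_moment

mean_d, second_moment = bernoulli_moments(4)
```

Both conditions hold exactly here because the diagonal entries of $d d^T$ are always 1 and the off-diagonal entries are $\pm 1$ with equal probability.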
From (6) and (9), one is motivated to look for deterministic perturbations that satisfy similar properties. In what follows, the sequence of deterministic perturbations (that will be used in either (3) or (7)) will be denoted by $\{d_n\}_{n \ge 1}$, and we require the following two properties to hold for the almost sure convergence of the iterates to a local minimum.
- P1. Let $D_n := d_n d_n^T - I$. For any $s \in \mathbb{N}$ there exists a $P \in \mathbb{N}$ such that $\sum_{n=s+1}^{s+P} D_n = 0$, and
- P2. $\sum_{n=s+1}^{s+P} d_n = 0$.
Remark 1.
The properties P1 and P2 are the deterministic analogues of (6) and (9). For the properties P1 and P2 to hold, it is sufficient to determine a finite sequence $d_1, \dots, d_P$ such that $\sum_{n=1}^{P} d_n d_n^T = P I$ and $\sum_{n=1}^{P} d_n = 0$, and for $n > P$ to periodically cycle through this sequence, i.e., set $d_n = d_{((n-1) \bmod P) + 1}$. We will refer to the length $P$ of the deterministic perturbation sequence as the cycle length.
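The finite-cycle conditions in Remark 1 are easy to test numerically. The following is a minimal sketch, assuming the perturbations of one cycle are stored as the columns of a $p \times P$ array `Y` (a storage convention chosen here for illustration); the helper names are hypothetical. Over one cycle, P1 amounts to $Y Y^T = P I$ and P2 to the rows of `Y` summing to zero.

```python
import numpy as np

def satisfies_p1(Y, tol=1e-9):
    """P1 over one cycle: sum_n (d_n d_n^T - I) = 0, i.e. Y Y^T = P * I."""
    p, P = Y.shape
    return bool(np.allclose(Y @ Y.T, P * np.eye(p), atol=tol))

def satisfies_p2(Y, tol=1e-9):
    """P2 over one cycle: sum_n d_n = 0, i.e. every row of Y sums to zero."""
    return bool(np.allclose(Y.sum(axis=1), 0.0, atol=tol))

# Scaled canonical basis vectors: satisfy P1 but not P2.
Y_basis = np.sqrt(3.0) * np.eye(3)
# A vector and its negative: satisfy P2 but not P1.
Y_pair = np.array([[1.0, -1.0],
                   [1.0, -1.0]])
# All four sign vectors in dimension 2: satisfy both.
Y_signs = np.array([[1.0, 1.0, -1.0, -1.0],
                    [1.0, -1.0, 1.0, -1.0]])
```

Because the sequence is periodic, any window of $P$ consecutive perturbations is a cyclic shift of one full cycle, so checking a single cycle is enough.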
III Construction Of Deterministic Perturbations
In Section III-A, following Remark 1, we first characterize the finite sequences that satisfy properties P1 and P2 by providing a matrix equation whose solution gives the deterministic perturbations. In Section III-B, we then construct a specific sequence using a circulant matrix that has the least possible cycle length among all the deterministic perturbation sequences. Finally, in Section III-C, we completely describe the RDKW algorithm that uses the deterministic perturbation sequence constructed using the circulant matrix approach.
III-A Matrix condition for Deterministic Perturbations
The properties P1 and P2 can be satisfied individually. For example, to satisfy property P1, let $P = p$ and $d_n = \sqrt{p}\, e_n$, $n = 1, \dots, p$, the scaled canonical basis vectors; then $\sum_{n=1}^{P} d_n d_n^T = pI = PI$. To satisfy property P2, consider any set of linearly dependent vectors $v_1, \dots, v_P$. Then there exist scalars $\alpha_1, \dots, \alpha_P$, not all zero, such that $\sum_{n=1}^{P} \alpha_n v_n = 0$. Now for the choice $d_n = \alpha_n v_n$, the property P2 is trivially satisfied. A natural question is how to determine sequences that satisfy both properties simultaneously.
To address this problem, let us consider the $p \times P$ matrix $Y$ whose columns are the perturbations:
$$Y := \begin{bmatrix}\uparrow&\uparrow&\cdots&\uparrow\\ d_{1}&d_{2}&\cdots&d_{P}\\ \downarrow&\downarrow&\cdots&\downarrow\end{bmatrix}.$$
Let $u = [1, \dots, 1]^T$ be a $P$-dimensional vector of ones. The perturbations that satisfy properties P1 and P2 essentially solve the two matrix equations $Y Y^T = P I_{p \times p}$ and $Y u = 0$. These equations can be compactly written in a single matrix equation as
$$X X^T = P I_{(p+1) \times (p+1)}, \tag{10}$$
where $X = \begin{bmatrix} u^{T} \\ Y \end{bmatrix}$. Note that $Y$ and $P$ are the unknowns here.
It can be observed from (10) that $X/\sqrt{P}$ can be treated as a submatrix of a $P \times P$ orthogonal matrix whose first row is $u^T/\sqrt{P}$, a scaled $1 \times P$ vector of ones. It has been shown in [13] that columns of Hadamard matrices satisfy properties P1 and P2 simultaneously with $P = 2^{\lceil \log_2 (p+1) \rceil}$, i.e., $Y$ is chosen as a submatrix of the $P \times P$ Hadamard matrix. It is not in general clear if equation (10) can be solved for a smaller $P$.
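The Hadamard-based choice can be sketched as follows. This is one way to obtain such a sequence, assuming the Sylvester construction of the Hadamard matrix and taking rows $1, \dots, p$ (skipping the all-ones row) as the matrix of perturbations; the exact selection used in [13] may differ, and the function names are hypothetical.

```python
import numpy as np

def sylvester_hadamard(order):
    """Sylvester construction of a Hadamard matrix; order must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < order:
        H = np.block([[H, H], [H, -H]])
    return H

def hadamard_perturbations(p):
    """Return a p x P matrix whose columns are +/-1 perturbations (sketch).

    P is the smallest power of two that is at least p + 1.
    """
    P = 1
    while P < p + 1:
        P *= 2
    H = sylvester_hadamard(P)
    # Row 0 of the Sylvester matrix is all ones; the next p rows are
    # orthogonal to it and to each other.
    return H[1:p + 1, :]

Y_had = hadamard_perturbations(5)  # p = 5 gives cycle length P = 8
```

Orthogonality of the rows gives $Y Y^T = P I$, and orthogonality to the all-ones row gives $Y u = 0$. The cycle length can be close to $2p$, for example $P = 16$ for $p = 8$.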
Remark 2.
We note that similar analysis for matrix condition for the construction of deterministic perturbations for SPSA estimates involves solving the following matrix system. , and where is , is , is vector of ones, is vector of ones and denotes the Hadamard product of the matrices and . It is not clear how to solve for and due to the presence of Hadamard product in this system.
III-B Specific Perturbation Sequence Construction
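In this section, our goal is to obtain a sequence with the least cycle length. Using a simple matrix rank argument it can be shown that $P$ is at least $p+1$: by (10), the $(p+1) \times P$ matrix $X$ satisfies $X X^T = P I$, which has rank $p+1$, and this is possible only if $P \ge p+1$. Thus, in what follows, we give a construction of a deterministic perturbation sequence with cycle length $P = p+1$. We first write
$$Y = [A, \; -AV],$$
where $A$ is a $p \times p$ matrix and $V$ is any $p \times (P-p)$ matrix whose rows each sum to 1, i.e., $V u_{P-p} = u_p$, with $u_k$ denoting the $k$-dimensional vector of ones. Clearly $Y$ satisfies property P2, since $Y u_P = A u_p - A V u_{P-p} = 0$.
Satisfying property P1, i.e., $Y Y^T = P I$, is equivalent to
$$A (I + V V^T) A^T = P I. \tag{11}$$
Clearly, constructing deterministic perturbations with smaller cycle length is equivalent to solving (11) for $A$ with an appropriate choice of $V$.
The simplest choice of $V$ with row sums equal to 1 is $V = u$, the $p \times 1$ vector of ones, thus $P = p+1$. Let ($p \times p$ dimensional matrix)
$$H := I + u u^T = \begin{bmatrix} 2 & 1 & \cdots & 1 \\ 1 & 2 & \cdots & 1 \\ \vdots & & \ddots & \vdots \\ 1 & 1 & \cdots & 2 \end{bmatrix}. \tag{12}$$
Observe that $H$ is a positive definite circulant matrix. Hence $H^{-1/2}$ is well defined and the choice $A = \sqrt{p+1}\, H^{-1/2}$ satisfies (11) and solves the system with $P = p+1$, i.e.,
$$Y = \sqrt{p+1}\, [H^{-1/2}, \; -H^{-1/2} u]. \tag{13}$$
The columns of $Y$ finally give us the deterministic perturbations. We note that in general the computation of $H^{-1/2}$ is $O(p^3)$ and can be very expensive for large $p$. However, owing to the special structure of $H$, using a Sherman-Morrison type result (see Lemma 1, Section IV), $H^{-1/2}$ can be computed in $O(p^2)$ time.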
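The construction can be checked numerically with a short sketch. It assumes, as our reading of this section, that $H = I + u u^T$ with $u$ the $p$-dimensional vector of ones, that $Y = \sqrt{p+1}\,[H^{-1/2}, -H^{-1/2} u]$, and that $H^{-1/2}$ has the closed form $I - u u^T/p + u u^T/(p\sqrt{1+p})$ appearing in the proof of Lemma 1. The function names are hypothetical.

```python
import numpy as np

def inv_sqrt_H(p):
    """Closed form assumed for (I + u u^T)^{-1/2}; filling p*p entries costs O(p^2)."""
    ones = np.ones((p, p))  # equals u u^T
    return np.eye(p) - ones / p + ones / (p * np.sqrt(1.0 + p))

def circulant_perturbations(p):
    """p x (p+1) matrix whose columns form one cycle of perturbations (sketch)."""
    u = np.ones((p, 1))
    R = inv_sqrt_H(p)
    return np.sqrt(p + 1.0) * np.hstack([R, -R @ u])

p = 4
Y_circ = circulant_perturbations(p)
R = inv_sqrt_H(p)
H = np.eye(p) + np.ones((p, p))
```

Under these assumptions the cycle length is $p+1$ for every $p$, compared with a power of two for the Hadamard-based choice.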
III-C Gradient estimation
In this section, we present the RDKW algorithms that use the deterministic perturbation sequence constructed above in two-simulation and one-simulation gradient estimates of the objective. We denote the corresponding algorithms by DSPKW-2C and DSPKW-1C respectively.
Let $\delta_n$, $n \ge 0$, denote a sequence of diminishing positive real numbers satisfying assumption A2 in Section IV. Let $y_n^+$, $y_n^-$ denote the noisy objective function evaluations at the perturbed parameters $\theta_n + \delta_n d_n$ and $\theta_n - \delta_n d_n$ respectively, i.e., $y_n^+ = J(\theta_n + \delta_n d_n) + M_{n+1}^+$ and $y_n^- = J(\theta_n - \delta_n d_n) + M_{n+1}^-$. We assume the noise terms $M_n^+$, $M_n^-$ form martingale difference sequences, i.e., $E[M_{n+1}^{\pm} \mid \mathcal{F}_n] = 0$, where $\mathcal{F}_n$ denotes the information conditioned on the past parameter values and martingale difference terms.
The two-simulation and one-simulation estimates of the gradient based on the observed noisy objective samples for the RDKW algorithm are given by
$$\widehat{\nabla} J(\theta_n) = d_n \left[\frac{y_n^+ - y_n^-}{2\delta_n}\right], \tag{14}$$
$$\widehat{\nabla} J(\theta_n) = d_n \left[\frac{y_n^+}{\delta_n}\right], \tag{15}$$
respectively. Observe that the two-sided estimate (14) uses two function samples $y_n^+$ and $y_n^-$, whereas the estimate in (15) uses only one function sample $y_n^+$.
Now we briefly describe the DSPKW algorithm. Inputs to the DSPKW algorithm are a randomly chosen initial point $\theta_0$, diminishing sequences $a_n$ and $\delta_n$ satisfying assumption A2, and the matrix of deterministic perturbations $Y$ chosen according to (13). In our algorithms, we iteratively choose the perturbations by cycling through the columns of $Y$ with period $p+1$ and in steps 2-4, we update the parameters along the direction of the estimated gradient according to (14) in the DSPKW-2C algorithm and according to (15) in the DSPKW-1C algorithm. Note that the choice of gradient estimate (or the algorithm) is dictated by the simulation budget given to us. The algorithms terminate by returning the parameter $\theta_N$ at the end of $N$ iterations.
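The description above can be sketched as a short loop. This is a minimal illustration under stated assumptions rather than the authors' implementation: the perturbation matrix uses the closed form assumed for $\sqrt{p+1}\,[H^{-1/2}, -H^{-1/2}u]$ with $H = I + uu^T$, the step-size schedules $a_n = a_0/(n+1)^{\alpha}$ and $\delta_n = \delta_0/(n+1)^{\gamma}$ with their default constants are hypothetical choices, and the test objective is a noise-free quadratic chosen for illustration.

```python
import numpy as np

def circulant_perturbations(p):
    """Assumed form Y = sqrt(p+1) [H^{-1/2}, -H^{-1/2} u] with H = I + u u^T."""
    ones = np.ones((p, p))
    inv_sqrt = np.eye(p) - ones / p + ones / (p * np.sqrt(1.0 + p))
    u = np.ones((p, 1))
    return np.sqrt(p + 1.0) * np.hstack([inv_sqrt, -inv_sqrt @ u])

def dspkw(sample_J, theta0, num_iters, two_sided=True,
          a0=0.05, alpha=0.602, delta0=0.1, gamma=0.101):
    """Sketch of DSPKW-2C (two_sided=True) or DSPKW-1C (two_sided=False).

    sample_J(theta) returns one (possibly noisy) evaluation of the objective.
    The step-size constants are hypothetical defaults, not values from the paper.
    """
    theta = np.asarray(theta0, dtype=float).copy()
    Y = circulant_perturbations(theta.size)
    P = Y.shape[1]  # cycle length p + 1
    for n in range(num_iters):
        a_n = a0 / (n + 1) ** alpha
        delta_n = delta0 / (n + 1) ** gamma
        d_n = Y[:, n % P]  # cycle through the columns of Y
        y_plus = sample_J(theta + delta_n * d_n)
        if two_sided:
            y_minus = sample_J(theta - delta_n * d_n)
            grad_est = d_n * (y_plus - y_minus) / (2.0 * delta_n)
        else:
            grad_est = d_n * y_plus / delta_n
        theta = theta - a_n * grad_est
    return theta

# Hypothetical noise-free quadratic with minimizer at the all-ones vector.
target = np.ones(5)
quadratic = lambda th: float(np.sum((th - target) ** 2))
theta_hat = dspkw(quadratic, np.zeros(5), num_iters=2000)
```

The one-simulation variant carries the extra $J(\theta_n) d_n/\delta_n$ term in its estimate, so in practice it typically needs smaller step sizes than the defaults used in this sketch.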
IV Convergence Analysis
In this section we first provide a few lemmas that assist in computing the proposed deterministic perturbation sequence (see (13) in Section III-B). In the latter part of the section, we prove the almost sure convergence of the iterates for the class of deterministic perturbations characterized in Section III-A.
The following lemma is useful in obtaining the negative square root of $H$, i.e., $H^{-1/2}$, in a computationally efficient manner. Note that it takes only $O(p^2)$ operations to compute $H^{-1/2}$ using the lemma and the circulant structure of $H$. Note also that the lemma could be utilized in an independent context for efficient computation.
Lemma 1.
Let $I$ be a $p \times p$ identity matrix and $u = [1, \dots, 1]^T$ be a $p \times 1$ column vector of 1s, then
$$(I + u u^T)^{-1/2} = I - \frac{u u^T}{p} + \frac{u u^T}{p\sqrt{1+p}}.$$
Proof.
It is enough to show that
$$\Big[I - \frac{u u^T}{p} + \frac{u u^T}{p\sqrt{1+p}}\Big]^2 = (I + u u^T)^{-1}.$$
Using $u^T u = p$ in the expansion of $\Big[I-\frac{uu^{T}}{p}+\frac{uu^{T}}{p\sqrt{1+p}}\Big]^{2}$ gives the result. ∎
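For readers who want the omitted algebra, the expansion can be written out as follows. The shorthand $c$ is introduced only for this sketch; $u$ is the $p$-dimensional vector of ones, so $u^T u = p$.

```latex
\begin{align*}
c &:= -\frac{1}{p} + \frac{1}{p\sqrt{1+p}}, \qquad \text{so that the claimed root is } I + c\,uu^T,\\
(I + c\,uu^T)^2 &= I + 2c\,uu^T + c^2\,u(u^Tu)u^T = I + (2c + c^2 p)\,uu^T,\\
2c + c^2 p &= c\,(2 + cp)
  = \frac{1}{p}\Big(\frac{1}{\sqrt{1+p}} - 1\Big)\Big(\frac{1}{\sqrt{1+p}} + 1\Big)
  = \frac{1}{p}\Big(\frac{1}{1+p} - 1\Big) = -\frac{1}{1+p},\\
(I + c\,uu^T)^2\,(I + uu^T) &= \Big(I - \frac{uu^T}{1+p}\Big)(I + uu^T)
  = I + uu^T - \frac{uu^T}{1+p} - \frac{p\,uu^T}{1+p} = I.
\end{align*}
```

The matrix $I + c\,uu^T$ is symmetric with eigenvalues $1$ (on the orthogonal complement of $u$) and $1 + cp = 1/\sqrt{1+p}$ (along $u$), so it is positive definite and is therefore the principal inverse square root.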
Let $H$ be defined as in (12) and let the perturbations $d_1, \dots, d_{p+1}$ be the columns of $Y = \sqrt{p+1}\, [H^{-1/2}, \; -H^{-1/2} u]$.
Lemma 2.
The perturbations chosen as columns of $Y$ satisfy properties P1 and P2.
Proof.
It easily follows from the discussion in section III-B on the construction of this specific perturbation sequence. ∎
In what follows, we prove the almost sure convergence of the iterates in the DSPKW algorithm (Section III-C) under the following assumptions. Note that $\|\cdot\|$ denotes the 2-norm.
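- A1. The map $J: \mathbb{R}^p \to \mathbb{R}$ is Lipschitz continuous and is differentiable with bounded second order derivatives. Further, the map $L: \mathbb{R}^p \to \mathbb{R}^p$ defined as $L(\theta) = -\nabla J(\theta)$ is Lipschitz continuous.
- A2. The step-size sequences satisfy
[TABLE]
Further, $a_j/a_n \to 1$ as $n \to \infty$, for all $j \in \{n, n+1, \dots, n+M\}$ for any given $M > 0$, and $b_n = a_n/\delta_n$ is such that $b_j/b_n \to 1$ as $n \to \infty$, for all $j \in \{n, n+1, \dots, n+M\}$.
- A3. .
- A4. The iterates remain uniformly bounded almost surely, i.e., $\sup_n \|\theta_n\| < \infty$ almost surely.
- A5. The ODE $\dot{\theta}(t) = -\nabla J(\theta(t))$ has a compact set as its set of asymptotically stable equilibria (i.e., the set of local minima of $J$ is compact).
- A6. The sequences $M_n^+$ and $M_n^-$ form martingale difference sequences. Further, $M_n^+$, $M_n^-$ are square integrable random variables satisfying
[TABLE]
for a given constant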
Remark 3.
Assumptions A1, A2 and A5 are standard stochastic approximation conditions. Assumption A3 trivially follows from Remark 1. Assumption A4 is the stability condition on the iterates and holds in many applications [7] (see the discussion on pp. 40-41 of [3]). This condition can also be enforced by projecting the iterates onto a compact set; in that case, however, the iterates converge to a limiting set that contains all possible limit points (see p. 191 of [3]). Assumption A6 gives the condition on the maximum strength of the martingale difference noise under which convergence of the iterates can be ensured; in many stochastic optimization settings this condition can be easily verified using Jensen’s inequality and the Lipschitz continuity of $J$.
The following two lemmas aid in the proof of almost sure convergence of the iterates in the DSPKW algorithm.
Lemma 3.
Given any fixed integer $P > 0$, $\|\theta_{m+k} - \theta_m\| \to 0$ almost surely as $m \to \infty$, for all $k \in \{1, \dots, P\}$.
Proof.
Fix a Now
[TABLE]
where . Thus,
[TABLE]
Now clearly, forms a martingale sequence with respect to the filtration . Further, from the assumption (A6) we have,
[TABLE]
From the assumption (A4), the quadratic variation process of converges almost surely. Hence by the martingale convergence theorem, it follows that converges almost surely. Hence almost surely as Moreover
[TABLE]
since Note that
[TABLE]
where is the Lipschitz constant of the function Hence,
[TABLE]
for max Similarly,
[TABLE]
From assumption (A1), it follows that
[TABLE]
for some Thus,
proving the lemma. ∎
Lemma 4.
$$\Big\|\sum_{n=m}^{m+P-1}\frac{a_{n}}{a_{m}}D_{n}\nabla J(\theta_{n})\Big\| \text{ and } \Big\|\sum_{n=m}^{m+P-1}\frac{b_{n}}{b_{m}}d_{n}J(\theta_{n})\Big\| \rightarrow 0 \text{ almost surely as } m \to \infty.$$
Proof.
From Lemma 3, it can be seen that as for all Also, from assumption (A1), we have as for all Now from Lemma 2, Hence Consider first
[TABLE]
from assumptions (A1) and (A2). Now observe that as for all as a consequence of (A1) and Lemma 3. Moreover from we have
[TABLE]
The claim now follows as a consequence of assumptions (A1) and (A2). ∎
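Finally, using the following theorems, we conclude the analysis by proving the almost sure convergence of the iterates to the set of local minima of the function $J$.
Theorem 5.
The iterates $\theta_n$, $n \ge 0$, obtained from DSPKW-2C converge almost surely to the set of asymptotically stable equilibria in A5 (the set of local minima of $J$).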
Proof.
Note that
[TABLE]
It follows that
[TABLE]
Now the fourth term on the RHS above can be written as
[TABLE]
where from Lemma 4. Thus, the algorithm is asymptotically analogous to
[TABLE]
Hence, from Theorem 2 in Chapter 2 of [borkar2008stochastic], it follows that the iterates $\theta_n$ converge to a local minimum of the function $J$. ∎
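Theorem 6.
The iterates $\theta_n$, $n \ge 0$, obtained from DSPKW-1C converge almost surely to the set of asymptotically stable equilibria in A5 (the set of local minima of $J$).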
Proof.
Note that
[TABLE]
It follows that
[TABLE]
Now we observe that the third term on the RHS above is
[TABLE]
where by Lemma 4. Similarly
[TABLE]
with by Lemma 4. The rest follows as in Theorem 5. ∎
V Simulation Experiments
In this section, we compare the numerical performance of our DSPKW-2C algorithm against the RDKW algorithm that uses random Bernoulli perturbations and another variant of the RDKW algorithm that uses Hadamard matrix based deterministic perturbations. We refer to them by the acronyms RDKW-2R and RDKW-2H respectively. In a similar manner, we also compare the DSPKW-1C algorithm against the one-simulation variants RDKW-1R and RDKW-1H. Note that the 2 or 1 in the acronyms of these algorithms denotes the number of simulations utilized per iteration. The implementation is available at https://github.com/cs1070166/1RDSA-2Cand1RDSA-1C/
V-A Experimental setup
For the empirical performance evaluation, we consider the following two loss functions:
Quadratic loss
[TABLE]
Fourth-order loss
[TABLE]
In the loss functions considered above, we set the dimension . We choose $A$ such that $pA$ is an upper triangular matrix with each nonzero entry equal to one and $b$ is a $p$-dimensional vector of ones. In our experiments, we follow the same noise assumptions considered in [16], i.e., for any $\theta$, the additive noise in the objective is given by where . In all algorithms, we set the step-size schedule as and with and . Note that the chosen values for and have demonstrated good finite-sample performance empirically, while satisfying the theoretical requirements needed for asymptotic convergence (see [16]). We set the same initial point for all the algorithms.
We consider two settings in our experiments. In the first, noise-free setting, we do not add any noise to the objective function evaluations, and in the second setting, we corrupt the function evaluations by adding noise (with variance parameter as described above). We evaluate the performance of these algorithms based on the Normalized Mean Square Error (NMSE) metric. NMSE is defined as the ratio $\|\theta_N - \theta^*\|^2 / \|\theta_0 - \theta^*\|^2$, where $\theta_N$ is the parameter returned by the algorithm, $\theta_0$ is the initial point and $\theta^*$ is the optimal parameter.
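The metric can be computed with a few lines of code. The sketch below assumes the common definition of NMSE as the squared distance of the returned parameter from the optimum, normalized by the squared distance of the initial point from the optimum; the function name and the example vectors are hypothetical.

```python
import numpy as np

def nmse(theta_final, theta_init, theta_star):
    """Normalized squared error ||theta_final - theta*||^2 / ||theta_init - theta*||^2.

    Assumed definition; a value of 1 means no progress from the initial point.
    """
    theta_final = np.asarray(theta_final, dtype=float)
    theta_init = np.asarray(theta_init, dtype=float)
    theta_star = np.asarray(theta_star, dtype=float)
    numerator = np.sum((theta_final - theta_star) ** 2)
    denominator = np.sum((theta_init - theta_star) ** 2)
    return float(numerator / denominator)

# Hypothetical example: the returned point is halfway between start and optimum.
theta_star = np.array([1.0, 1.0])
theta_init = np.array([0.0, 0.0])
halfway = nmse([0.5, 0.5], theta_init, theta_star)
```

Normalizing by the initial distance makes values comparable across initial points and dimensions, which is why the same initial point is used for all algorithms.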
V-B Discussion of Results
The performance comparisons of all the algorithms based on NMSE values are summarized in Tables I, II, III and IV. In the tables, we have highlighted the algorithm that has the minimum NMSE. We summarize our findings:
- Even in the absence of noise, due to the random directions chosen by the RDKW-2R and RDKW-1R algorithms, their standard deviation is significantly higher than that of the corresponding deterministic counterparts.
- We would like to emphasize that the quality of the solution (characterized by standard deviation) is significantly better for the proposed deterministic perturbations than for the existing Hadamard based deterministic perturbations and random perturbations. Note, however, that we do not make comparisons between two-simulation and one-simulation algorithms.
- In the case of two-simulation algorithms (see Tables I and II), DSPKW-2C performs marginally better than RDKW-2H, while both of them outperform RDKW-2R significantly.
- In the case of one-simulation algorithms (see Tables III and IV), DSPKW-1C performs better than both RDKW-1H and RDKW-1R.
VI Conclusions
We have generalized the deterministic perturbation sequences for the RDKW algorithm beyond the lexicographical ordering and Hadamard matrix based constructions, and presented a novel construction of deterministic perturbations that has the least cycle length within the class of deterministic perturbation sequences. Further, we have proved the almost sure convergence of the iterates for the class of deterministic perturbation sequences. Now that we have a characterization of the class of deterministic perturbation sequences, it would be interesting, as future work, to theoretically study and compare the rate of convergence of deterministic perturbation algorithms against their random perturbation counterparts. A challenging future direction would be to study the asymptotic normality or weak convergence of the iterates. It would also be interesting to similarly characterize the class of deterministic perturbation sequences for the SPSA algorithm.
References
- [1] J. Kiefer and J. Wolfowitz, “Stochastic estimation of the maximum of a regression function,” Ann. Math. Statist., vol. 23, no. 3, pp. 462–466, 1952. [Online]. Available: http://dx.doi.org/10.1214/aoms/1177729392
- [2] S. Bhatnagar, H. L. Prasad, and L. A. Prashanth, Stochastic Recursive Algorithms for Optimization: Simultaneous Perturbation Methods (Lecture Notes in Control and Information Sciences). Springer, 2013, vol. 434.
- [3] H. J. Kushner and D. S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer Verlag, 1978.
- [4] Y. M. Ermol’ev, “On the method of generalized stochastic gradients and quasi-Fejér sequences,” Cybernetics, vol. 5, no. 2, pp. 208–220, 1969.
- [5] M. Styblinski and T.-S. Tang, “Experiments in nonconvex optimization: stochastic approximation with function smoothing and simulated annealing,” Neural Networks, vol. 3, no. 4, pp. 467–483, 1990.
- [6] L. Prashanth, S. Bhatnagar, M. Fu, and S. Marcus, “Adaptive system optimization using random directions stochastic approximation,” IEEE Transactions on Automatic Control, vol. 62, no. 5, pp. 2223–2238, 2017.
- [7] J. C. Spall, “Multivariate stochastic approximation using a simultaneous perturbation gradient approximation,” IEEE Trans. Auto. Cont., vol. 37, no. 3, pp. 332–341, 1992.
- [8] D. C. Chin, “Comparative study of stochastic algorithms for system optimization based on gradient approximations,” IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 27, no. 2, pp. 244–249, 1997.
