Generalized Deterministic Perturbations For Stochastic Gradient Search

K. Chandramouli, K. J. Prabuchandran, D. Sai Koti Reddy, and Shalabh Bhatnagar

arXiv:1702.06250·cs.SY·August 3, 2018

TL;DR

This paper characterizes and constructs optimal deterministic perturbation sequences for stochastic gradient search, improving the RDKW algorithm's performance and convergence over random and previously known deterministic methods.

Contribution

It introduces a generalized class of deterministic perturbations for RDKW, including a sequence with minimal cycle length, and proves convergence.

Findings

01. The proposed deterministic sequence outperforms Hadamard and random perturbations in simulations.

02. Convergence of RDKW is established for the new class of deterministic perturbations.

03. The set of deterministic perturbations available for stochastic gradient search is expanded.

Tables (4)

Table 1 (TABLE I). NMSE values of two-simulation methods for the quadratic objective (16), without and with noise, for 2000 simulations. The standard deviation from 100 replications is shown after the ± symbol.

Noise parameter $σ = 0$
Method	NMSE
RDKW-2R	$5.755 \times 10^{- 3} \pm 2.460 \times 10^{- 3}$
RDKW-2H	$1.601 \times 10^{- 5} \pm 2.724 \times 10^{- 20}$
DSPKW-2C	$2.474 \times 10^{- 8} \pm 1.995 \times 10^{- 23}$
Noise parameter $σ = 0.01$
Method	NMSE
RDKW-2R	$5.762 \times 10^{- 3} \pm 2.473 \times 10^{- 3}$
RDKW-2H	$4.012 \times 10^{- 5} \pm 1.654 \times 10^{- 5}$
DSPKW-2C	$2.188 \times 10^{- 5} \pm 9.908 \times 10^{- 6}$

Table 2 (TABLE II). NMSE values of two-simulation methods for the fourth order objective (17), without and with noise, for 10000 simulations. The standard deviation from 100 replications is shown after the ± symbol.

Noise parameter $σ = 0$
Method	NMSE
RDKW-2R	$2.747 \times 10^{- 2} \pm 1.413 \times 10^{- 2}$
RDKW-2H	$3.901 \times 10^{- 3} \pm 4.359 \times 10^{- 18}$
DSPKW-2C	$3.535 \times 10^{- 3} \pm 1.743 \times 10^{- 18}$
Noise parameter $σ = 0.01$
Method	NMSE
RDKW-2R	$2.762 \times 10^{- 2} \pm 1.415 \times 10^{- 2}$
RDKW-2H	$3.958 \times 10^{- 3} \pm 4.227 \times 10^{- 4}$
DSPKW-2C	$3.598 \times 10^{- 3} \pm 4.158 \times 10^{- 4}$

Table 3 (TABLE III). NMSE values of one-simulation methods for the quadratic objective (16), without and with noise, for 20000 simulations. The standard deviation from 100 replications is shown after the ± symbol.

Noise parameter $σ = 0$
Method	NMSE
RDKW-1R	$8.584 \times 10^{- 2} \pm 3.681 \times 10^{- 2}$
RDKW-1H	$2.770 \times 10^{- 2} \pm 3.836 \times 10^{- 17}$
DSPKW-1C	$8.225 \times 10^{- 3} \pm 1.569 \times 10^{- 17}$
Noise parameter $σ = 0.01$
Method	NMSE
RDKW-1R	$8.582 \times 10^{- 2} \pm 3.691 \times 10^{- 2}$
RDKW-1H	$2.774 \times 10^{- 2} \pm 2.578 \times 10^{- 4}$
DSPKW-1C	$8.225 \times 10^{- 3} \pm 5.959 \times 10^{- 5}$

Table 4 (TABLE IV). NMSE values of one-simulation methods for the fourth order objective (17), without and with noise, for 20000 simulations. The standard deviation from 100 replications is shown after the ± symbol.

Noise parameter $σ = 0$
Method	NMSE
RDKW-1R	$3.192 \times 10^{- 1} \pm 1.991 \times 10^{- 1}$
RDKW-1H	$8.173 \times 10^{- 2} \pm 1.255 \times 10^{- 16}$
DSPKW-1C	$4.403 \times 10^{- 2} \pm 9.066 \times 10^{- 17}$
Noise parameter $σ = 0.01$
Method	NMSE
RDKW-1R	$3.240 \times 10^{- 1} \pm 1.836 \times 10^{- 1}$
RDKW-1H	$8.916 \times 10^{- 2} \pm 1.896 \times 10^{- 2}$
DSPKW-1C	$4.972 \times 10^{- 2} \pm 9.812 \times 10^{- 3}$

Equations (102)

\theta^{*} = \arg\min_{\theta \in \mathbb{R}^{p}} J(\theta).

\theta_{n+1} = \theta_{n} - a_{n} \widehat{\nabla J}(\theta_{n}),

\widehat{\nabla J}(\theta) = \frac{J(\theta + \delta d) - J(\theta - \delta d)}{2\delta} d,

J(\theta \pm \delta d) = J(\theta) \pm \delta d^{T} \nabla J(\theta) + o(\delta^{2}).

\frac{J(\theta + \delta d) - J(\theta - \delta d)}{2\delta} d - \nabla J(\theta) = (d d^{T} - I) \nabla J(\theta) + o(\delta).

\mathbb{E}\big[d d^{T}\big] = I.

\widehat{\nabla J}(\theta) = \frac{J(\theta + \delta d)}{\delta} d.

\frac{J(\theta + \delta d)}{\delta} d - \nabla J(\theta) = \frac{J(\theta)}{\delta} d + (d d^{T} - I) \nabla J(\theta) + O(\delta).

\mathbb{E}[d] = 0.

X X^{T} = P I_{(p+1) \times (p+1)},

Y = \left[\begin{array}{cc} Z & -ZU \end{array}\right]

Z Z^{T} + Z U U^{T} Z^{T} = Z (I + U U^{T}) Z^{T} = P I.

C = \begin{bmatrix} 2 & 1 & 1 & \cdots & 1 \\ 1 & 2 & 1 & \cdots & 1 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & 1 & \cdots & 2 \end{bmatrix}.

Y = \sqrt{p+1}\, [C^{-1/2}, -C^{-1/2} u].

Y = \sqrt{p+1}\, [C^{-1/2}, -C^{-1/2} u],

\theta_{n+1} = \theta_{n} - a_{n} \widehat{\nabla J}(\theta_{n})

\widehat{\nabla J}(\theta_{n}) = \left[\frac{(y_{n}^{+} - y_{n}^{-}) d_{n}}{2\delta_{n}}\right],

\widehat{\nabla J}(\theta_{n}) = \left[\frac{y_{n}^{+} d_{n}}{\delta_{n}}\right],

(I + u u^{T})^{-1/2} = I - \frac{u u^{T}}{p} + \frac{u u^{T}}{p\sqrt{1+p}}.

(I + u u^{T}) \Bigg[ I - \frac{u u^{T}}{p} + \frac{u u^{T}}{p\sqrt{1+p}} \Bigg]^{2} = I.

a_{n}, \delta_{n} \rightarrow 0, \quad \sum_{n} a_{n} = \infty, \quad \sum_{n} \Big( \frac{a_{n}}{\delta_{n}} \Big)^{2} < \infty.

\mathbb{E}\big[\|M_{n+1}^{\pm}\|^{2} \mid \mathcal{F}_{n}\big] \leq K (1 + \|\theta_{n}\|^{2}) \text{ a.s.}, \ \forall n \geq 0,

\begin{split}\theta_{n+k} = \theta_{n} &- \sum_{j=n}^{n+k-1} a_{j} \Bigg( \frac{J(\theta_{j} + \delta_{j} d_{j}) - J(\theta_{j} - \delta_{j} d_{j})}{2\delta_{j}} \Bigg) d_{j} \\ &- \sum_{j=n}^{n+k-1} a_{j} M_{j+1},\end{split}

\begin{split}\|\theta_{n+k} - \theta_{n}\| &\leq \sum_{j=n}^{n+k-1} a_{j} \Bigg| \frac{J(\theta_{j} + \delta_{j} d_{j}) - J(\theta_{j} - \delta_{j} d_{j})}{2\delta_{j}} \Bigg| \|d_{j}\| \\ &+ \sum_{j=n}^{n+k-1} a_{j} \|M_{j+1}\|.\end{split}

\sum_{m=0}^{n} \mathbb{E}\big[\|N_{m+1} - N_{m}\|^{2} \mid \mathcal{F}_{m}\big] \leq \sum_{m=0}^{n} a_{m}^{2} K (1 + \|\theta_{m}\|^{2}).

\begin{split}\Big\| \Big( J(\theta_{j} + \delta_{j} d_{j}) - J(\theta_{j} - \delta_{j} d_{j}) \Big) d_{j} \Big\| &\leq \Big| J(\theta_{j} + \delta_{j} d_{j}) - J(\theta_{j} - \delta_{j} d_{j}) \Big| \, \|d_{j}\| \\ &\leq K_{0} \Big( |J(\theta_{j} + \delta_{j} d_{j})| + |J(\theta_{j} - \delta_{j} d_{j})| \Big),\end{split}

|J(\theta_{j} + \delta_{j} d_{j})| - |J(0)| \leq \hat{B} \|\theta_{j} + \delta_{j} d_{j}\|,

Taxonomy

Topics: Metaheuristic Optimization Algorithms Research · Stochastic Gradient Optimization Techniques · Neural Networks and Applications

Full text

Generalized Deterministic Perturbations For Stochastic Gradient Search

Chandramouli K.$^{1}$, Prabuchandran K.J.$^{1,2}$, D. Sai Koti Reddy$^{3}$, and Shalabh Bhatnagar$^{1,4}$

$^{1}$ Department of Computer Science and Automation, Indian Institute of Science (IISc)

$^{2}$ Supported by Amazon-IISc Postdoctoral fellowship

$^{3}$ IBM Research, Bangalore

$^{4}$ Robert Bosch Centre for Cyber-Physical Systems, IISc

Abstract

Stochastic optimization (SO) considers the problem of optimizing an objective function in the presence of noise. Most of the solution techniques in SO estimate gradients from the noise-corrupted observations of the objective and adjust the parameters of the objective along the direction of the estimated gradients to obtain locally optimal solutions. Two prominent algorithms in SO, namely Random Direction Kiefer-Wolfowitz (RDKW) and Simultaneous Perturbation Stochastic Approximation (SPSA), obtain noisy gradient estimates by randomly perturbing all the parameters simultaneously. This forces the search direction to be random in these algorithms and causes them to suffer additional noise on top of the noise incurred from the samples of the objective. Owing to this additional noise, the idea of using deterministic perturbations instead of random perturbations for gradient estimation has also been studied. Two specific constructions of the deterministic perturbation sequence, using lexicographical ordering and Hadamard matrices, have been explored, and encouraging results have been reported in the literature. In this paper, we characterize the class of deterministic perturbation sequences that can be utilized in the RDKW algorithm. This class expands the set of known deterministic perturbation sequences available in the literature. Using our characterization, we propose a construction of a deterministic perturbation sequence that has the least cycle length among all deterministic perturbations. Through simulations we illustrate the performance gain of the proposed deterministic perturbation sequence in the RDKW algorithm over the Hadamard and the random perturbation counterparts. We also establish the convergence of the RDKW algorithm for the generalized class of deterministic perturbations.

I Introduction

Stochastic optimization (SO) problems frequently arise in engineering disciplines such as transportation systems, machine learning, service systems, and manufacturing. Practical limitations, lack of model information and the large dimensionality of these problems prohibit analytic solutions. Simulation is often employed to evaluate the performance of the current parameters of the system. Simulating and evaluating the system’s performance is generally expensive, and one is typically constrained by a simulation budget. In such scenarios, one aims to drive the system to optimal parameter settings using as few simulations as possible.

Under the SO framework, we have a system that gives noise-corrupted feedback of the performance for the currently set parameters, i.e., given the system parameter vector $\theta$ , the feedback that is available is the noisy evaluation $h(\theta,\xi)$ of the performance $J(\theta)=\mathbb{E}_{\xi}[h(\theta,\xi)]$ where $\xi$ is the noise term inherent in the system and $J(\theta)$ denotes the expected performance of the system for the parameter $\theta$ . The pictorial description of such a system is shown in Figure 1. The objective in the SO problem then is to determine a parameter $\theta^{*}$ that gives the optimal expected performance of the system, i.e.,

$$\theta^{*}=\arg\min_{\theta\in\mathbb{R}^{p}}J(\theta). \tag{1}$$

Analogous to solutions for deterministic optimization problems, where the explicit analytic gradient of the objective function is used to adjust the parameters along the negative gradient direction, many of the solution approaches in SO mimic the familiar gradient descent algorithm. However, unlike the deterministic setting, the SO setting only has access to noise-corrupted samples of the objective. Thus, in the SO setting, one essentially aims at estimating the gradient of the objective function using noisy cost samples. In the pioneering work by Kiefer and Wolfowitz [1], the gradient is estimated by approximating each of the partial derivatives using either a two-sided or a one-sided finite difference; we refer to the resulting scheme as the finite difference stochastic approximation (FDSA) algorithm. For a $p$-dimensional parameter, this algorithm requires $2p$ objective function evaluations (or simulations) per iteration for the two-sided gradient approximation scheme and $p+1$ simulations per iteration for the one-sided scheme (see [2]). As the number of simulations per iteration required for gradient estimation scales linearly with the dimension of the problem, the FDSA algorithm is expensive to deploy in high-dimensional parameter settings.
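The $2p$ evaluation count of the two-sided scheme can be made concrete with a minimal sketch. The function name `fdsa_gradient`, the test objective `quadratic` and the call counter are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def fdsa_gradient(noisy_J, theta, delta):
    """Two-sided finite-difference gradient estimate (2p calls to noisy_J)."""
    p = theta.size
    grad = np.zeros(p)
    for i in range(p):
        e_i = np.zeros(p)
        e_i[i] = 1.0  # i-th canonical basis vector
        # One pair of evaluations per coordinate.
        grad[i] = (noisy_J(theta + delta * e_i)
                   - noisy_J(theta - delta * e_i)) / (2.0 * delta)
    return grad

# Hypothetical noiseless objective that also counts how often it is called.
calls = {"count": 0}

def quadratic(theta):
    calls["count"] += 1
    return float(theta @ theta)

theta = np.array([1.0, -2.0, 0.5])
estimate = fdsa_gradient(quadratic, theta, delta=1e-3)
```

For this three-dimensional example the counter ends at six evaluations, which is the linear scaling that RDKW and SPSA avoid.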

In [3], the Random Direction Kiefer-Wolfowitz (RDKW) algorithm, which uses only two simulations per iteration to obtain gradient estimates, was proposed. In the RDKW algorithm, all the parameters are randomly perturbed simultaneously using two parallel simulations, and the function evaluations at the perturbed parameters are used to obtain the gradient estimate. The random perturbation vector and the random direction vector involved in estimating the gradient are kept the same. For the choice of random direction (or perturbation), various distributions have been explored, such as the spherical uniform distribution [3], the uniform distribution [4], the normal and Cauchy distributions [5], and the asymmetric Bernoulli distribution [6]. The number of simulations required for estimating the gradients in the RDKW algorithm is significantly smaller than in the FDSA algorithm, and the algorithm is seen to perform empirically better than FDSA.

In a seminal work [7], the Simultaneous Perturbation Stochastic Approximation (SPSA) algorithm, which like RDKW uses two simulations, was proposed. Unlike the RDKW algorithm, SPSA employs different choices for the parameter perturbations and the random direction of movement: the direction of movement is the componentwise inverse of the perturbation vector. In [7], the symmetric Bernoulli distribution has been shown to be the best choice for random perturbations among all the distributions, and the proposed SPSA scheme has been proven to perform asymptotically better than FDSA. In [8], a comprehensive comparative study of the stochastic optimization algorithms FDSA, RDKW and SPSA has been provided. Further, under a general third order cross derivative assumption on the loss function, RDKW with the symmetric Bernoulli distribution has been shown to be the best choice for random directions. In [9], an example of a loss function that does not satisfy the third order cross derivative condition in [8] has been constructed. For such a loss function, it has been shown that the optimal distribution choice for random directions need not be symmetric Bernoulli.

In [3] and [10], to further reduce the simulation cost per iteration, extensions of the RDKW and SPSA algorithms that estimate the gradient with only one simulation or measurement of the objective have been considered. However, the one-simulation gradient estimate is observed to have higher bias than the two-simulation gradient estimate. In [11] and [12], deterministic conditions on the perturbation and noise sequences required to obtain almost sure convergence of the iterates have been discussed. In [13], to enhance the performance of the one-sided SPSA scheme, deterministic perturbations based on lexicographical ordering and Hadamard matrices have been proposed. Further, the numerical results in [13] illustrate the benefit of Hadamard matrix based perturbation sequences, which are shown to improve the performance of SPSA empirically for the case of one-sided measurements. In [14], a unified view of both RDKW and SPSA is presented, and a binary deterministic perturbation sequence using orthogonal arrays [15] for obtaining the gradient estimate in both algorithms has been discussed.

In this paper, we generalize the class of deterministic perturbation sequences that can be utilized in the RDKW algorithm. Based on this characterization, we provide a construction of a deterministic perturbation sequence using a specially chosen circulant matrix. We empirically study the performance of the constructed sequence against the aforementioned Hadamard matrix based deterministic perturbations and the randomized perturbations. We expect that our generalization will make it possible to study the rate of convergence of the RDKW algorithm based on deterministic perturbation sequences. We now summarize our contributions:

• We generalize the class of deterministic perturbation sequences that can be applied in the RDKW algorithm.

• We provide a special construction of a deterministic perturbation sequence with a smaller cycle length than the Hadamard perturbation sequence.

• We illustrate the performance gain of the proposed deterministic perturbations over the Hadamard matrix based perturbations as well as random perturbations.

• We prove the convergence of the RDKW algorithm for the class of deterministic perturbations.

II Conditions on Deterministic Perturbations

In this section, we describe the classical RDKW algorithm and motivate the necessary conditions that a deterministic perturbation sequence should satisfy for almost sure convergence of the iterates in the deterministic perturbation version of RDKW algorithm.

The standard RDKW algorithm iteratively updates the parameter vector along the direction of the negative estimated gradient, i.e.,

$$\theta_{n+1}=\theta_{n}-a_{n}\widehat{\nabla J}(\theta_{n}), \tag{2}$$

where $a_{n}$ is the step-size that satisfies standard stochastic approximation conditions (see Assumption A2 in section IV) and $\widehat{\nabla J}$ is the estimate of the gradient of the objective function $J$ at the current parameter.

In the case of two-simulation RDKW algorithm, the gradient estimate at $\theta$ is obtained as

$$\widehat{\nabla J}(\theta)=\frac{J(\theta+\delta d)-J(\theta-\delta d)}{2\delta}d, \tag{3}$$

where $d$ is the random perturbation direction chosen according to a specific probability distribution. The properties that the specific distribution on $d$ should satisfy can be obtained as explained below. The Taylor series expansion of $J(\theta\pm\delta d)$ around $\theta$ is given by

$$J(\theta\pm\delta d)=J(\theta)\pm\delta d^{T}\nabla J(\theta)+o(\delta^{2}). \tag{4}$$

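From (4), the error between the estimate and the true gradient at $\theta$ can be obtained as

$$\frac{J(\theta+\delta d)-J(\theta-\delta d)}{2\delta}d-\nabla J(\theta)=(dd^{T}-I)\nabla J(\theta)+o(\delta). \tag{5}$$

Note that the term $(dd^{T}-I)\nabla J(\theta)$ constitutes the bias in the gradient estimate. For the error in (5) to be negligible, we require

$$\mathbb{E}\big[dd^{T}\big]=I. \tag{6}$$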

Here, the expectation $\mathbb{E}[\cdot]$ is taken over the random perturbation distribution.

In the one-simulation version of the RDKW algorithm, the gradient estimate at $\theta$ is obtained as

$$\widehat{\nabla J}(\theta)=\frac{J(\theta+\delta d)}{\delta}d. \tag{7}$$

By an analogous Taylor series argument, we obtain the error between the estimate and the true gradient as

$$\frac{J(\theta+\delta d)}{\delta}d-\nabla J(\theta)=\frac{J(\theta)}{\delta}d+(dd^{T}-I)\nabla J(\theta)+O(\delta). \tag{8}$$

From (8), for the one-simulation version of the RDKW algorithm with random perturbations, we require the following to hold in addition to (6):

$$\mathbb{E}[d]=0. \tag{9}$$

For random perturbations $d\sim F$, where $F$ is any distribution that satisfies (6) and (9), the noise in the gradient estimates averages out asymptotically. An example distribution for $F$ is the symmetric Bernoulli distribution, where each component of the perturbation vector is $\pm 1$ with equal probability.
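The two moment conditions can be checked exactly for symmetric Bernoulli perturbations in a small dimension by enumerating all sign vectors. The sketch below is illustrative only: the dimension `p = 3`, the linear objective with gradient `g`, and all variable names are assumptions made for the example.

```python
import itertools
import numpy as np

p = 3
# All 2^p sign vectors; symmetric Bernoulli puts equal mass on each of them.
signs = np.array(list(itertools.product([-1.0, 1.0], repeat=p)))

mean_d = signs.mean(axis=0)                                    # E[d]
mean_ddT = np.einsum("ki,kj->ij", signs, signs) / len(signs)   # E[d d^T]

# Hypothetical linear objective J(theta) = g^T theta. Its two-sided estimate
# equals (d d^T) g exactly, so averaging over d should recover g.
g = np.array([0.5, -1.0, 2.0])

def J(theta):
    return float(g @ theta)

theta, delta = np.zeros(p), 0.1
estimates = np.array([
    (J(theta + delta * d) - J(theta - delta * d)) / (2.0 * delta) * d
    for d in signs
])
mean_estimate = estimates.mean(axis=0)
```

Each individual estimate is biased by $(dd^{T}-I)g$; only the average over the perturbation distribution removes that term, which is the behaviour the deterministic sequences are designed to reproduce over one cycle.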

Conditions (6) and (9) motivate looking for deterministic perturbations that satisfy similar properties. In what follows, the sequence of deterministic perturbations (that will be used in either (3) or (7)) will be denoted by $\{d_{n}\}_{n\geq 1}$, and we require the following two properties to hold for the perturbation sequence $d_{n}$ for the almost sure convergence of the iterates to a local minimum.

P1. Let $D_{n}:=d_{n}d_{n}^{T}-I_{p\times p}.$ For any $s\in\mathbb{N}$ there exists a $P\in\mathbb{N}$ such that $\sum\limits_{n=s+1}^{s+P}D_{n}=0$, and

P2. $\sum\limits_{n=s+1}^{s+P}d_{n}=0.$

Remark 1.

The properties P1 and P2 are the deterministic analogues of (6) and (9). For the properties P1 and P2 to hold, it is sufficient to determine a finite sequence $\{d_{1},d_{2},\dots,d_{P}\}$ such that $\sum_{n=1}^{P}d_{n}d_{n}^{T}=PI$ and $\sum_{n=1}^{P}d_{n}=0$ and, for $n\geq P+1$, to cycle periodically through this sequence, i.e., set $d_{n}=d_{((n-1)\%P)+1}$, so that $d_{P+1}=d_{1}$. We refer to the length $P$ of the deterministic perturbation sequence as the cycle length.
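The cycling rule in Remark 1 can be sketched as follows. The base sequence `BASE` (for $p=3$, $P=4$) is a hard-coded illustrative example, and the helper names `perturbation` and `window_sums` are hypothetical. With zero-based array storage, $d_{n}$ for $n\geq 1$ is read from position $(n-1) \bmod P$.

```python
import numpy as np

# Illustrative base cycle for p = 3, P = 4 (rows are d_1, ..., d_4).
BASE = np.array([[ 1.0,  1.0,  1.0],
                 [-1.0,  1.0, -1.0],
                 [ 1.0, -1.0, -1.0],
                 [-1.0, -1.0,  1.0]])

def perturbation(n, base=BASE):
    """Return d_n for n >= 1 by cycling periodically through the base sequence."""
    return base[(n - 1) % len(base)]

def window_sums(s, base=BASE):
    """Sums of D_n = d_n d_n^T - I and of d_n over n = s+1, ..., s+P."""
    P, p = base.shape
    ds = [perturbation(n, base) for n in range(s + 1, s + P + 1)]
    sum_D = sum(np.outer(d, d) - np.eye(p) for d in ds)
    sum_d = sum(ds)
    return sum_D, sum_d
```

Because any window of $P$ consecutive indices visits every element of the base cycle exactly once, both sums vanish for every starting offset `s`, not only for `s = 0`.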

III Construction Of Deterministic Perturbations

In section III-A, following Remark 1, we first characterize the finite sequences $\{d_{1},d_{2},\dots,d_{P}\}$ that satisfy properties P1 and P2 by providing a matrix equation whose solution gives the deterministic perturbations. In Section III-B, we then construct a specific sequence using a circulant matrix that has the least possible cycle length among all the deterministic perturbation sequences. Finally in section III-C, we completely describe the RDKW algorithm that uses the deterministic perturbation sequence constructed using the circulant matrix approach.

III-A Matrix condition for Deterministic Perturbations

The properties P1 and P2 can be satisfied individually. For example, to satisfy property P1, let $P=p$ and $d_{n}=\sqrt{p}e_{n},~n\in\{1,\ldots,P\}$, the scaled canonical basis vectors; then $\sum_{n=1}^{P}d_{n}d_{n}^{T}=\sum_{n=1}^{p}pe_{n}e_{n}^{T}=pI$. To satisfy property P2, consider any set of linearly dependent vectors $\{v_{1},\cdots,v_{P}\}$. Then there exist scalars $\alpha_{1},\cdots,\alpha_{P}$, not all zero, such that $\sum_{n=1}^{P}\alpha_{n}v_{n}=0$. Now for the choice $d_{n}=\alpha_{n}v_{n}$, the property P2, $\sum_{n=1}^{P}d_{n}=\sum_{n=1}^{P}\alpha_{n}v_{n}=0$, is trivially satisfied. A natural question is how to determine sequences $\{d_{n}\}_{1\leq n\leq P}$ that satisfy both properties simultaneously.

To address this problem, let us consider a $p\times P$ matrix $Y$ as follows: $Y:=\left[\begin{array}{cccc}\uparrow&\uparrow&\cdots&\uparrow\\ d_{1}&d_{2}&\cdots&d_{P}\\ \downarrow&\downarrow&\cdots&\downarrow\end{array}\right].$ Let $u=[1,1,\cdots,1]^{T}$ be a $P\times 1$ vector. The perturbations that satisfy properties P1 and P2 solve the two matrix equations $Yu=0$ and $YY^{T}=PI$. These equations can be written compactly as the single matrix equation

$$XX^{T}=PI_{(p+1)\times(p+1)}, \tag{10}$$

where $X=\left[\begin{array}[]{c}u^{T}\\ Y\end{array}\right]$ . Note that $Y_{p\times P}$ and $P$ are the unknowns here.

It can be observed from (10) that $\frac{X}{\sqrt{P}}$ can be treated as a $(p+1)\times P$ submatrix of a $P\times P$ orthogonal matrix with the first row being $\frac{u^{T}}{\sqrt{P}}$, a $1\times P$ vector. It has been shown in [13] that columns of Hadamard matrices satisfy properties P1 and P2 simultaneously with $\bar{P}=2^{\lceil\log_{2}(p+1)\rceil}$, i.e., $X$ is chosen as a $(p+1)\times 2^{\lceil\log_{2}(p+1)\rceil}$ submatrix of the Hadamard matrix. It is not clear in general whether equation (10) can be solved for a smaller $P<\bar{P}$.
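A Hadamard-based solution of the matrix condition can be sketched as below. This assumes the Sylvester doubling construction and takes the cycle length to be the smallest power of two that is at least $p+1$; the function names are hypothetical, and the row selection is one possible choice rather than necessarily the one used in [13].

```python
import numpy as np

def sylvester_hadamard(order):
    """Hadamard matrix of power-of-two order via Sylvester doubling."""
    H = np.array([[1.0]])
    while H.shape[0] < order:
        H = np.block([[H, H], [H, -H]])
    return H

def hadamard_perturbations(p):
    """Return (Y, P): Y is p x P and its columns are the perturbations."""
    # Smallest power of two that is >= p + 1.
    P = 1 << p.bit_length()
    H = sylvester_hadamard(P)
    # Row 0 of H is all ones and plays the role of u^T in X;
    # the next p rows form Y.
    return H[1:p + 1, :], P
```

For example, `hadamard_perturbations(4)` returns a cycle of length 8 even though only $p+1=5$ perturbations are strictly needed, which is the gap the circulant construction in the next subsection closes.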

Remark 2.

We note that similar analysis for matrix condition for the construction of deterministic perturbations for SPSA estimates involves solving the following matrix system. $AB=PI$ , $Au=0$ and $A\circ B^{T}=vu^{T}$ where $A$ is $p\times P$ , $B$ is $P\times p$ , $u$ is $P\times 1$ vector of ones, $v$ is $p\times 1$ vector of ones and $\circ$ denotes the Hadamard product of the matrices $A$ and $B$ . It is not clear how to solve for $P,$ $A$ and $B$ due to the presence of Hadamard product in this system.

III-B Specific Perturbation Sequence Construction

In this section, our goal is to obtain a sequence with least cycle length. Using a simple matrix rank argument it can be shown that $P$ is at least $p+1$ . Thus, in what follows, we give a construction of deterministic perturbation sequence with cycle length $P=p+1$ . We first write
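One way to spell out the rank argument, offered as a sketch since the text does not give the details, uses the two matrix equations $YY^{T}=PI$ and $Yu=0$ together with the rank-nullity theorem for the $p\times P$ matrix $Y$:

```latex
YY^{T} = P\, I_{p\times p} \;\Longrightarrow\; \operatorname{rank}(Y) = p,
\qquad
Yu = 0,\; u \neq 0 \;\Longrightarrow\; \dim\ker(Y) \geq 1,
\qquad\text{hence}\qquad
P = \operatorname{rank}(Y) + \dim\ker(Y) \geq p + 1 .
```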

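$$Y=\left[\begin{array}{cc}Z&-ZU\end{array}\right],$$

where $Z$ is a $p\times p$ matrix and $U$ is any $p\times(P-p)$ matrix whose columns sum to the all-ones vector (equivalently, every row of $U$ sums to 1). With this structure $Yu=0$, so property P2 is satisfied.

Satisfying property P1, i.e., $YY^{T}=PI$, is equivalent to

$$ZZ^{T}+ZUU^{T}Z^{T}=Z(I+UU^{T})Z^{T}=PI. \tag{11}$$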

Clearly construction of deterministic perturbations with smaller cycle length $P$ is equivalent to solving for $Z$ with an appropriate choice of $U$ .

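The simplest choice of $U$ whose columns sum to the all-ones vector is $U=u$, a $p\times 1$ vector of ones, which gives $P=p+1$. Let $C=I+UU^{T}=I+uu^{T}$, a $p\times p$ matrix:

$$C=\begin{bmatrix}2&1&1&\cdots&1\\ 1&2&1&\cdots&1\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 1&1&1&\cdots&2\end{bmatrix}. \tag{12}$$

Observe that $C$ is a positive definite circulant matrix. Hence $C^{-1/2}$ is well defined, and the choice $Z=\sqrt{p+1}\,C^{-1/2}$ satisfies (11) and solves the system $YY^{T}=PI$ with $P=p+1$, i.e.,

$$Y=\sqrt{p+1}\,[C^{-1/2},\,-C^{-1/2}u]. \tag{13}$$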

The columns of $Y$ finally give us the deterministic perturbations. We note that in general the computation of $C^{-1/2}$ is $O(p^{3})$ and can be very expensive for large $p$ . However owing to the special structure of $C$ , using a Sherman-Morrison type result (see Lemma 1, Section IV), $C^{-1/2}$ can be computed in $O(p^{2})$ time complexity.
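The construction can be sketched in a few lines. The function name `circulant_perturbations` is hypothetical, and the closed form used for $C^{-1/2}$ is the one stated in Lemma 1, so no eigendecomposition is needed.

```python
import numpy as np

def circulant_perturbations(p):
    """Return Y = sqrt(p+1) [C^{-1/2}, -C^{-1/2} u]; columns are d_1..d_{p+1}."""
    u = np.ones((p, 1))
    uuT = u @ u.T
    # Closed form of (I + u u^T)^{-1/2} from Lemma 1.
    C_inv_sqrt = np.eye(p) - uuT / p + uuT / (p * np.sqrt(1.0 + p))
    return np.sqrt(p + 1.0) * np.hstack([C_inv_sqrt, -C_inv_sqrt @ u])

Y = circulant_perturbations(4)   # 4 x 5 matrix: cycle length P = p + 1 = 5
```

Forming the dense $p\times p$ matrix already costs $O(p^{2})$ operations, which matches the complexity quoted above.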

III-C Gradient estimation

In this section, we present the RDKW algorithms that use the deterministic perturbation sequence constructed above in two-simulation and one-simulation gradient estimates of the objective. We denote the corresponding algorithms by DSPKW-2C and DSPKW-1C respectively.

Let $\delta_{n},n\geq 0$ denote a sequence of diminishing positive real numbers satisfying assumption A2 in Section IV. Let $y_{n}^{+}$, $y_{n}^{-}$ denote the noisy objective function evaluations at the perturbed parameters $\theta_{n}+\delta_{n}d_{n}$ and $\theta_{n}-\delta_{n}d_{n}$ respectively, i.e., $y_{n}^{+}=J(\theta_{n}+\delta_{n}d_{n})+M_{n+1}^{+}$ and $y_{n}^{-}=J(\theta_{n}-\delta_{n}d_{n})+M_{n+1}^{-}$. We assume that the noise terms $M_{n}^{+},M_{n}^{-}$ form martingale difference sequences, i.e., $\mathbb{E}\left[M_{n+1}^{+}|\mathcal{F}_{n}\right]=\mathbb{E}\left[M_{n+1}^{-}|\mathcal{F}_{n}\right]=0$, where $\mathcal{F}_{n}=\sigma(\theta_{m},M^{+}_{m},M^{-}_{m},~m\leq n)$ is the $\sigma$-field generated by the past parameter values and martingale difference terms.

The two-simulation and one-simulation estimates of the gradient $\nabla J(\theta_{n})$ based on the observed noisy objective samples for the RDKW algorithm are given by

$$\widehat{\nabla J}(\theta_{n})=\left[\frac{(y_{n}^{+}-y_{n}^{-})d_{n}}{2\delta_{n}}\right], \tag{14}$$

$$\widehat{\nabla J}(\theta_{n})=\left[\frac{y_{n}^{+}d_{n}}{\delta_{n}}\right], \tag{15}$$

respectively. Observe that the two-sided estimate (14) uses two function samples, $y_{n}^{+}$ and $y_{n}^{-}$, whereas the estimate in (15) uses only one function sample, $y_{n}^{+}$.

Now we briefly describe the DSPKW algorithm. Its inputs are a randomly chosen initial point $\theta_{0}$, diminishing sequences $\delta_{n}$ and $a_{n}$ satisfying assumption A2, and the matrix of deterministic perturbations $Y$ chosen according to (13). In our algorithms, we iteratively choose the perturbations by cycling through the columns of $Y$ with period $p+1$, and in steps 2-4 we update the parameters along the negative direction of the estimated gradient, according to (14) in the DSPKW-2C algorithm and according to (15) in the DSPKW-1C algorithm. Note that the choice of gradient estimate (or the algorithm) is dictated by the available simulation budget. The algorithms terminate by returning the parameter $\theta_{n_{end}}$ at the end of $n_{end}$ iterations.
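The description above can be sketched as a short loop. This is an illustration under stated assumptions, not the authors' implementation: the gain sequences $a_{n}=1/(n+p+1)$ and $\delta_{n}=0.1/(n+1)^{1/4}$, the test objective `noisy_quadratic`, its noise level and the function names are all hypothetical choices.

```python
import numpy as np

def circulant_perturbations(p):
    """Y = sqrt(p+1) [C^{-1/2}, -C^{-1/2} u], using the Lemma 1 closed form."""
    u = np.ones((p, 1))
    uuT = u @ u.T
    C_inv_sqrt = np.eye(p) - uuT / p + uuT / (p * np.sqrt(1.0 + p))
    return np.sqrt(p + 1.0) * np.hstack([C_inv_sqrt, -C_inv_sqrt @ u])

def dspkw(noisy_J, theta0, n_iter, two_sided=True):
    """Sketch of DSPKW-2C (two_sided=True) and DSPKW-1C (two_sided=False)."""
    theta = np.asarray(theta0, dtype=float).copy()
    p = theta.size
    Y = circulant_perturbations(p)
    for n in range(n_iter):
        a_n = 1.0 / (n + p + 1)              # assumed step size
        delta_n = 0.1 / (n + 1) ** 0.25      # assumed perturbation size
        d_n = Y[:, n % (p + 1)]              # cycle through columns of Y
        y_plus = noisy_J(theta + delta_n * d_n)
        if two_sided:
            y_minus = noisy_J(theta - delta_n * d_n)
            grad = (y_plus - y_minus) * d_n / (2.0 * delta_n)   # estimate (14)
        else:
            grad = y_plus * d_n / delta_n                       # estimate (15)
        theta = theta - a_n * grad
    return theta

# Hypothetical noisy quadratic objective with minimizer `target`.
rng = np.random.default_rng(0)
target = np.array([1.0, -2.0, 0.5])

def noisy_quadratic(theta, sigma=0.01):
    return float(np.sum((theta - target) ** 2) + sigma * rng.standard_normal())

theta_2c = dspkw(noisy_quadratic, np.zeros(3), n_iter=2000, two_sided=True)
```

The one-simulation path is included for completeness but is not run here: its estimate carries the additional $J(\theta)d/\delta$ term, so the gains chosen for this illustration may be too aggressive for it and would need separate tuning.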

IV Convergence Analysis

In this section we first provide a few lemmas that assist in computing the proposed deterministic perturbation sequence (see (13) in Section III-B). In the latter part of the section, we prove the almost sure convergence of the iterates for the class of deterministic perturbations characterized in Section III-A.

The following lemma is useful in obtaining the inverse square root of $C$, i.e., $C^{-1/2}$, in a computationally efficient manner. Using the lemma and the circulant structure of $C^{-1/2}$, it takes only $O(p^{2})$ operations to compute $C^{-1/2}$. The lemma could also be useful in other contexts where efficient computation is needed.

Lemma 1.

*Let $I$ be the $p\times p$ identity matrix and $u=[1,1,\cdots,1]^{T}$ be a $p\times 1$ column vector of 1s. Then*

$$(I+uu^{T})^{-1/2}=I-\frac{uu^{T}}{p}+\frac{uu^{T}}{p\sqrt{1+p}}.$$

Proof.

It is enough to show that

$$(I+uu^{T})\Bigg[I-\frac{uu^{T}}{p}+\frac{uu^{T}}{p\sqrt{1+p}}\Bigg]^{2}=I.$$

Using $\|u\|^{2}=u^{T}u=p$ in the expansion of $\Big[I-\frac{uu^{T}}{p}+\frac{uu^{T}}{p\sqrt{1+p}}\Big]^{2}$ gives the result. ∎
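The expansion can be written out explicitly as follows. The shorthand $\alpha$ and $c$ is introduced here only for this worked step and does not appear in the original.

```latex
\text{Let } c = \frac{1}{\sqrt{1+p}}, \qquad
\alpha = -\frac{1}{p} + \frac{1}{p\sqrt{1+p}} = \frac{c-1}{p}.
\\[4pt]
\big[I + \alpha\, uu^{T}\big]^{2}
  = I + \big(2\alpha + \alpha^{2} u^{T}u\big)\, uu^{T}
  = I + \alpha\,(2 + \alpha p)\, uu^{T}
  = I + \frac{(c-1)(c+1)}{p}\, uu^{T}
  = I - \frac{uu^{T}}{1+p}.
\\[4pt]
(I + uu^{T})\Big(I - \frac{uu^{T}}{1+p}\Big)
  = I + uu^{T} - \frac{uu^{T}}{1+p} - \frac{p\, uu^{T}}{1+p}
  = I .
```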

Let $C$ be defined as in (12) and $Y=\sqrt{p+1}[C^{-1/2},-C^{-1/2}u].$ Let the perturbations $d_{n}$ be the columns of $Y.$

Lemma 2.

The perturbations $d_{n}$ chosen as columns of Y satisfy properties P1 and P2.

Proof.

It easily follows from the discussion in section III-B on the construction of this specific perturbation sequence. ∎

In what follows, we prove the almost sure convergence of the iterates in the DSPKW algorithm (Section III-C) under the following assumptions. Note that $\|.\|$ denotes the 2-norm.

A1. The map $J:\mathbb{R}^{p}\rightarrow\mathbb{R}$ is Lipschitz continuous and is differentiable with bounded second order derivatives. Further, the map $L:\mathbb{R}^{p}\rightarrow\mathbb{R}^{p}$ defined as $L(\theta)=-\nabla J(\theta)$ is Lipschitz continuous.

A2. The step-size sequences $a_{n},\delta_{n}>0,\forall n$ satisfy

$$a_{n},\delta_{n}\rightarrow 0,\quad\sum_{n}a_{n}=\infty,\quad\sum_{n}\Big(\frac{a_{n}}{\delta_{n}}\Big)^{2}<\infty.$$

Further, $\frac{a_{j}}{a_{n}}\rightarrow 1$ as $n\rightarrow\infty$, for all $j\in\{n,n+1,n+2,\cdots,n+M\}$ for any given $M>0$, and $b_{n}=\frac{a_{n}}{\delta_{n}}$ is such that $\frac{b_{j}}{b_{n}}\rightarrow 1$ as $n\rightarrow\infty$, for all $j\in\{n,n+1,n+2,\cdots,n+M\}.$

A3. $\max_{n}\|d_{n}\|=K_{0},\max_{n}\|D_{n}\|=K_{1}$.

A4. The iterates $\theta_{n}$ remain uniformly bounded almost surely, i.e., $\sup_{n}\|\theta_{n}\|<\infty,\text{ a.s.}$

A5. The ODE $\dot{\theta}(t)=-\nabla J(\theta(t))$ has a compact set $G\subset\mathbb{R}^{p}$ as its set of asymptotically stable equilibria (i.e., the set of local minima of $J$ is compact).

A6. The sequences $(M_{n}^{+},\mathcal{F}_{n}),(M_{n}^{-},\mathcal{F}_{n}),n\geq 0$ form martingale difference sequences. Further, $(M_{n}^{+},M_{n}^{-},n\geq 0)$ are square integrable random variables satisfying

$$\mathbb{E}\big[\|M_{n+1}^{\pm}\|^{2}\mid\mathcal{F}_{n}\big]\leq K(1+\|\theta_{n}\|^{2})\text{ a.s.},\ \forall n\geq 0,$$

for a given constant $K>0.$
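As an illustration of A2, one standard family of gain sequences that meets all of its requirements is shown below. It is an example chosen here, not necessarily the family used in the paper's experiments.

```latex
a_{n} = \frac{1}{n}, \qquad \delta_{n} = \frac{1}{n^{\gamma}}, \qquad 0 < \gamma < \tfrac{1}{2}, \qquad n \geq 1 .
\\[4pt]
\sum_{n} a_{n} = \sum_{n} \frac{1}{n} = \infty, \qquad
\sum_{n} \Big(\frac{a_{n}}{\delta_{n}}\Big)^{2} = \sum_{n} \frac{1}{n^{2(1-\gamma)}} < \infty
\quad\text{since } 2(1-\gamma) > 1 .
\\[4pt]
\frac{a_{n+k}}{a_{n}} = \frac{n}{n+k} \rightarrow 1, \qquad
\frac{b_{n+k}}{b_{n}} = \Big(\frac{n}{n+k}\Big)^{1-\gamma} \rightarrow 1
\quad\text{as } n \rightarrow \infty, \text{ for each fixed } k \in \{0,\ldots,M\} .
```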

Remark 3.

Assumptions A1, A2 and A5 are standard stochastic approximation conditions. Assumption A3 trivially follows from Remark 1. Assumption A4 is the stability condition on the iterates and holds in many applications [7] (see the discussion in pp 40-41 of [3]). This condition can also be enforced by projecting the iterates into a compact set, however, the iterates converge to a limiting set that contains all possible limit points (see pp.191 in [3]). Assumption A6 gives the condition on the maximum strength of the martingale difference noise under which convergence of the iterates could be ensured and in many stochastic optimization settings this condition could be easily verified using Jensen’s inequality and Lipschitz continuity of $\nabla J$ .

The following two lemmas aid in the proof of almost sure convergence of the iterates in the DSPKW algorithm.

Lemma 3.

Given any fixed integer $P>0$, $\|\theta_{m+k}-\theta_{m}\|\rightarrow 0$ w.p. 1 as $m\rightarrow\infty$, for all $k\in\{1,\cdots,P\}.$

Proof.

Fix a $k\in\{1,\cdots,P\}.$ Now

[TABLE]

where $M_{j+1}=\frac{(M_{j+1}^{+}-M_{j+1}^{-})d_{j}}{2\delta_{j}}$ . Thus,

[TABLE]

Now clearly, $N_{n}=\sum\limits_{j=0}^{n-1}a_{j}M_{j+1},n\geq 1,$ forms a martingale sequence with respect to the filtration $\{\mathcal{F}_{n}\}$ . Further, from the assumption (A6) we have,

[TABLE]

From assumption (A4), the quadratic variation process of $N_{n},n\geq 0$, converges almost surely. By the martingale convergence theorem, $N_{n},n\geq 0$, therefore converges almost surely, and consequently $\|\sum\limits_{j=n}^{n+k-1}a_{j}M_{j+1}\|\rightarrow 0$ almost surely as $n\rightarrow\infty.$ Moreover

[TABLE]

since $\|d_{j}\|\leq K_{0},\forall j\geq 0.$ Note that

[TABLE]

where $\hat{B}$ is the Lipschitz constant of the function $J.$ Hence,

[TABLE]

for $\tilde{B}=\max(|J(0)|,\hat{B}).$ Similarly,

[TABLE]

From assumption (A1), it follows that

[TABLE]

for some $\tilde{K}>0.$ Thus,

$\|\theta_{n+k}-\theta_{n}\|\leq\tilde{K}\sum\limits_{j=n}^{n+k-1}\frac{a_{j}}{2\delta_{j}}+\Big\|\sum\limits_{j=n}^{n+k-1}a_{j}M_{j+1}\Big\|\rightarrow 0$ a.s. as $n\rightarrow\infty,$ proving the lemma. ∎

Lemma 4.

For any $m\geq 0$, $\Big\|\sum\limits_{n=m}^{m+P-1}\frac{a_{n}}{a_{m}}D_{n}\nabla J(\theta_{n})\Big\|$ and $\Big\|\sum\limits_{n=m}^{m+P-1}\frac{b_{n}}{b_{m}}d_{n}J(\theta_{n})\Big\|\rightarrow 0$ almost surely, as $m\rightarrow\infty.$

Proof.

From Lemma 3, it can be seen that $\|\theta_{m+s}-\theta_{m}\|\rightarrow 0$ as $m\rightarrow\infty,$ for all $s=1,\cdots,P.$ Also, from assumption (A1), we have $\|\nabla J(\theta_{m+s})-\nabla J(\theta_{m})\|\rightarrow 0$ as $m\rightarrow\infty,$ for all $s=1,\cdots,P.$ Now from Lemma 2, $\sum\limits_{n=m}^{m+P-1}D_{n}=0$ $\forall m\geq 0.$ Hence $D_{m}=-\sum\limits_{n=m+1}^{m+P-1}D_{n}.$ Consider first

[TABLE]

$\rightarrow 0\text{ a.s. with }n\rightarrow\infty,$ from assumptions (A1) and (A2). Now observe that $\|J(\theta_{m+k})-J(\theta_{m})\|\rightarrow 0$ as $m\rightarrow\infty,$ for all $k\in\{1,\cdots,P\}$ as a consequence of (A1) and Lemma 3. Moreover from $d_{m}=-\sum\limits_{n=m+1}^{m+P-1}d_{n}$ we have

[TABLE]

The claim now follows as a consequence of assumptions (A1) and (A2). ∎
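The two cancellation identities used in this proof, $\sum_{n=m}^{m+P-1}D_{n}=0$ and $\sum_{n=m}^{m+P-1}d_{n}=0$, can be checked numerically on a small example. The sketch below makes two assumptions that are not taken from this section: it takes $D_{n}=d_{n}d_{n}^{\mathsf{T}}-I$ (the matrix is defined earlier in the paper), and it uses a hypothetical cycle of length $P=2p$ made of the vectors $\pm\sqrt{p}\,e_{i}$. This is not the construction proposed in the paper, only one simple sequence that satisfies both identities.

```python
import math


def perturbation_cycle(p):
    """Hypothetical cycle of length 2p: +sqrt(p)*e_i and -sqrt(p)*e_i for each i."""
    scale = math.sqrt(p)
    cycle = []
    for i in range(p):
        for sign in (1.0, -1.0):
            d = [0.0] * p
            d[i] = sign * scale
            cycle.append(d)
    return cycle


def cycle_sums(cycle):
    """Return (sum of d_n, sum of D_n) over one cycle, assuming D_n = d_n d_n^T - I."""
    p = len(cycle[0])
    d_sum = [0.0] * p
    D_sum = [[0.0] * p for _ in range(p)]
    for d in cycle:
        for i in range(p):
            d_sum[i] += d[i]
            for j in range(p):
                identity_ij = 1.0 if i == j else 0.0
                D_sum[i][j] += d[i] * d[j] - identity_ij
    return d_sum, D_sum


d_sum, D_sum = cycle_sums(perturbation_cycle(4))
max_d = max(abs(x) for x in d_sum)
max_D = max(abs(x) for row in D_sum for x in row)
print(max_d, max_D)  # both are zero up to floating-point rounding
```

Because both sums vanish over every window of length $P$, the weighted sums in Lemma 4 differ from zero only through the slowly varying factors $a_{n}/a_{m}$, $b_{n}/b_{m}$ and the small changes in $\theta_{n}$ across the window.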

Finally, using the following theorems, we conclude the analysis by proving the almost sure convergence of the iterates to the set of local minima $G$ of the function $J.$

Theorem 5.

The iterates $\theta_{n},n\geq 0$, obtained from DSPKW-2C satisfy $\theta_{n}\rightarrow G$ almost surely.

Proof.

Note that

[TABLE]

It follows that

[TABLE]

Now the fourth term on the RHS above can be written as

[TABLE]

where $\xi_{n}=o(1)$ from Lemma 4. Thus, the algorithm is asymptotically analogous to

[TABLE]

Hence, from Theorem 2 in Chapter 2 of [borkar2008stochastic], it follows that $\theta_{n},n\geq 0$, converge to the set $G$ of local minima of the function $J.$ ∎
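To make the two-simulation recursion concrete, the following sketch runs a noise-free update of the form $\theta_{n+1}=\theta_{n}-a_{n}\,d_{n}\,\frac{J(\theta_{n}+\delta_{n}d_{n})-J(\theta_{n}-\delta_{n}d_{n})}{2\delta_{n}}$, which matches the structure of the noise term $M_{j+1}$ in Lemma 3. Everything else is an assumption made for illustration: the toy objective, the perturbation cycle $\pm\sqrt{p}\,e_{i}$ (not the paper's construction), and the constants `c` and `B`.

```python
import math


def perturbation(n, p):
    """Hypothetical deterministic direction cycling through +/- sqrt(p) * e_i."""
    i = (n // 2) % p
    sign = 1.0 if n % 2 == 0 else -1.0
    d = [0.0] * p
    d[i] = sign * math.sqrt(p)
    return d


def toy_loss(theta):
    """Toy objective (an assumption): J(theta) = 0.5 * ||theta - 1||^2."""
    return 0.5 * sum((t - 1.0) ** 2 for t in theta)


def two_simulation_search(theta0, n_iter, c=1.0, B=50.0, alpha=0.602, gamma=0.101):
    """Noise-free two-simulation gradient search with deterministic perturbations."""
    theta = list(theta0)
    p = len(theta)
    for n in range(n_iter):
        a_n = 1.0 / (n + B + 1) ** alpha
        delta_n = c / (n + 1) ** gamma
        d = perturbation(n, p)
        plus = toy_loss([t + delta_n * di for t, di in zip(theta, d)])
        minus = toy_loss([t - delta_n * di for t, di in zip(theta, d)])
        # Central-difference estimate of the directional derivative along d.
        scale = a_n * (plus - minus) / (2.0 * delta_n)
        theta = [t - scale * di for t, di in zip(theta, d)]
    return theta


theta_end = two_simulation_search([0.0] * 4, 2000)
err = sum((t - 1.0) ** 2 for t in theta_end)
print(err)  # squared distance to the minimizer (1, ..., 1)
```

For this quadratic toy objective the central difference is exact, so each step moves one coordinate toward the minimizer. The value `B=50` is chosen only so that the first steps are not too large for this example.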

Theorem 6.

The iterates $\theta_{n},n\geq 0$, obtained from DSPKW-1C satisfy $\theta_{n}\rightarrow G$ almost surely.

Proof.

Note that

[TABLE]

It follows that

[TABLE]

Now we observe that the third term on the RHS above is

[TABLE]

where $\xi^{1}_{n}=o(1)$ by Lemma 4. Similarly

[TABLE]

with $\xi^{2}_{n}=o(1)$ by Lemma 4. The rest follows as in Theorem 5. ∎

V Simulation Experiments

In this section, we compare the numerical performance of our DSPKW-2C algorithm against the RDKW algorithm that uses random Bernoulli perturbations and another variant of the RDKW algorithm that uses Hadamard matrix based deterministic perturbations. We refer to them by the acronyms RDKW-2R and RDKW-2H, respectively. In a similar manner, we also compare the DSPKW-1C algorithm against the one-simulation variants RDKW-1R and RDKW-1H. Note that the 2 or 1 in the acronyms of these algorithms denotes the number of simulations used per iteration. The implementation is available at https://github.com/cs1070166/1RDSA-2Cand1RDSA-1C/

V-A Experimental setup

For the empirical performance evaluation, we consider the following two loss functions:

Quadratic loss

[TABLE]

Fourth-order loss

[TABLE]

In the loss functions considered above, we set the dimension $p=10$. We choose $A$ such that $pA$ is an upper triangular matrix with each nonzero entry equal to one, and $b$ is a $p$-dimensional vector of ones. In our experiments, we follow the same noise assumptions considered in [16], i.e., for any $\theta$, the additive noise in the objective is given by $[\theta^{\mathsf{\scriptscriptstyle T}},1]z$ where $z\sim\mathcal{N}(0,\sigma^{2}I_{(p+1)\times(p+1)})$. In all algorithms, we set the step-size schedule as $\delta_{n}=c/(n+1)^{\gamma}$ and $a_{n}=1/(n+B+1)^{\alpha}$ with $\alpha=0.602$ and $\gamma=0.101$. Note that the chosen values for $\alpha$ and $\gamma$ have demonstrated good finite-sample performance empirically, while satisfying the theoretical requirements needed for asymptotic convergence (see [16]). We set the same initial point $\theta_{0}$ for all the algorithms.
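The ingredients of this setup can be sketched in a few lines of Python. The values of `c` and `B` are not given in this section, so the defaults below are placeholders. The sketch also assumes that the upper triangular matrix $pA$ includes the diagonal. The function names are hypothetical and are not taken from the authors' repository.

```python
import random


def make_A(p):
    """Matrix A such that p*A is upper triangular with ones (diagonal included, assumed)."""
    return [[(1.0 / p if j >= i else 0.0) for j in range(p)] for i in range(p)]


def additive_noise(theta, sigma, rng):
    """Noise term [theta^T, 1] z with z ~ N(0, sigma^2 I) of dimension p + 1."""
    z = [rng.gauss(0.0, sigma) for _ in range(len(theta) + 1)]
    return sum(t * zi for t, zi in zip(theta, z[:-1])) + z[-1]


def step_sizes(n, c=1.0, B=50.0, alpha=0.602, gamma=0.101):
    """Return (a_n, delta_n); c and B are placeholder values, not from the paper."""
    a_n = 1.0 / (n + B + 1) ** alpha
    delta_n = c / (n + 1) ** gamma
    return a_n, delta_n


rng = random.Random(0)
A = make_A(10)
b = [1.0] * 10
noise_sample = additive_noise([0.5] * 10, 0.01, rng)
a_0, delta_0 = step_sizes(0)
decay_exponent = 2.0 * (0.602 - 0.101)  # (a_n / delta_n)^2 decays like n^(-1.002)
print(a_0, delta_0, decay_exponent)
```

With these exponents, $(a_{n}/\delta_{n})^{2}$ decays like $n^{-1.002}$, which is summable. This is consistent with a square-summability requirement on $a_{n}/\delta_{n}$ of the kind commonly imposed in such analyses, while $\sum_{n}a_{n}$ still diverges because $\alpha<1$.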

We consider two settings in our experiments. In the first, noise-free setting, we do not add any noise to the objective function evaluations. In the second setting, we corrupt the function evaluations by adding noise (with noise parameter $\sigma=0.01$ as described above). We evaluate the performance of these algorithms using the Normalized Mean Square Error (NMSE) metric. NMSE is defined as the ratio $\left\|\theta_{n_{\text{end}}}-\theta^{*}\right\|^{2}/\left\|\theta_{0}-\theta^{*}\right\|^{2}$, where $\theta_{n_{\text{end}}}$ is the parameter returned by the algorithm.
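The NMSE ratio defined above is straightforward to compute. The helper below is a minimal sketch; the function name `nmse` and the list-based vectors are illustrative choices, not part of the paper's code.

```python
def nmse(theta_end, theta_star, theta_0):
    """Normalized mean square error: ||theta_end - theta*||^2 / ||theta_0 - theta*||^2."""
    num = sum((a - b) ** 2 for a, b in zip(theta_end, theta_star))
    den = sum((a - b) ** 2 for a, b in zip(theta_0, theta_star))
    if den == 0.0:
        raise ValueError("theta_0 must differ from theta_star")
    return num / den


# Starting at distance 2 from the optimum and ending at distance 1 gives 1/4.
example = nmse([1.0, 0.0], [0.0, 0.0], [2.0, 0.0])
print(example)  # 0.25
```

A value below one means the returned parameter is closer to $\theta^{*}$ than the initial point, and a value of zero means the optimum was reached exactly.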

V-B Discussion of Results

The performance comparisons of all the algorithms based on NMSE values are summarized in Tables I, II, III and IV. In the tables, we have highlighted the algorithm that has the minimum NMSE. We summarize our findings:

• Even in the absence of noise, the random directions chosen by the RDKW-2R and RDKW-1R algorithms make their standard deviation significantly higher than that of their deterministic counterparts.

• The quality of the solution (as characterized by the standard deviation) is significantly better for the proposed deterministic perturbations than for the existing Hadamard-based deterministic perturbations and random perturbations. Note, however, that we do not compare two-simulation algorithms with one-simulation algorithms.

• In the case of two-simulation algorithms (see Tables I and II), DSPKW-2C performs marginally better than RDKW-2H, while both of them outperform RDKW-2R significantly.

• In the case of one-simulation algorithms (see Tables III and IV), DSPKW-1C performs better than both RDKW-1H and RDKW-1R.

VI Conclusions

We have generalized the deterministic perturbation sequences obtained from lexicographical ordering and Hadamard matrix based constructions for the RDKW algorithm, and presented a novel construction of deterministic perturbations that has the least cycle length within the class of deterministic perturbation sequences. Further, we have proved the almost sure convergence of the iterates for this class of deterministic perturbation sequences. Now that the class of deterministic perturbation sequences has been characterized, an interesting direction for future work is to study theoretically the rate of convergence of deterministic perturbation algorithms and compare it against that of their random perturbation counterparts. A challenging future direction would be to study the asymptotic normality or weak convergence of the iterates. It would also be interesting to similarly characterize the class of deterministic perturbation sequences for the SPSA algorithm.

Bibliography

[1] J. Kiefer and J. Wolfowitz, “Stochastic estimation of the maximum of a regression function,” Ann. Math. Statist., vol. 23, no. 3, pp. 462–466, Sep. 1952. [Online]. Available: http://dx.doi.org/10.1214/aoms/1177729392
[2] S. Bhatnagar, H. L. Prasad, and L. A. Prashanth, Stochastic Recursive Algorithms for Optimization: Simultaneous Perturbation Methods (Lecture Notes in Control and Information Sciences). Springer, 2013, vol. 434.
[3] H. J. Kushner and D. S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer Verlag, 1978.
[4] Y. M. Ermol’ev, “On the method of generalized stochastic gradients and quasi-Fejér sequences,” Cybernetics, vol. 5, no. 2, pp. 208–220, 1969.
[5] M. Styblinski and T.-S. Tang, “Experiments in nonconvex optimization: stochastic approximation with function smoothing and simulated annealing,” Neural Networks, vol. 3, no. 4, pp. 467–483, 1990.
[6] L. Prashanth, S. Bhatnagar, M. Fu, and S. Marcus, “Adaptive system optimization using random directions stochastic approximation,” IEEE Transactions on Automatic Control, vol. 62, no. 5, pp. 2223–2238, 2017.
[7] J. C. Spall, “Multivariate stochastic approximation using a simultaneous perturbation gradient approximation,” IEEE Trans. Auto. Cont., vol. 37, no. 3, pp. 332–341, 1992.
[8] D. C. Chin, “Comparative study of stochastic algorithms for system optimization based on gradient approximations,” IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 27, no. 2, pp. 244–249, 1997.
