Rapidly Adapting Moment Estimation

Guoqiang Zhang; Kenta Niwa; W. Bastiaan Kleijn

arXiv:1902.09030 · cs.LG · February 26, 2019

TL;DR

This paper introduces RAME, a new adaptive gradient method that uses the most recent first moment of gradients to compute learning rates, aiming to improve convergence speed and generalization in deep neural network training.

Contribution

RAME is a novel adaptive gradient method that leverages the latest first moment of gradients, differing from existing methods that focus on the second moment.

Findings

1. RAME shows faster convergence compared to Adam and RMSprop.

2. RAME achieves comparable or better generalization performance.

3. Theoretical convergence of deterministic RAME is established.

Equations

$$\boldsymbol{x}^{*}=\arg\min_{\boldsymbol{x}\in\mathbb{R}^{d}}f(\boldsymbol{x})=\arg\min_{\boldsymbol{x}\in\mathbb{R}^{d}}\sum_{i=1}^{k}f_{i}(\boldsymbol{x}),$$

$$\boldsymbol{h}(\boldsymbol{m}_{t})=\frac{\boldsymbol{m}_{t}}{|\boldsymbol{m}_{t}|^{q}+\xi},$$

$$\boldsymbol{x}_{t}=\boldsymbol{x}_{t-1}-\eta_{t}\cdot\textrm{sign}(\boldsymbol{m}_{t})\odot|\boldsymbol{m}_{t}|^{1-q},$$

$$(\boldsymbol{x}_{t+1}-\boldsymbol{x}_{t})\odot|\boldsymbol{x}_{t+1}-\boldsymbol{x}_{t}|^{q/(1-q)}=-\alpha_{t}\boldsymbol{g}_{t}+\beta_{t}(\boldsymbol{x}_{t}-\boldsymbol{x}_{t-1})\odot|\boldsymbol{x}_{t}-\boldsymbol{x}_{t-1}|^{q/(1-q)},$$

$$\tilde{\boldsymbol{m}}_{t}=-(\boldsymbol{x}_{t+1}-\boldsymbol{x}_{t})\odot|\boldsymbol{x}_{t+1}-\boldsymbol{x}_{t}|^{q/(1-q)},\quad t\geq 0.$$

$$\tilde{\boldsymbol{m}}_{t}=\beta_{t}\tilde{\boldsymbol{m}}_{t-1}+\alpha_{t}\boldsymbol{g}_{t},\quad t\geq 1,$$

$$|\boldsymbol{x}_{t+1}-\boldsymbol{x}_{t}|=|\tilde{\boldsymbol{m}}_{t}|^{1-q}.$$

$$\boldsymbol{x}_{t+1}=\boldsymbol{x}_{t}-\frac{\tilde{\boldsymbol{m}}_{t}}{|\tilde{\boldsymbol{m}}_{t}|^{q}}.$$

$$f(\boldsymbol{y})\leq f(\boldsymbol{x})+\langle\nabla f(\boldsymbol{x}),\boldsymbol{y}-\boldsymbol{x}\rangle+\frac{L}{2}\|\boldsymbol{y}-\boldsymbol{x}\|^{2}.$$

$$\min_{t=2,\ldots,T+1}\|\nabla f(\boldsymbol{x}_{t})\|_{2}\leq\epsilon.$$

$$f(\boldsymbol{x}_{t+1})-f(\boldsymbol{x}_{t})\leq\langle\nabla f(\boldsymbol{x}_{t}),\boldsymbol{x}_{t+1}-\boldsymbol{x}_{t}\rangle+\frac{L}{2}\|\boldsymbol{x}_{t+1}-\boldsymbol{x}_{t}\|_{2}^{2}\overset{(a)}{=}\eta_{t}\left(-\left\langle\boldsymbol{g}_{t},\frac{\boldsymbol{m}_{t}}{|\boldsymbol{m}_{t}|^{q}+\xi}\right\rangle+\frac{L\eta_{t}}{2}\left\|\frac{\boldsymbol{m}_{t}}{|\boldsymbol{m}_{t}|^{q}+\xi}\right\|_{2}^{2}\right),$$

$$\eta_{t}=\eta_{t}^{*}=\frac{1}{2}\cdot\frac{\left\langle\boldsymbol{g}_{t},\frac{\boldsymbol{m}_{t}}{|\boldsymbol{m}_{t}|^{q}+\xi}\right\rangle}{\frac{L}{2}\left\|\frac{\boldsymbol{m}_{t}}{|\boldsymbol{m}_{t}|^{q}+\xi}\right\|^{2}},$$

$$f(\boldsymbol{x}_{t+1})-f(\boldsymbol{x}_{t})\leq-\frac{1}{2L}\cdot\frac{\left\langle\boldsymbol{g}_{t},\frac{\boldsymbol{m}_{t}}{|\boldsymbol{m}_{t}|^{q}+\xi}\right\rangle^{2}}{\left\|\frac{\boldsymbol{m}_{t}}{|\boldsymbol{m}_{t}|^{q}+\xi}\right\|_{2}^{2}},$$

$$\boldsymbol{m}_{t}=\alpha\sum_{k=1}^{t}\beta^{t-k}\boldsymbol{g}_{k}.$$

$$\|\boldsymbol{m}_{t}\|_{\infty}\leq\frac{\alpha\sigma(1-\beta^{t})}{1-\beta}.$$

$$\lambda_{\max}\big(\textrm{diag}(|\boldsymbol{m}_{t}|^{q}+\xi)\big)\leq\frac{\alpha^{q}\sigma^{q}(1-\beta^{t})^{q}}{(1-\beta)^{q}}+\xi$$

$$\lambda_{\min}\big(\textrm{diag}(|\boldsymbol{m}_{t}|^{q}+\xi)\big)\geq\xi.$$

$$\left\|\frac{\boldsymbol{m}_{t}}{|\boldsymbol{m}_{t}|^{q}+\xi}\right\|_{2}\leq\left\|\frac{\boldsymbol{m}_{t}}{|\boldsymbol{m}_{t}|^{q}}\right\|_{1}\leq\left\||\boldsymbol{m}_{t}|^{1-q}\right\|_{1}\leq\frac{d\,\alpha^{1-q}\sigma^{1-q}(1-\beta^{t})^{1-q}}{(1-\beta)^{1-q}},$$

Full text

Rapidly Adapting Moment Estimation

Guoqiang Zhang, Kenta Niwa and W. B. Kleijn

G. Zhang is with the School of Electrical and Data Engineering, University of Technology, Sydney, Australia. K. Niwa is with Nippon Telegraph and Telephone (NTT) Corporation, Japan. W. B. Kleijn is with the School of Engineering and Computer Science, Victoria University of Wellington, New Zealand.

Abstract

Adaptive gradient methods such as Adam have been shown to be very effective for training deep neural networks (DNNs) by tracking the second moment of gradients to compute the individual learning rates. Differently from existing methods, we make use of the most recent first moment of gradients to compute the individual learning rates per iteration. The motivation behind it is that the dynamic variation of the first moment of gradients may provide useful information to obtain the learning rates. We refer to the new method as the rapidly adapting moment estimation (RAME). The theoretical convergence of deterministic RAME is studied by using an analysis similar to the one used in [1] for Adam. Experimental results for training a number of DNNs show promising performance of RAME w.r.t. the convergence speed and generalization performance compared to the stochastic heavy-ball (SHB) method, Adam, and RMSprop.

Index Terms:

Adaptive gradient, stochastic heavy-ball method, Adam, RMSprop.

I Introduction

Stochastic gradient descent (SGD) and its variants have been widely applied in deep learning due to their simplicity and effectiveness [2]. Vanilla SGD (i.e., without making use of the gradient trajectory) often works reasonably well given enough time if the learning rate is set properly in a dynamical manner over the training iterations. Generally speaking, the historical gradients of SGD carry information about the local problem structure, such as curvature and individual noise levels of current gradient coordinates. Therefore, it is natural to exploit historical gradients to assist the current parameter update for fast convergence.

In the literature, significant progress has been achieved on making use of historical gradients to accelerate vanilla SGD. Suppose the objective function $f(\boldsymbol{x})$ is differentiable. In 1964, Polyak proposed the so-called heavy-ball (HB) method for minimizing the objective function [3], which is given by

[TABLE]

where $\nabla f(\boldsymbol{x}_{t-1})$ is the gradient at $\boldsymbol{x}_{t-1}$, and $\alpha_{t}$ (or $\eta_{t}$) is the common learning rate for all the coordinates of $\boldsymbol{x}_{t}$. (Footnote 1: The Keras platform treats $\alpha_{t}$ as the learning rate and sets $\eta_{t}=1$, while PyTorch takes $\eta_{t}$ as the learning rate and sets $\alpha_{t}=1$.) Later, in 1983, Nesterov proposed a method to further accelerate HB by making use of the first moments in a smart way [4, 5, 6], which is known as Nesterov's accelerated gradient (NAG). Considering HB, we note from (2) that $\boldsymbol{x}_{t}$ is updated as a linear function of the first moment $\boldsymbol{m}_{t}$. To the best of our knowledge, there is no prior work on designing a nonlinear function of the first moment $\boldsymbol{m}_{t}$ for a more effective parameter update. In this work, we attempt to do so, where the nonlinearity of $\boldsymbol{m}_{t}$ is interpreted as a form of individual learning rates as opposed to the common learning rate $\alpha_{t}$ (or $\eta_{t}$).

In the last decade, research on computing proper individual learning rates for $\boldsymbol{x}_{t}$ in SGD has made considerable progress. Duchi et al. [7], in 2011, were the first to propose tracking the second moment of the gradients. The resulting method, AdaGrad, computes the individual learning rates based on the tracked information. It is found that AdaGrad converges fast when the gradients are sparse. Following the work of [7], various adaptive gradient methods have been proposed for computing more effective individual learning rates. The methods include, for example, RMSprop [8], Adam [9], NAdam [10], AMSGrad [11], and PAdam [12]. We note that all the above methods need to track a certain form of the second moment of gradients.

While deep learning has seen rapid advances in algorithmic development, theoretical convergence analysis has also made remarkable progress recently. The work of [11] showed that Adam does not converge for a special class of convex optimization problems. The authors of [1] studied the convergence of Adam and RMSprop for smooth nonconvex optimization. [13] and [14] also considered smooth nonconvex optimization. In particular, [13] analyzed the convergence of PAdam while [14] considered AMSGrad and a variant of AdaGrad. From a high level point of view, analysis of nonconvex optimization is highly valuable in practice as training a deep neural network (DNN) is well known to be a nonconvex optimization problem.

In this work, we propose a new adaptive gradient method based on a novel design principle. In the new method, the individual learning rates are computed by using only the most recent first moment. By doing so, the method is able to react rapidly to the dynamic variation of the first moment, which is why it is referred to as rapidly adapting moment estimation (RAME). Our motivation for the new algorithm is the hypothesis that the first moment may already carry enough information for the learning-rate computation. If the first moment is available, it may not be necessary to compute the second moment, which saves memory equal to the size of the DNN model.

As summarized in Alg. 1, RAME is designed by using a nonlinear function $\boldsymbol{m}_{t}/(|\boldsymbol{m}_{t}|^{q}+\xi)$ of the first moment $\boldsymbol{m}_{t}$ for the parameter update. The nonlinear function makes the heavy-ball (HB) method less heavy. With the expression $1/(|\boldsymbol{m}_{t}|^{q}+\xi)$, the coordinates of $\boldsymbol{m}_{t}$ with large magnitudes receive small learning rates while those with small magnitudes receive relatively large learning rates. To better understand the impact of the expression $1/(|\boldsymbol{m}_{t}|^{q}+\xi)$, we reformulate and interpret the update expressions of RAME from a dynamic system perspective. Its convergence is studied by using an analysis that is similar to that in [1] for deterministic Adam.

We evaluate RAME together with stochastic HB (SHB), Adam, and RMSprop for both classification and regression problems in deep learning. Specifically, four classification tasks are investigated, which are training VGG16 [15] for CIFAR10 and CIFAR100, training ResNet20 [16] for CIFAR10, and training a multilayer perceptron (MLP) network for CIFAR10. As for regression, we conduct semantic segmentation of people using ResNet152 as the backend [16, 17]. The Microsoft COCO database is employed to train the neural network. The convergence results obtained from the above tasks show that RAME produces either better or equivalent validation performance compared to the other three methods.

The remainder of the paper is organized as follows: Section II introduces notation and defines the optimization problem. Section III is devoted to the new method RAME. In Section IV, we provide a new interpretation of RAME from a dynamic system viewpoint. Section V presents the algorithmic convergence analysis. Experimental results are then described in Section VI, followed by conclusions in Section VII.

II Notations and Problem Definition

We first introduce the notation used in the remainder of the paper. We use bold small letters to denote vectors and bold capital letters to denote matrices. Given a vector $\boldsymbol{x}\in\mathbb{R}^{d}$, we denote its $l_{1}$, $l_{2}$ and $l_{\infty}$ norms as $\|\boldsymbol{x}\|_{1}=\sum_{i=1}^{d}|x_{i}|$, $\|\boldsymbol{x}\|_{2}=\sqrt{\sum_{i=1}^{d}x_{i}^{2}}$ and $\|\boldsymbol{x}\|_{\infty}=\max_{i=1}^{d}|x_{i}|$, respectively. We write the vector obtained by computing the absolute value per coordinate of $\boldsymbol{x}$ as $|\boldsymbol{x}|$. The operation $\textrm{diag}(\boldsymbol{x})$ denotes a diagonal matrix with $\boldsymbol{x}$ on its diagonal. Given two vectors $\boldsymbol{x},\boldsymbol{y}\in\mathbb{R}^{d}$, $\boldsymbol{x}\odot\boldsymbol{y}$ and $\boldsymbol{x}/\boldsymbol{y}$ represent element-wise vector multiplication and division, respectively. The operation $\langle\boldsymbol{x},\boldsymbol{y}\rangle$ denotes the inner product of the two vectors. For a matrix $\boldsymbol{M}\in\mathbb{R}^{d\times d}$, we use $\lambda_{\max}(\boldsymbol{M})$ and $\lambda_{\min}(\boldsymbol{M})$ to denote the largest and smallest singular values of $\boldsymbol{M}$, respectively.

We attempt to solve the following minimization problem of a finite functional sum

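$$\boldsymbol{x}^{*}=\arg\min_{\boldsymbol{x}\in\mathbb{R}^{d}}f(\boldsymbol{x})=\arg\min_{\boldsymbol{x}\in\mathbb{R}^{d}}\sum_{i=1}^{k}f_{i}(\boldsymbol{x}),\tag{3}$$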

where the $k$ functions $\{f_{i}\}_{i=1}^{k}$ are assumed to be continuously differentiable. In practice, the vector $\boldsymbol{x}$ can be taken as representing the weights of a DNN. Each function $f_{i}$ in (3) can be considered to be constructed from a minibatch of training samples. In total, the $k$ functions cover all the training samples. At each iteration during the optimization procedure, one can either randomly select a function for computation or follow a predefined order from $\{f_{i}\}_{i=1}^{k}$ . The above minibatch-based scheme makes it possible to minimize the overall function $f(\boldsymbol{x})$ under the condition of an extremely large number of training samples and limited computational resources in practice.
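The two selection schemes mentioned above (a random draw versus a predefined order) can be sketched on a toy finite sum. The least-squares form of each $f_{i}$, the random data, and the helper names `grad_fi`, `grad_f` and `pick_index` are illustrative assumptions and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, batch = 4, 3, 5                      # k minibatch functions, d parameters (toy sizes)
A = rng.normal(size=(k, batch, d))         # hypothetical minibatch inputs
b = rng.normal(size=(k, batch))            # hypothetical minibatch targets

def grad_fi(x, i):
    """Gradient of the assumed minibatch loss f_i(x) = 0.5 * ||A_i x - b_i||^2."""
    return A[i].T @ (A[i] @ x - b[i])

def grad_f(x):
    """Gradient of the overall objective f(x) = sum_i f_i(x)."""
    return sum(grad_fi(x, i) for i in range(k))

def pick_index(t, random_order=True):
    """Select which f_i is used at iteration t: random draw or fixed cyclic order."""
    return int(rng.integers(k)) if random_order else t % k

x = np.zeros(d)
cyclic = [pick_index(t, random_order=False) for t in range(2 * k)]
```

Using `grad_fi` with `pick_index` gives the stochastic (minibatch) setting, while `grad_f` corresponds to the deterministic setting in which the overall $f(\boldsymbol{x})$ is used at every iteration.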

III Rapidly Adapting Moment Estimation

III-A On effectiveness of HB

In this subsection, we first briefly present the empirical results collected in [18] by analyzing vanilla SGD. We then study the effectiveness of HB by drawing connections between its update expressions and the observations made in [18].

The recent work [18] investigates the performance of vanilla SGD by testing various setups of the learning rates along different curvature directions. At every iteration, the Hessian matrix is computed in addition to the gradient vector. The sharp curvature directions are then identified as the eigenvectors of the Hessian matrix with large eigenvalues. The model parameters are updated by first projecting the gradient vector along the eigenvectors and then setting individual learning rates along the projections. It is found that faster convergence and better generalization performance can be achieved by setting smaller learning rates for the sharp curvature directions than for the flat directions. That is, it is preferable to suppress the contributions from the sharp curvature directions and enhance those from the remaining directions. The above observations are reasonable, as sharp curvature directions lead to a high probability of missing the local minima if their learning rates are not set small.

In practice, it is rather expensive to compute the Hessian matrix. The HB method captures information of the functional curvature by tracking the first moment $\boldsymbol{m}_{t}$ over iterations. Since $\boldsymbol{m}_{t}$ is computed as a weighted average of the past gradients, it is natural that the gradient elements having roughly the same directions across iterations, which correspond to flat curvature directions, would be enhanced. In contrast, the gradient elements with varying directions across iterations due to sharp curvatures would be suppressed in the computation of $\boldsymbol{m}_{t}$. As a result, when performing the parameter update, HB implicitly sets smaller learning rates for the sharp curvature directions than for the flat curvature directions, as suggested by the recent work [18].

We note that the effectiveness of HB can be pushed to a higher level in different ways. It is known that the NAG method accelerates HB by constructing a different linear function of the first moment and gradient in the parameter-update. On the other hand, existing adaptive gradient methods such as Adam modify HB by introducing individual learning rates in addition to the common learning rate $\alpha_{t}$ or $\eta_{t}$ in (1)-(2). By doing so, these methods receive more algorithmic flexibility than HB, leading to a more effective parameter-update. In this work, we intend to construct and apply a nonlinear function of the first moment in the parameter-update of HB, as will be discussed later on.

III-B Revisiting Adam

Currently, Adam [9] is probably the most popular adaptive gradient method in the deep learning community, of which the update expressions can be written as

[TABLE]

where $0<\beta_{1},\beta_{2}<1$ , and $f_{t_{i}}$ represents the function being selected from the $k$ functions in (3) at iteration $t$ . The parameter $\xi>0$ in (6) is introduced to avoid division by zero. The parameter $\alpha_{t}$ is the common learning rate while $1/(\sqrt{\boldsymbol{v}_{t}}+\xi)$ represents the individual learning rates.

Eq. (5) indicates that the second moment $\boldsymbol{v}_{t}$ is obtained from the moving average of squared gradients. That is, only the magnitude information of gradients is reflected in the second moment. With the computation of $1/(\sqrt{\boldsymbol{v}_{t}}+\xi)$, the gradient elements with large magnitudes across iterations would lead to small learning rates. On the other hand, those with small magnitudes would receive large learning rates and tend to be aggressive when updating their corresponding coordinates of $\boldsymbol{x}$. This allows Adam to adjust the individual learning rates in a self-adaptive manner.

Finally, it is clear that the first and second moments of Adam carry different dynamic variations of gradients over iterations. The first moment takes the sign of gradients into consideration, which is missing in the second moment. One natural research question is whether the first moment itself can be used for learning-rate computation. Usage of the second moment might not be the only approach to compute the individual learning rates.

Remark 1.

The method RMSprop [8] can be taken as a special case of Adam by letting $\beta_{1}=0$ in (4). That is, only the second moment is computed for the learning-rate computation.
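To make the roles of the two moments concrete, the snippet below sketches one Adam-style step and the RMSprop special case from Remark 1. It follows the commonly used Adam recursion without bias correction; whether this matches the exact form of (4)-(6) in the paper is an assumption, and the function names and default values are illustrative.

```python
import numpy as np

def adam_step(x, m, v, g, alpha=1e-3, beta1=0.9, beta2=0.999, xi=1e-7):
    """One Adam-style step (no bias correction; an assumption of this sketch)."""
    m = beta1 * m + (1.0 - beta1) * g          # first moment: keeps the gradient sign
    v = beta2 * v + (1.0 - beta2) * g * g      # second moment: magnitude information only
    x = x - alpha * m / (np.sqrt(v) + xi)      # individual learning rates 1/(sqrt(v) + xi)
    return x, m, v

def rmsprop_step(x, v, g, alpha=1e-3, beta2=0.999, xi=1e-7):
    """RMSprop viewed as the beta1 = 0 case: the first moment reduces to g."""
    x, _, v = adam_step(x, np.zeros_like(x), v, g, alpha, beta1=0.0, beta2=beta2, xi=xi)
    return x, v
```

Flipping the sign of `g` leaves `v` unchanged but flips `m`, which illustrates the point that the sign information is only present in the first moment.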

III-C Algorithm design

Differently from the design strategies of existing adaptive gradient methods, we attempt to make the HB method less aggressive by introducing a nonlinear function of the first moment in the parameter-update. In particular, we design the update expressions of the new method RAME to be

[TABLE]

where $(\alpha_{t},\eta_{t})$ are inherited from (1)-(2), and $\boldsymbol{h}(\boldsymbol{m}_{t})$ is a $d$ -dimensional nonlinear function of $\boldsymbol{m}_{t}$ , given by

$$\boldsymbol{h}(\boldsymbol{m}_{t})=\frac{\boldsymbol{m}_{t}}{|\boldsymbol{m}_{t}|^{q}+\xi},\tag{9}$$

where $\xi\geq 0$ and $1>q\geq 0$. The upper bound $1>q$ is imposed because, when $q=1$ and $\xi=0$, the magnitude of $\boldsymbol{m}_{t}$ is cancelled in computing $\boldsymbol{x}_{t}$, which is undesirable. The update expressions (7)-(9) are for minibatch-based DNN training. At each iteration, one individual function is selected from the total $k$ functions for the parameter update. When the overall $f(\boldsymbol{x})$ is considered per iteration, RAME becomes deterministic, which is summarized in Alg. 1.

The nonlinear function $\boldsymbol{h}(\boldsymbol{m}_{t})$ ensures that the components of $\boldsymbol{m}_{t}$ with large magnitudes receive smaller learning rates, thus making RAME less aggressive than HB. The motivation behind this modification is that the parameters $\{\beta_{t}\}$ of HB are usually set to be close to 1 while the $\{\alpha_{t}\}$ form a decreasing sequence in practice (see [19] for an example). In this situation, the individual learning rates $1/(|\boldsymbol{m}_{t}|^{q}+\xi)$ make it easier for $\boldsymbol{m}_{t}$ to capture the local functional structure around $\boldsymbol{x}_{t-1}$ .
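A minimal NumPy sketch of one RAME step next to one HB step may help here. The recursion `m = beta * m + alpha * g` followed by `x = x - eta * h(m)` is inferred from the surrounding text (RAME inherits $(\alpha_{t},\eta_{t})$ from HB and applies $\boldsymbol{h}(\boldsymbol{m}_{t})$ in the parameter update); it is not quoted from Alg. 1, and the function names and default values are hypothetical.

```python
import numpy as np

def rame_step(x, m, g, alpha=0.01, beta=0.9, eta=1.0, q=0.25, xi=1e-8):
    """One RAME-style step; the two-line recursion is inferred, not quoted."""
    m = beta * m + alpha * g                       # first moment, as in HB
    x = x - eta * m / (np.abs(m) ** q + xi)        # nonlinear function h(m_t)
    return x, m

def hb_step(x, m, g, alpha=0.01, beta=0.9, eta=1.0):
    """Heavy-ball step for comparison: x moves linearly in m."""
    m = beta * m + alpha * g
    return x - eta * m, m
```

With `q=0.0` and `xi=0.0` the denominator equals one, so `rame_step` coincides with `hb_step`. For `q > 0`, coordinates of `m` with larger magnitude get a smaller factor `1 / (|m|**q + xi)`.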

Conceptually speaking, RAME utilises the dynamics of gradient information to compute the individual learning rates while Adam employs the dynamics of gradient-magnitude information. We note that the results of [18] on the Hessian do not suggest but also do not preclude a relation between the gradient-magnitude information and the optimal individual learning rates. The gradient information may also be a good candidate for computing the individual learning rates.

One common property of RAME and Adam (with fixed $\beta_{2}$ parameter in (5)) is that the individual learning rates of both methods do not decrease monotonically over iterations, which makes it challenging for convergence analysis. In contrast, the three adaptive gradient methods AMSGrad, PAdam and AdaGrad from literature are designed to ensure the property of monotonically decreasing individual learning rates. We note that, at the moment, Adam has gained more popularity than the above three methods for training various DNN models. It might be the non-monotonicity property of the individual learning rates in Adam that makes it remarkably effective. The above hypothesis provides one motivation in designing RAME in this work.

III-D Implementation for different setups of $\xi$

In this subsection, we study the implementation of RAME. Depending on the parameter $\xi$ , the computation for $\boldsymbol{x}_{t}$ can be implemented in different ways. When $\xi>0$ , each coordinate of $|\boldsymbol{m}_{t}|^{q}+\xi$ in the denominator is nonzero. In this case, $\boldsymbol{x}_{t}$ can be computed in a traditional manner without worrying about zero-division.

We now consider the setup $\xi=0$ . As $\boldsymbol{m}_{t}$ is obtained by a weighted summation of the past gradients up to iteration $t$ , it may happen that certain coordinates of $|\boldsymbol{m}_{t}|^{q}$ are zero. To avoid zero-division, we can simply combine $\boldsymbol{m}_{t}$ and $|\boldsymbol{m}_{t}|^{q}$ in (9) when updating $\boldsymbol{x}_{t}$ . That is, $\boldsymbol{x}_{t}$ can be computed as

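$$\boldsymbol{x}_{t}=\boldsymbol{x}_{t-1}-\eta_{t}\cdot\textrm{sign}(\boldsymbol{m}_{t})\odot|\boldsymbol{m}_{t}|^{1-q},\tag{10}$$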

where the operator $\textrm{sign}(\cdot)$ computes the sign of the vector.

It is worth pointing out that Adam and other existing adaptive gradient methods do not allow the special setup $\xi=0$. This is because the dynamics of the second moment $\boldsymbol{v}_{t}$ are different from those of $\boldsymbol{m}_{t}$ or $\boldsymbol{g}_{t}$. They cannot be combined in a similar manner to (10).
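The division-free form for $\xi=0$ can be sketched as follows. The function name `rame_update_xi0` and the sample values are illustrative assumptions; the formula itself is the sign-and-power expression for $\boldsymbol{x}_{t}$ given above.

```python
import numpy as np

def rame_update_xi0(x_prev, m, eta=1.0, q=0.25):
    """x_t = x_{t-1} - eta * sign(m) * |m|^(1-q); no division, so m = 0 is safe."""
    return x_prev - eta * np.sign(m) * np.abs(m) ** (1.0 - q)

m = np.array([0.5, -0.02, 0.0])      # includes a zero coordinate on purpose
x = rame_update_xi0(np.zeros(3), m)
```

On the nonzero coordinates this agrees with $\boldsymbol{m}_{t}/|\boldsymbol{m}_{t}|^{q}$, and the zero coordinate simply produces no update instead of a division by zero.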

IV A Different Perspective of the Update Expressions of RAME

In this section, we study deterministic RAME in Alg. 1 under the setup $\{(\eta_{t},\xi)=(1,0)\}$ from a different perspective. To do so, we first revisit an alternative representation of the update expressions of HB under $\{\eta_{t}=1\}$ . Based on the observations for HB, we then study the update expressions of RAME from a different point of view.

IV-A Revisiting HB under $\{\eta_{t}=1\}$

It can be shown that the update expressions (1)-(2) of HB under the setup $\{\eta_{t}=1\}$ can be alternatively represented as [3, 6]

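$$\boldsymbol{x}_{t+1}-\boldsymbol{x}_{t}=-\alpha_{t}\boldsymbol{g}_{t}+\beta_{t}(\boldsymbol{x}_{t}-\boldsymbol{x}_{t-1}),\tag{11}$$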

where $\alpha_{t}>0$ and $0\leq\beta_{t}<1$ . It is clear from (11) that the update of $\boldsymbol{x}_{t+1}$ consists of two contributions: one from the current gradient $\boldsymbol{g}_{t}$ and the other from the most recent steering vector $(\boldsymbol{x}_{t}-\boldsymbol{x}_{t-1})$ . In practice, the parameter $\alpha_{t}$ decreases over $t$ while $\{\beta_{t}\}$ are usually set to be close to 1. Therefore, as the iteration index $t$ increases, the steering vector $(\boldsymbol{x}_{t}-\boldsymbol{x}_{t-1})$ has an increasing impact on $\boldsymbol{x}_{t+1}$ compared to the gradient $\boldsymbol{g}_{t}$ . The method name “heavy-ball” indicates that the update $\boldsymbol{x}_{t+1}$ is strongly affected by the most recent steering vector $(\boldsymbol{x}_{t}-\boldsymbol{x}_{t-1})$ .

Algebraically speaking, (11) can be viewed as a dynamic system describing the evolution of the steering vectors $\{\boldsymbol{x}_{i+1}-\boldsymbol{x}_{i}|i=0,1,\ldots\}$ over iterations. $\{\beta_{t}\}$ are the damping scalars penalizing old steering vectors when computing new ones.
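The equivalence between the moment form of HB and its steering-vector form can be checked numerically. The sketch below assumes constant $(\alpha,\beta)$, $\eta_{t}=1$, a zero initial moment (equivalently, a zero initial steering vector), and a toy quadratic objective; all function names are hypothetical.

```python
import numpy as np

def grad(x):
    """Gradient of a toy quadratic f(x) = 0.5 * (x_1^2 + 10 * x_2^2) (illustrative)."""
    return np.array([1.0, 10.0]) * x

def hb_moment_form(x0, alpha, beta, steps):
    """HB written with an explicit first moment m and eta_t = 1."""
    x, m, xs = x0.copy(), np.zeros_like(x0), [x0.copy()]
    for _ in range(steps):
        m = beta * m + alpha * grad(x)
        x = x - m
        xs.append(x.copy())
    return np.array(xs)

def hb_steering_form(x0, alpha, beta, steps):
    """HB written only in terms of iterates and the last steering vector."""
    x_prev, x, xs = x0.copy(), x0.copy(), [x0.copy()]
    for _ in range(steps):
        x_next = x - alpha * grad(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
        xs.append(x.copy())
    return np.array(xs)
```

Both functions return the same trajectory, since the steering vector $\boldsymbol{x}_{t}-\boldsymbol{x}_{t-1}$ equals the negative of the previous moment when $\eta_{t}=1$.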

IV-B Deterministic RAME under $\{(\eta_{t},\xi)=(1,0)\}$

Thus far, we have briefly studied HB from a dynamic system point of view. In this subsection, we reconsider deterministic RAME also from a dynamic system perspective. To do so, we set $\{(\eta_{t},\xi)=(1,0)\}$ in Alg. 1 for RAME.

We first reformulate the update expressions of deterministic RAME in a similar manner as that of HB, which is presented in a proposition below:

Proposition 1.

Let $\{(\eta_{t},\xi)=(1,0)\}$ in Alg. 1. The update expressions of deterministic RAME can then be reformulated as

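$$(\boldsymbol{x}_{t+1}-\boldsymbol{x}_{t})\odot|\boldsymbol{x}_{t+1}-\boldsymbol{x}_{t}|^{q/(1-q)}=-\alpha_{t}\boldsymbol{g}_{t}+\beta_{t}(\boldsymbol{x}_{t}-\boldsymbol{x}_{t-1})\odot|\boldsymbol{x}_{t}-\boldsymbol{x}_{t-1}|^{q/(1-q)},\tag{12}$$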

where $0\leq q<1$ , and the iteration index $t\geq 1$ .

Proof.

We show that (12) can be transformed to the update expressions presented in Alg. 1 under the setup $\{(\eta_{t},\xi)=(1,0)\}$ . Define $\tilde{\boldsymbol{m}}_{t}$ to be

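$$\tilde{\boldsymbol{m}}_{t}=-(\boldsymbol{x}_{t+1}-\boldsymbol{x}_{t})\odot|\boldsymbol{x}_{t+1}-\boldsymbol{x}_{t}|^{q/(1-q)},\quad t\geq 0.\tag{13}$$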

It is straightforward from (12) that the sequence $\{\tilde{\boldsymbol{m}}_{t}\}$ can be computed recursively as

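$$\tilde{\boldsymbol{m}}_{t}=\beta_{t}\tilde{\boldsymbol{m}}_{t-1}+\alpha_{t}\boldsymbol{g}_{t},\quad t\geq 1,\tag{14}$$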

where the minus sign before $\boldsymbol{g}_{t}$ in (12) is cancelled out due to the minus sign in (13).

Next, without loss of generality, we derive an explicit update expression for $\boldsymbol{x}_{t+1}$ in terms of $\tilde{\boldsymbol{m}}_{t}$ based on (13). Taking absolute value per-coordinate on both sides of (13) and then applying algebra produces

$$|\boldsymbol{x}_{t+1}-\boldsymbol{x}_{t}|=|\tilde{\boldsymbol{m}}_{t}|^{1-q}.\tag{15}$$

Finally, plugging (15) into (13) and rearranging the quantities in the equation yields

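$$\boldsymbol{x}_{t+1}=\boldsymbol{x}_{t}-\frac{\tilde{\boldsymbol{m}}_{t}}{|\tilde{\boldsymbol{m}}_{t}|^{q}}.\tag{16}$$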

By letting $\{\tilde{\boldsymbol{m}}_{t}={\boldsymbol{m}}_{t}|t\geq 0\}$ , it is immediate that the expressions (14) and (16) are identical to those in Alg. 1 under the setup $\{(\eta_{t},\xi)=(1,0)\}$ . The proof is complete. ∎

Eq. (12) is a natural extension of (11) for HB. Each steering vector $(\boldsymbol{x}_{t+1}-\boldsymbol{x}_{t})$ in (12) is modulated by the $\frac{q}{1-q}$-th power of its magnitude, which is represented as $|\boldsymbol{x}_{t+1}-\boldsymbol{x}_{t}|^{q/(1-q)}$. In the computation of $\boldsymbol{x}_{t+1}$, the modulation imposes a larger suppression on those elements of $(\boldsymbol{x}_{t+1}-\boldsymbol{x}_{t})$ with large magnitude than on the remaining elements. From an overall perspective, (12) can be viewed as a dynamic system describing the evolution of the modulated steering vectors $\{(\boldsymbol{x}_{i+1}-\boldsymbol{x}_{i})\odot|\boldsymbol{x}_{i+1}-\boldsymbol{x}_{i}|^{q/(1-q)}|i=0,1,\ldots\}$ over iterations.
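Proposition 1 can be sanity-checked numerically. The sketch below runs deterministic RAME with $(\eta_{t},\xi)=(1,0)$ on a toy quadratic, assuming constant $(\alpha,\beta)$ and the inferred recursion $\boldsymbol{m}_{t}=\beta\boldsymbol{m}_{t-1}+\alpha\boldsymbol{g}_{t}$, and then evaluates the residual of the modulated steering-vector recursion for $t\geq 1$. The objective and all names are illustrative assumptions.

```python
import numpy as np

def grad(x):
    """Gradient of a toy quadratic f(x) = 0.5 * (x_1^2 + 10 * x_2^2) (illustrative)."""
    return np.array([1.0, 10.0]) * x

def modulate(s, q):
    """Modulated steering vector s * |s|^(q/(1-q)), computed elementwise."""
    return s * np.abs(s) ** (q / (1.0 - q))

alpha, beta, q = 0.05, 0.9, 0.25
x = np.array([1.0, -1.0])
m = np.zeros(2)
xs, gs = [x.copy()], []
for _ in range(10):
    g = grad(x)
    m = beta * m + alpha * g
    x = x - np.sign(m) * np.abs(m) ** (1.0 - q)   # eta_t = 1, xi = 0
    xs.append(x.copy())
    gs.append(g)

# Residual of the steering-vector recursion for t >= 1; it should vanish up to rounding.
residuals = [
    modulate(xs[t + 1] - xs[t], q)
    - (-alpha * gs[t] + beta * modulate(xs[t] - xs[t - 1], q))
    for t in range(1, 10)
]
max_residual = max(float(np.max(np.abs(r))) for r in residuals)
```

The key identity is that the modulated step equals $-\boldsymbol{m}_{t}$ exactly: the step has magnitude $|\boldsymbol{m}_{t}|^{1-q}$, and raising it to the power $q/(1-q)$ restores the missing factor $|\boldsymbol{m}_{t}|^{q}$.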

V Convergence Analysis for Deterministic RAME

In this section, we provide a convergence analysis for employing deterministic RAME to solve $L$-smooth nonconvex optimization problems. Similarly to Adam with a fixed parameter $\beta_{2}$, the individual learning rates of RAME $\{\frac{1}{|\boldsymbol{m}_{t}|^{q}+\xi}|t=1,2,\ldots\}$ are not guaranteed to decrease monotonically over iterations. Therefore, the approaches in [11, 13, 14] for analyzing AMSGrad, PAdam and AdaGrad cannot be exploited to study either Adam or RAME. To the best of our knowledge, the recent work [1] is the first that provides a rigorous convergence analysis of deterministic Adam for solving $L$-smooth nonconvex optimization problems. In the following, we study RAME by following an analysis similar to the one in [1] for Adam.

We first provide the definition of $L$ -smoothness.

Definition 1 ( $L$ -smoothness).

Suppose $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is differentiable. Then $f$ is $L$ -smooth for some $L>0$ if for any $\boldsymbol{x},\boldsymbol{y}\in\mathbb{R}^{d}$ , we have

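$$f(\boldsymbol{y})\leq f(\boldsymbol{x})+\langle\nabla f(\boldsymbol{x}),\boldsymbol{y}-\boldsymbol{x}\rangle+\frac{L}{2}\|\boldsymbol{y}-\boldsymbol{x}\|^{2}.\tag{17}$$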

Furthermore, $f(\boldsymbol{x})$ is lower bounded, i.e., $\inf_{\boldsymbol{x}}f(\boldsymbol{x})>-\infty$ .
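As a concrete instance of Definition 1, a convex quadratic $f(\boldsymbol{x})=\frac{1}{2}\boldsymbol{x}^{T}\boldsymbol{A}\boldsymbol{x}$ with symmetric positive semidefinite $\boldsymbol{A}$ is $L$-smooth with $L$ equal to the largest eigenvalue of $\boldsymbol{A}$, and it is bounded below by zero. The snippet checks the inequality on random pairs; the matrix, the sample size and the helper names are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
A = B.T @ B                              # symmetric positive semidefinite (toy choice)
L = float(np.linalg.eigvalsh(A)[-1])     # largest eigenvalue, used as the constant L

def f(x):
    return 0.5 * x @ A @ x

def grad_f(x):
    return A @ x

def smoothness_gap(x, y):
    """Upper bound of Definition 1 minus f(y); non-negative when the bound holds."""
    return f(x) + grad_f(x) @ (y - x) + 0.5 * L * np.sum((y - x) ** 2) - f(y)

gaps = [smoothness_gap(rng.normal(size=4), rng.normal(size=4)) for _ in range(100)]
```

For this objective the gap equals $\frac{1}{2}(\boldsymbol{y}-\boldsymbol{x})^{T}(L\boldsymbol{I}-\boldsymbol{A})(\boldsymbol{y}-\boldsymbol{x})$, which is non-negative by the choice of $L$.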

Upon introducing $L$ -smoothness, we present the convergence results of deterministic RAME in a theorem below:

Theorem 1.

Suppose $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is an $L$ -smooth function and the $l_{\infty}$ norm of its gradient $\nabla f(\boldsymbol{x})$ is upper bounded by $\|\nabla f(\boldsymbol{x})\|_{\infty}\leq\sigma$ . Let $\xi>0$ and $(\beta_{t},\alpha_{t})=(\beta,\alpha)$ in Alg. 1. For any $\epsilon>0$ , if the two parameters $(\beta,\alpha)$ are selected to satisfy

[TABLE]

then there exist an iteration index $T$ and a sequence of parameters $\{\eta_{t}>0|t=1,2,\ldots,T\}$ such that

$$\min_{t=2,\ldots,T+1}\|\nabla f(\boldsymbol{x}_{t})\|_{2}\leq\epsilon.\tag{19}$$

Proof.

See proof sketch in Appendix A. The basic idea of the argument is from the proof for Theorem 3.4 in [1] for analyzing deterministic Adam. ∎

In practice, the parameter $\beta$ is usually set to be constant. The condition (18) is therefore rather strict. It remains open to tighten the convergence analysis to derive a looser condition on $\beta$. In Theorem 1, $\alpha$ can be treated as the learning rate while the parameters $\{\eta_{t}>0|t=1,2,\ldots,T\}$ can be taken as additional regulation parameters for the convergence results to hold.
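The role of the regulation parameters $\{\eta_{t}\}$ can be illustrated with the step size $\eta_{t}^{*}$ that appears in the equation list of this page, which minimizes the quadratic upper bound obtained from $L$-smoothness. The sketch below evaluates it on a toy quadratic and compares the realized decrease with the bound $-\frac{1}{2L}\langle\boldsymbol{g},\boldsymbol{u}\rangle^{2}/\|\boldsymbol{u}\|_{2}^{2}$, where $\boldsymbol{u}=\boldsymbol{m}/(|\boldsymbol{m}|^{q}+\xi)$. The objective, the choice of `m`, and all names are hypothetical.

```python
import numpy as np

A = np.diag([1.0, 4.0, 9.0])     # toy quadratic f(x) = 0.5 x^T A x, so L = 9
L = 9.0

def f(x):
    return 0.5 * x @ A @ x

def grad_f(x):
    return A @ x

def eta_star(g, u):
    """eta* = <g, u> / (L * ||u||^2): minimizer of the quadratic upper bound in eta."""
    return (g @ u) / (L * (u @ u))

q, xi = 0.25, 1e-3
x = np.array([1.0, -2.0, 0.5])
g = grad_f(x)
m = 0.1 * g                                  # hypothetical first moment (one step from m = 0)
u = m / (np.abs(m) ** q + xi)                # RAME direction h(m)
eta = eta_star(g, u)
decrease = f(x - eta * u) - f(x)
bound = -(g @ u) ** 2 / (2.0 * L * (u @ u))
```

Here `decrease` is no larger than `bound`, so the objective drops whenever the direction `u` has a nonzero inner product with the gradient. This only illustrates a single step; it does not reproduce the conditions of Theorem 1.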

VI Experimental Results

We conduct experiments for two typical problems in the deep learning community, which are classification and segmentation of images. The segmentation problem can be viewed as performing regression, as the objective function is a combination of binary cross-entropy and Dice loss on an image-pixel level [20].

VI-A Experimental setup

In the experiment, four training methods were evaluated using the Keras-tensorflow platform, which are RAME, SHB, Adam, and RMSprop. To make a fair comparison, all the experiments were conducted based on open-source implementations, links of which will be provided for each task later on. In our implementation, only the training methods and initial learning rates were changed in the original codes for algorithmic comparison.

We now briefly explain the parameter setup for each training method in the experiment. Considering SHB, the parameters $\{\alpha_{t}\}$ are taken as the common learning rates. The setup $(\beta_{t},\eta_{t})=(0.9,1)$ of SHB (see (1)-(2)) was inherited from the open-source implementation for training VGG16, which will be studied in Subsection VI-B later on. The parameters ($\beta_{1},\beta_{2},\xi$) of Adam were set to $(0.9,0.999,10^{-7})$, which are the default values of the Keras platform. This is because the open-source implementation for ResNet20 (see Subsection VI-B) recommends using Adam with default values. Similarly, the parameters of RMSprop were set to the default values of the Keras platform.

As RAME is a natural extension of HB (or SHB), its parameters were set to $(\beta_{t},\eta_{t},\xi)=(0.9,1,0)$, with $q\in\{0.125,0.25\}$, as stated in Alg. 1. Our main motivation for choosing $\xi=0$ is that, with this setup, deterministic RAME possesses a unified update expression in terms of the modulated steering vectors, as summarized in Proposition 1.

Finally, we note that selection of the initial learning rate is essential for the success of a training method. As different training methods are designed by following different strategies, their optimal initial learning rates are usually different (see [12] for an empirical study of several training methods). In our experiment, five initial learning rates, given by $\{10^{-i}|i=1,2,\ldots,5\}$, were tested when employing each method in training a DNN. Only the convergence result of the initial learning rate that produces the best validation performance was selected for comparison.
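The selection procedure described above can be sketched as a simple grid search. Here `train_and_validate` is a hypothetical callable standing in for a full training run; it is assumed to return a validation score for which larger is better (for example, validation accuracy).

```python
def select_initial_learning_rate(train_and_validate, exponents=range(1, 6)):
    """Try the grid {10^-i} and return (best_lr, best_score).

    `train_and_validate(lr)` is a hypothetical user-supplied function that
    trains a model with initial learning rate `lr` and returns a validation
    score where larger is better.
    """
    scores = {}
    for i in exponents:
        lr = 10.0 ** (-i)
        scores[lr] = train_and_validate(lr)
    best_lr = max(scores, key=scores.get)
    return best_lr, scores[best_lr]

# Toy stand-in for a training run: the score peaks at lr = 1e-3.
toy_score = lambda lr: -abs(lr - 1e-3)
best_lr, best_score = select_initial_learning_rate(toy_score)
```

In the experiments each method would get its own search, since the best initial learning rate differs between methods.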

VI-B On training VGG16 and ResNet20 over CIFAR10 and CIFAR100

In the first experiment, we consider training VGG16 [15] and ResNet20 [16], which represent two popular convolutional neural network (CNN) architectures in deep learning. We adopt existing open-source implementations for three tasks: training VGG16 over CIFAR10 and CIFAR100, and training ResNet20 over CIFAR10. (Footnote 2: The code for VGG16 is from https://github.com/geifmany/cifar-vgg. The code for ResNet20 is adopted from https://github.com/keras-team/keras/blob/master/examples/cifar10_resnet.py.)

We note that the original implementation for VGG16 employs SHB while the one for ResNet20 uses Adam. These open-source implementations were selected on purpose to minimize algorithmic bias that favours the original training method.

The convergence behaviours of the four methods are displayed in Fig. 1. It is seen that the initial learning rate of SHB is the largest, followed by those of RAME for $q=0.125$ and $q=0.25$. If we treat SHB as a special case of RAME with $q=0$, it is clear that as the parameter $q$ increases from $0$ to $0.125$ and finally to $0.25$, the best initial learning rate decreases accordingly. This might be because, as $q$ increases, the individual learning rates $\{\frac{1}{|\boldsymbol{m}_{t}|^{q}}\}$ may have an increasing impact on the parameter update, thus only requiring a decreasing contribution from the common learning rates $\{\alpha_{t}\}$. The above observations drawn from RAME are in line with the fact that Adam and RMSprop share the same, smallest initial learning rate.
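A small numerical example may help to see this effect. The sketch below evaluates the individual learning rates $1/|\boldsymbol{m}_{t}|^{q}$ (with $\xi=0$) for a few hypothetical first-moment magnitudes below one; the chosen magnitudes are illustrative and are not taken from the experiments.

```python
import numpy as np

def individual_learning_rates(m, q):
    """Element-wise factors 1 / |m|^q with xi = 0; q = 0 gives all ones (SHB)."""
    return 1.0 / np.abs(m) ** q

m = np.array([1e-1, 1e-2, 1e-3])  # hypothetical first-moment magnitudes
for q in (0.0, 0.125, 0.25):
    print(q, individual_learning_rates(m, q))
```

For magnitudes below one, the factors exceed one and grow with $q$, which is consistent with a smaller common learning rate being sufficient for larger $q$.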

It is observed from Fig. 1 that the validation losses and accuracies of Adam and RMSprop are not consistent for VGG16 compared to those of SHB and RAME. That is, both methods produce low validation losses, while their validation accuracies are not high. The true objective function for classification is binary, representing correct or incorrect recognition decisions over one of a discrete set of classes. To facilitate the training procedure, a continuous objective function in the form of cross-entropy is introduced as an approximate surrogate. The variation of the functional loss in regions where the decision and the ground truth are given is not important. Therefore, validation accuracy can be seen as a more reliable measurement than the validation loss when considering a classification problem.

By inspection of the training losses and validation accuracies of the four methods in Fig. 1, we can conclude that RAME outperforms the other three training methods for VGG16 at the end of the training procedure, even though it converges slowly in the beginning. Furthermore, as the parameter $q$ increases from $0.125$ to $0.25$, RAME delivers decreasing final training loss and increasing final validation accuracy. Considering ResNet20, it is seen that RAME again yields low final training losses compared to the other three methods. As for the final validation accuracies, it performs as well as SHB and Adam.

VI-C On training a multi-layer perceptron (MLP)

In addition to CNNs, we also tested an MLP for classification over CIFAR10, which is in fact a feedforward fully connected neural network. Our primary research goal is to study the convergence behaviours of the four methods for training an MLP rather than producing high validation accuracy. The implementation is based on an open-source implementation available on the Keras platform, where Adam with the default setup is recommended in the original implementation. The tested MLP consists of four layers with $(1024-512-512-10)$ neurons. (Footnote 3: The code for MLP is from https://github.com/aidiary/keras-examples/blob/master/mlp/cifar10.py.)

Fig. 2 displays the convergence results of the four training methods. It is seen that RAME converges more slowly than Adam in the beginning. After a certain number of iterations, it converges faster than the other three methods and produces the lowest final training loss. As for the validation accuracies, the performances of RAME and RMSprop are similar. Both methods produce slightly higher accuracies than Adam and SHB.

VI-D On semantic segmentation

We also conduct an algorithmic comparison for people semantic segmentation, where the goal is to identify all people in an image on a pixel level [21]. To facilitate the training procedure and achieve high accuracy, one approach is to make use of a neural network that is well trained for other purposes as the backend for semantic segmentation. In this work, we choose the version of ResNet152 [17, 16] that is trained for classification over ImageNet as the backend. We adopt an open-source implementation developed for a Kaggle competition for our experiment, where Adam with the default parameter setup was used for training the network. (Footnote 4: The link is https://github.com/selimsef/dsb2018_topcoders.) As the main body of the network already carries informative features of 1000 objects in the ImageNet database, we only need to fine-tune the network for the segmentation task.

In the experiment, the Microsoft COCO-2017 database [22] was employed for training the network. The numbers of images for training and validation are 108344 and 4614, respectively. Roughly half of the images in both the training and validation sets contain persons.

We focus on the performance of SHB, Adam, and RAME (RMSprop suffers from a significant overfitting effect, and its result is left out to avoid distraction). The network was fine-tuned for 50 epochs with each method, and each epoch took about one and a half hours on an Nvidia 1080 Ti GPU. During the training process, the so-called Intersection over Union (IOU) was measured along with the functional loss. The IOU metric reflects the average accuracy of the correctly labelled foreground pixels of people in an image.
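For reference, a minimal NumPy sketch of an IOU computation on binary foreground masks is given below. The function names, the small constant `eps`, and the convention that two empty masks score 1 are assumptions for illustration; the exact IOU variant used in the open-source implementation may differ.

```python
import numpy as np

def binary_iou(pred_mask, true_mask, eps=1e-7):
    """Intersection over Union of two binary masks.

    `eps` keeps the ratio defined when both masks are empty (assumed to
    score 1 here, e.g. for an image without any person).
    """
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    intersection = np.logical_and(pred, true).sum()
    union = np.logical_or(pred, true).sum()
    return float((intersection + eps) / (union + eps))

def mean_iou(pred_masks, true_masks):
    """Average of the per-image IOU values."""
    return float(np.mean([binary_iou(p, t) for p, t in zip(pred_masks, true_masks)]))

pred = np.array([[1, 1], [0, 0]])
true = np.array([[1, 0], [0, 0]])
iou_example = binary_iou(pred, true)  # one shared pixel out of two labelled pixels
```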

The convergence results of the three training methods are displayed in Fig. 3. It is clear that RAME outperforms both SHB and Adam w.r.t. the training loss, validation loss and IOU. On the other hand, the training loss of SHB is noticeably higher than those of Adam and RAME. This suggests that the introduction of individual learning rates in SHB accelerates the convergence speed.

VI-E Overall observations from the experiments

All the above experiments indicate that RAME converges faster than SHB at a later stage of the training procedure. Furthermore, RAME exhibits promising generalisation performance over the validation datasets compared to SHB. The results confirm that it is indeed beneficial to choose smaller learning rates for the elements of $\boldsymbol{m}_{t}$ with large magnitudes in SHB.

If we take into account the fact that RAME is designed by making a minor modification to SHB, the new method is both simple and effective. Unlike Adam and RMSprop, RAME does not need to track the second moment of gradients. Instead, the new method only uses a nonlinear function $\boldsymbol{h}(\boldsymbol{m}_{t})$ of the most recent first moment for parameter update.
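To make the relation to SHB concrete, the following NumPy sketch shows one parameter update. It assumes the recursion $\boldsymbol{m}_{t}=\beta\boldsymbol{m}_{t-1}+\alpha\boldsymbol{g}_{t}$ followed by $\boldsymbol{x}_{t+1}=\boldsymbol{x}_{t}-\eta\,\boldsymbol{m}_{t}/(|\boldsymbol{m}_{t}|^{q}+\xi)$, which is our reading of Alg. 1 rather than a reproduction of the authors' implementation. The function name `rame_step` and its argument names are hypothetical.

```python
import numpy as np

def rame_step(x, m, grad, alpha, beta=0.9, eta=1.0, q=0.25, xi=0.0):
    """One RAME-style update under the assumed form described above.

    Only the first moment `m` is carried between iterations; no second
    moment is stored. With q = 0 and xi = 0 the step is x - eta * m.
    """
    m = beta * m + alpha * grad  # first moment of the gradients
    if xi == 0.0:
        # m / |m|^q written as sign(m) * |m|^(1 - q), which avoids 0/0 at m = 0.
        step = np.sign(m) * np.abs(m) ** (1.0 - q)
    else:
        step = m / (np.abs(m) ** q + xi)
    return x - eta * step, m

x0 = np.array([1.0, -2.0])
m0 = np.zeros(2)
g0 = np.array([0.5, -0.25])
x_shb, m_shb = rame_step(x0, m0, g0, alpha=0.1, q=0.0)     # heavy-ball special case
x_rame, m_rame = rame_step(x0, m0, g0, alpha=0.1, q=0.25)  # RAME with q = 0.25
```

Under these assumptions, the only difference from SHB is the element-wise nonlinearity applied to $\boldsymbol{m}_{t}$ before the parameter update.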

VII Conclusions

In this paper, we have proposed a new adaptive gradient method for training DNNs, which is referred to as rapidly adapting moment estimation (RAME). The new method is designed by computing the individual learning rates based on the most recent first moment of gradients rather than the traditional second moment of gradients. Compared to the popular training method Adam, RAME saves memory equal to the DNN model size by avoiding the storage of the second moment. One nice property of RAME is that its update expression can be interpreted as describing the evolution of the modulated steering vectors $\{(\boldsymbol{x}_{i+1}-\boldsymbol{x}_{i})\odot|\boldsymbol{x}_{i+1}-\boldsymbol{x}_{i}|^{q/(1-q)}|i=0,1,\ldots\}$ over iterations, while, to the best of our knowledge, other adaptive gradient methods do not have such a property. Experimental results for training a number of DNN models demonstrate that RAME produces promising convergence performance in comparison to SHB, Adam, and RMSprop.

One future research direction would be to study the possibility of combining RAME and classical adaptive gradient methods such as Adam for designing a more effective training method.

VIII Acknowledgements

The research is financially supported by Nippon Telegraph and Telephone (NTT) Corporation, Japan, in the form of an industrial project between UTS and NTT.

We gratefully acknowledge the assistance offered by Dr Haopeng Li from Qamcom Research and Technology, Sweden, on the implementation of the people segmentation experiment.

Appendix A Proof Sketch for Theorem 1

We study the convergence of Alg. 1 by following an argument similar to the one in [1] for deterministic Adam. That is, we start from the assumption $\{\|\boldsymbol{g}_{t}\|>\epsilon|t\geq 1\}$ and then show that it leads to a contradiction.

By following the analysis in [1], the first step is to find the optimal parameter $\eta_{t}^{*}$ that leads to a tight upper bound for the functional difference $f(\boldsymbol{x}_{t+1})-f(\boldsymbol{x}_{t})$ . By using the inequality (17) due to $L$ -smoothness, the functional difference at iteration $t$ can be upper bounded as

[TABLE]

where step $(a)$ follows from the update expression of $\boldsymbol{x}_{t+1}$ in Alg. 1. It is noted that the RHS of (20) is a quadratic function of $\eta_{t}$ . It can be shown that when

[TABLE]

the LHS of (20) receives a tight upper bound, which is given by

[TABLE]

which indicates that the functional cost $f(\boldsymbol{x}_{t})$ decreases over iteration $t$.
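The step from the quadratic bound to its minimiser can be illustrated generically. The sketch below assumes an update of the form $\boldsymbol{x}_{t+1}=\boldsymbol{x}_{t}-\eta_{t}\boldsymbol{d}_{t}$ with $\boldsymbol{d}_{t}=\boldsymbol{m}_{t}/(|\boldsymbol{m}_{t}|^{q}+\xi)$ and the standard $L$-smoothness bound $-\eta\langle\boldsymbol{g},\boldsymbol{d}\rangle+\frac{L}{2}\eta^{2}\|\boldsymbol{d}\|_{2}^{2}$. It is not a reproduction of (20)-(22), whose exact form is not shown in this section, and the function names are illustrative.

```python
import numpy as np

def smoothness_bound(eta, g, d, L):
    """Quadratic upper bound on f(x - eta * d) - f(x) for an L-smooth f."""
    return -eta * np.dot(g, d) + 0.5 * L * eta ** 2 * np.dot(d, d)

def optimal_eta(g, d, L):
    """Minimiser of the quadratic bound; positive only when <g, d> > 0."""
    return np.dot(g, d) / (L * np.dot(d, d))

g = np.array([1.0, 2.0])   # hypothetical gradient
d = np.array([0.5, 1.5])   # hypothetical update direction
L = 2.0                    # hypothetical smoothness constant
eta_star = optimal_eta(g, d, L)
best_bound = smoothness_bound(eta_star, g, d, L)
```

At the minimiser the bound equals $-\langle\boldsymbol{g},\boldsymbol{d}\rangle^{2}/(2L\|\boldsymbol{d}\|_{2}^{2})$, which is negative whenever $\langle\boldsymbol{g},\boldsymbol{d}\rangle\neq 0$. This is why an upper bound on $\|\boldsymbol{d}\|_{2}$ and a lower bound on $\langle\boldsymbol{g},\boldsymbol{d}\rangle$ are the quantities of interest.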

As described in [1], the next step is to measure how close the upper bound in (22) is to zero. To do so, it is required to derive an upper bound for $\left\|{\frac{\boldsymbol{m}_{t}}{|\boldsymbol{m}_{t}|^{q}+\xi}}\right\|_{2}$ and a lower bound for $\left\langle\boldsymbol{g}_{t},\frac{\boldsymbol{m}_{t}}{|\boldsymbol{m}_{t}|^{q}+\xi}\right\rangle$, respectively.

We first present two lemmas which will be used for analysis later on:

Lemma 1.

Under the setup $(\beta_{t},\alpha_{t})=(\beta,\alpha)$ , the first moment $\boldsymbol{m}_{t}$ in Alg. 1 can be represented in terms of $\{\boldsymbol{g}_{k}|k=1,\ldots,t\}$ as

[TABLE]

Correspondingly, under the assumption $\|\boldsymbol{g}_{k}\|_{\infty}\leq\sigma$ for all $k\geq 1$ , the $l_{\infty}$ norm of $\boldsymbol{m}_{t}$ is upper bounded as

[TABLE]
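The content of Lemma 1 can be checked numerically under an assumed recursion. The sketch below takes $\boldsymbol{m}_{t}=\beta\boldsymbol{m}_{t-1}+\alpha\boldsymbol{g}_{t}$ with $\boldsymbol{m}_{0}=\boldsymbol{0}$ (our reading of Alg. 1, which is not shown in this section), for which unrolling gives $\boldsymbol{m}_{t}=\alpha\sum_{k=1}^{t}\beta^{t-k}\boldsymbol{g}_{k}$ and hence $\|\boldsymbol{m}_{t}\|_{\infty}\leq\alpha\sigma/(1-\beta)$ by the geometric series. The exact expressions in the lemma may be stated differently; the function names and random gradients are illustrative.

```python
import numpy as np

def first_moment_recursive(grads, alpha, beta):
    """Run the assumed recursion m_t = beta * m_{t-1} + alpha * g_t from m_0 = 0."""
    m = np.zeros_like(grads[0])
    for g in grads:
        m = beta * m + alpha * g
    return m

def first_moment_unrolled(grads, alpha, beta):
    """Closed form alpha * sum_k beta^(t-k) * g_k of the same recursion."""
    t = len(grads)
    return alpha * sum(beta ** (t - k) * g for k, g in enumerate(grads, start=1))

rng = np.random.default_rng(1)
sigma, alpha, beta = 2.0, 0.01, 0.9
grads = [rng.uniform(-sigma, sigma, size=4) for _ in range(20)]  # |g_k|_inf <= sigma

m_rec = first_moment_recursive(grads, alpha, beta)
m_unr = first_moment_unrolled(grads, alpha, beta)
bound = alpha * sigma / (1.0 - beta)  # geometric-series bound on the l_inf norm
```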

Lemma 2.

The minimum and maximum eigenvalues of the diagonal matrix $\textrm{diag}(|\boldsymbol{m}_{t}|^{q}+\xi)$ satisfy

[TABLE]

We now consider deriving an upper bound for $\left\|{\frac{\boldsymbol{m}_{t}}{|\boldsymbol{m}_{t}|^{q}+\xi}}\right\|_{2}$ . It is straightforward that

[TABLE]

where the last inequality follows from (24).

Inspired by the corresponding analysis in [1], the lower bound for $\left\langle\boldsymbol{g}_{t},\frac{\boldsymbol{m}_{t}}{|\boldsymbol{m}_{t}|^{q}+\xi}\right\rangle$ can be derived as

[TABLE]

where $(a)$ follows from (25), $(b)$ uses the triangle inequality, $(c)$ follows from (25)-(26).

Now we are in a position to find the support regions for $\beta$ and $\alpha$ such that the lower bound in (28) is positive, which is crucial to ensure that $\eta_{t}^{*}$ in (21) is positive. Similarly to the work [1], we first consider the support region for $\beta$ . It is clear from (28) that $\beta$ should be chosen such that

[TABLE]

where the assumption $\|\boldsymbol{g}_{t}\|_{2}>\epsilon$ is exploited. Rearranging the inequality (29) produces an upper bound for $\beta$ :

[TABLE]

To simplify analysis later on, a scalar parameter $\theta_{1}$ is introduced as follows

[TABLE]

Suppose $\beta$ satisfies the condition (30). The parameter $\alpha$ should be selected such that

[TABLE]

Based on the above inequality, an upper bound for $\alpha$ can be derived as

[TABLE]

Similarly to $\theta_{1}$ , a new parameter $\theta_{2}$ can be introduced as follows

[TABLE]

A similar definition of $\theta_{1}$ and $\theta_{2}$ can also be found in [1] for deterministic Adam. Finally, under the two conditions (30) and (32), the lower bound (28) can be simplified as

[TABLE]

Upon deriving the upper and lower bounds (27) and (33), the final upper bound for the functional difference in (22) can be represented as

[TABLE]

Summing (34) from $t=2$ until $t=T+1$ produces

[TABLE]

As a result, we have

[TABLE]

where $\boldsymbol{x}^{*}$ represents the optimal solution. If $T$ is chosen to be

[TABLE]

the RHS of (35) is upper bounded by $\epsilon^{2}$, which contradicts the assumption $\{\|\boldsymbol{g}_{t}\|>\epsilon|t\geq 1\}$. The proof is complete.

Bibliography

[1] S. De, A. Mukherjee, and E. Ullah, “Convergence guarantees for RMSProp and ADAM in non-convex optimization and an empirical comparison to Nesterov acceleration,” arXiv:1807.06766 [cs.LG], 2018.
[2] Y. LeCun, Y. Bengio, and G. Hinton, “Deep Learning,” Nature, vol. 521, pp. 436–444, 2015.
[3] B. T. Polyak, “Some methods of speeding up the convergence of iteration methods,” USSR Computational Mathematics and Mathematical Physics, vol. 4, pp. 1–17, 1964.
[4] Y. Nesterov, “A method of solving a convex programming problem with convergence rate O(1/sqr(k)),” Soviet Mathematics Doklady, vol. 27, pp. 372–376, 1983.
[5] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” in International Conference on Machine Learning (ICML), 2013.
[6] T. Yang, Q. Lin, and Z. Li, “Unified Convergence Analysis of Stochastic Momentum Methods for Convex and Non-Convex Optimization,” arXiv:1604.03257, 2016.
[7] J. Duchi, E. Hazan, and Y. Singer, “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization,” Journal of Machine Learning Research, vol. 12, pp. 2121–2159, 2011.
[8] T. Tieleman and G. Hinton, “Lecture 6.5-RMSProp: Divide the Gradient by a Running Average of Its Recent Magnitude,” COURSERA: Neural Networks for Machine Learning, pp. 26–31, 2012.
