Exploiting Uncertainty of Loss Landscape for Stochastic Optimization

Vineeth S. Bhaskara; Sneha Desai

arXiv:1905.13200 · cs.LG · May 31, 2019

TL;DR

This paper presents new stochastic optimization methods that incorporate loss landscape uncertainty, improving convergence and generalization in training neural networks by using variance-based momentum variants and a novel regularization technique.

Contribution

It introduces variance-aware momentum variants and a data-driven stochastic regularization method, enhancing optimization and generalization in deep learning.

Findings

01 Improved convergence rate on MNIST and CIFAR-10 datasets.

02 Enhanced generalization through variance-based momentum.

03 Effective exploration in non-convex optimization landscapes.

Abstract

We introduce novel variants of momentum by incorporating the variance of the stochastic loss function. The variance characterizes the confidence or uncertainty of the local features of the averaged loss surface across the i.i.d. subsets of the training data defined by the mini-batches. We show two applications of the gradient of the variance of the loss function. First, as a bias to the conventional momentum update to encourage conformity of the local features of the loss function (e.g. local minima) across mini-batches to improve generalization and the cumulative training progress made per epoch. Second, as an alternative direction for "exploration" in the parameter space, especially, for non-convex objectives, that exploits both the optimistic and pessimistic views of the loss function in the face of uncertainty. We also introduce a novel data-driven stochastic regularization…

Tables

Table 1: Best hyperparameters $\eta$ obtained for the AdamUCB, AdamCB and AdamS algorithms when training on MNIST and CIFAR-10 with various architectures.

Dataset	Model	Batch Size	Dropout	$\eta_{UCB}$	$\eta_{CB}$	$\eta_{S}$
MNIST	LR	128	NO	$0.01$	$0.001$	$0.0001$
MNIST	MLP	128	NO	$0.1$	$0.001$	$0.005$
MNIST	MLP	128	YES	$0.1$	$0.0005$	$0.005$
MNIST	MLP	16	NO	$0.3$	$0.0001$	$0.05$
CIFAR-10	CNN	128	NO	$0.01$	$5 \times 10^{-5}$	$0.0001$
CIFAR-10	CNN	128	YES	$0.05$	$0.0001$	$0.0001$
CIFAR-10	CNN	16	NO	$0.3$	$1 \times 10^{-5}$	$0.005$

Table 2: Performance of our optimizers (with the best values of $\eta$ according to Table 1) compared to Adam when training an LR model on MNIST with batch size 128. The values represent mean $\pm$ std computed over three randomly seeded runs of the training.

Epoch	Optimizer	Train Loss	Val. Loss	Val. Acc. (%)
3	Adam	$0.284 \pm 0.000$	$0.279 \pm 0.003$	$91.983 \pm 0.172$
3	AdamUCB	$0.298 \pm 0.001$	$0.282 \pm 0.003$	$91.927 \pm 0.099$
3	AdamCB	$0.333 \pm 0.004$	$0.307 \pm 0.002$	$91.467 \pm 0.115$
3	AdamS	$0.298 \pm 0.001$	$0.282 \pm 0.003$	$91.940 \pm 0.090$
20	Adam	$0.249 \pm 0.000$	$0.273 \pm 0.003$	$92.550 \pm 0.149$
20	AdamUCB	$0.250 \pm 0.001$	$0.271 \pm 0.003$	$92.553 \pm 0.136$
20	AdamCB	$0.255 \pm 0.001$	$0.270 \pm 0.003$	$92.447 \pm 0.106$
20	AdamS	$0.250 \pm 0.001$	$0.271 \pm 0.003$	$92.557 \pm 0.142$
45	Adam	$0.241 \pm 0.001$	$0.279 \pm 0.001$	$92.500 \pm 0.114$
45	AdamUCB	$0.241 \pm 0.001$	$0.278 \pm 0.002$	$92.493 \pm 0.105$
45	AdamCB	$0.242 \pm 0.002$	$0.278 \pm 0.001$	$92.393 \pm 0.086$
45	AdamS	$0.241 \pm 0.001$	$0.278 \pm 0.002$	$92.510 \pm 0.131$

Table 3: Performance of our optimizers (with the best values of $\eta$ according to Table 1) compared to Adam when training an MLP on MNIST under different configurations. The values represent mean $\pm$ std computed over three randomly seeded runs of the training.

Batch Size	Dropout	Epoch	Optimizer	Train Loss	Val. Loss	Val. Acc. (%)
128	NO	3	Adam	$0.062 \pm 0.001$	$0.077 \pm 0.008$	$97.667 \pm 0.178$
128	NO	3	AdamUCB	$0.055 \pm 0.004$	$0.075 \pm 0.006$	$97.573 \pm 0.204$
128	NO	3	AdamCB	$0.084 \pm 0.002$	$0.089 \pm 0.002$	$97.127 \pm 0.075$
128	NO	3	AdamS	$0.059 \pm 0.002$	$0.074 \pm 0.003$	$97.607 \pm 0.160$
128	NO	20	Adam	$0.015 \pm 0.001$	$0.077 \pm 0.011$	$98.037 \pm 0.156$
128	NO	20	AdamUCB	$0.016 \pm 0.006$	$0.061 \pm 0.004$	$98.150 \pm 0.089$
128	NO	20	AdamCB	$0.120 \pm 0.095$	$0.109 \pm 0.046$	$96.663 \pm 1.425$
128	NO	20	AdamS	$0.007 \pm 0.002$	$0.061 \pm 0.007$	$98.343 \pm 0.139$
128	NO	45	Adam	$0.009 \pm 0.003$	$0.076 \pm 0.003$	$98.123 \pm 0.202$
128	NO	45	AdamUCB	$0.004 \pm 0.001$	$0.057 \pm 0.001$	$98.417 \pm 0.095$
128	NO	45	AdamCB	$0.017 \pm 0.005$	$0.066 \pm 0.005$	$98.053 \pm 0.085$
128	NO	45	AdamS	$0.003 \pm 0.001$	$0.063 \pm 0.004$	$98.353 \pm 0.038$
128	YES	3	Adam	$0.112 \pm 0.001$	$0.082 \pm 0.001$	$97.453 \pm 0.045$
128	YES	3	AdamUCB	$0.108 \pm 0.003$	$0.083 \pm 0.004$	$97.333 \pm 0.125$
128	YES	3	AdamCB	$0.123 \pm 0.004$	$0.095 \pm 0.003$	$97.017 \pm 0.057$
128	YES	3	AdamS	$0.108 \pm 0.003$	$0.082 \pm 0.006$	$97.403 \pm 0.222$
128	YES	20	Adam	$0.054 \pm 0.003$	$0.060 \pm 0.001$	$98.197 \pm 0.040$
128	YES	20	AdamUCB	$0.065 \pm 0.009$	$0.060 \pm 0.001$	$98.223 \pm 0.133$
128	YES	20	AdamCB	$0.107 \pm 0.050$	$0.068 \pm 0.009$	$98.003 \pm 0.237$
128	YES	20	AdamS	$0.062 \pm 0.012$	$0.065 \pm 0.003$	$98.120 \pm 0.090$
128	YES	45	Adam	$0.046 \pm 0.002$	$0.055 \pm 0.002$	$98.423 \pm 0.071$
128	YES	45	AdamUCB	$0.052 \pm 0.016$	$0.052 \pm 0.003$	$98.497 \pm 0.119$
128	YES	45	AdamCB	$0.081 \pm 0.045$	$0.066 \pm 0.016$	$98.060 \pm 0.418$
128	YES	45	AdamS	$0.039 \pm 0.009$	$0.055 \pm 0.002$	$98.470 \pm 0.108$
16	NO	3	Adam	$0.090 \pm 0.002$	$0.098 \pm 0.003$	$97.113 \pm 0.168$
16	NO	3	AdamUCB	$0.093 \pm 0.002$	$0.094 \pm 0.007$	$97.157 \pm 0.220$
16	NO	3	AdamCB	$0.117 \pm 0.005$	$0.109 \pm 0.002$	$96.570 \pm 0.056$
16	NO	3	AdamS	$0.095 \pm 0.002$	$0.084 \pm 0.007$	$97.493 \pm 0.233$
16	NO	20	Adam	$0.042 \pm 0.001$	$0.107 \pm 0.004$	$97.373 \pm 0.131$
16	NO	20	AdamUCB	$0.043 \pm 0.002$	$0.108 \pm 0.022$	$97.313 \pm 0.500$
16	NO	20	AdamCB	$0.066 \pm 0.004$	$0.105 \pm 0.008$	$96.983 \pm 0.327$
16	NO	20	AdamS	$0.045 \pm 0.002$	$0.128 \pm 0.011$	$96.910 \pm 0.456$
16	NO	45	Adam	$0.035 \pm 0.000$	$0.105 \pm 0.020$	$97.793 \pm 0.246$
16	NO	45	AdamUCB	$0.037 \pm 0.003$	$0.104 \pm 0.021$	$97.690 \pm 0.270$
16	NO	45	AdamCB	$0.050 \pm 0.005$	$0.111 \pm 0.030$	$97.117 \pm 0.692$
16	NO	45	AdamS	$0.039 \pm 0.003$	$0.099 \pm 0.011$	$97.757 \pm 0.159$

Table 4: Performance of our optimizers (with the best values of $\eta$ according to Table 1) compared to Adam when training a c64-c64-c128-1000 CNN on CIFAR-10 under different configurations. The values represent mean $\pm$ std computed over three randomly seeded runs of the training.

Batch Size	Dropout	Epoch	Optimizer	Train Loss	Val. Loss	Val. Acc. (%)
128	NO	3	Adam	$1.133 \pm 0.091$	$1.072 \pm 0.051$	$62.030 \pm 1.884$
128	NO	3	AdamUCB	$1.055 \pm 0.005$	$0.971 \pm 0.028$	$65.800 \pm 1.025$
128	NO	3	AdamCB	$1.000 \pm 0.013$	$0.912 \pm 0.049$	$67.803 \pm 2.104$
128	NO	3	AdamS	$0.994 \pm 0.015$	$0.903 \pm 0.024$	$68.073 \pm 1.395$
128	NO	20	Adam	$0.646 \pm 0.078$	$0.698 \pm 0.054$	$76.253 \pm 1.975$
128	NO	20	AdamUCB	$0.561 \pm 0.004$	$0.655 \pm 0.025$	$77.843 \pm 0.939$
128	NO	20	AdamCB	$0.517 \pm 0.004$	$0.646 \pm 0.012$	$78.370 \pm 0.210$
128	NO	20	AdamS	$0.517 \pm 0.006$	$0.616 \pm 0.009$	$79.220 \pm 0.448$
128	NO	45	Adam	$0.514 \pm 0.072$	$0.651 \pm 0.062$	$78.573 \pm 2.584$
128	NO	45	AdamUCB	$0.425 \pm 0.009$	$0.589 \pm 0.012$	$81.073 \pm 0.505$
128	NO	45	AdamCB	$0.397 \pm 0.028$	$0.594 \pm 0.026$	$80.987 \pm 0.885$
128	NO	45	AdamS	$0.383 \pm 0.006$	$0.586 \pm 0.025$	$81.803 \pm 0.665$
128	YES	3	Adam	$1.183 \pm 0.015$	$1.061 \pm 0.021$	$62.470 \pm 0.489$
128	YES	3	AdamUCB	$1.171 \pm 0.025$	$1.060 \pm 0.031$	$62.383 \pm 0.957$
128	YES	3	AdamCB	$1.104 \pm 0.025$	$1.020 \pm 0.052$	$64.110 \pm 1.187$
128	YES	3	AdamS	$1.143 \pm 0.013$	$1.035 \pm 0.023$	$63.133 \pm 0.740$
128	YES	20	Adam	$0.756 \pm 0.013$	$0.709 \pm 0.022$	$75.500 \pm 0.568$
128	YES	20	AdamUCB	$0.743 \pm 0.014$	$0.686 \pm 0.019$	$76.430 \pm 0.661$
128	YES	20	AdamCB	$0.677 \pm 0.006$	$0.641 \pm 0.011$	$78.063 \pm 0.422$
128	YES	20	AdamS	$0.699 \pm 0.039$	$0.653 \pm 0.036$	$77.727 \pm 1.279$
128	YES	45	Adam	$0.650 \pm 0.006$	$0.599 \pm 0.027$	$79.617 \pm 0.950$
128	YES	45	AdamUCB	$0.639 \pm 0.008$	$0.595 \pm 0.008$	$79.947 \pm 0.240$
128	YES	45	AdamCB	$0.573 \pm 0.001$	$0.549 \pm 0.014$	$81.203 \pm 0.565$
128	YES	45	AdamS	$0.593 \pm 0.032$	$0.566 \pm 0.025$	$80.793 \pm 0.926$
16	NO	3	Adam	$1.447 \pm 0.059$	$1.357 \pm 0.064$	$50.920 \pm 3.231$
16	NO	3	AdamUCB	$1.390 \pm 0.042$	$1.283 \pm 0.057$	$54.137 \pm 2.479$
16	NO	3	AdamCB	$1.335 \pm 0.019$	$1.236 \pm 0.014$	$56.003 \pm 1.024$
16	NO	3	AdamS	$1.319 \pm 0.031$	$1.225 \pm 0.052$	$56.763 \pm 1.792$
16	NO	20	Adam	$1.050 \pm 0.073$	$1.001 \pm 0.057$	$65.203 \pm 1.932$
16	NO	20	AdamUCB	$0.922 \pm 0.054$	$0.914 \pm 0.069$	$68.520 \pm 2.721$
16	NO	20	AdamCB	$0.923 \pm 0.022$	$0.918 \pm 0.018$	$68.757 \pm 0.273$
16	NO	20	AdamS	$0.918 \pm 0.020$	$0.897 \pm 0.015$	$69.443 \pm 0.425$
16	NO	45	Adam	$0.825 \pm 0.026$	$0.841 \pm 0.019$	$71.887 \pm 0.870$
16	NO	45	AdamUCB	$0.766 \pm 0.041$	$0.784 \pm 0.040$	$73.393 \pm 1.351$
16	NO	45	AdamCB	$0.783 \pm 0.009$	$0.801 \pm 0.037$	$72.793 \pm 1.072$
16	NO	45	AdamS	$0.765 \pm 0.015$	$0.768 \pm 0.033$	$74.063 \pm 1.095$

Equations

$$\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta}_{t-1})=\mathbb{E}_{\mathcal{P}(i)}\left[\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(i)}(\boldsymbol{\theta}_{t-1})\right]=\frac{1}{M}\left[\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(0)}(\boldsymbol{\theta}_{t-1})+\cdots+\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(M-1)}(\boldsymbol{\theta}_{t-1})\right].$$

$$\boldsymbol{m}_{t}=(1-\beta)\cdot\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(i)}(\boldsymbol{\theta}_{t-1})+\cdots+(1-\beta)\beta^{t-1}\cdot\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(i-t+1)}(\boldsymbol{\theta}_{0}).$$

$$\widehat{\boldsymbol{m}}_{t}=\frac{1}{(1-\beta^{t})/(1-\beta)}\left[\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(i)}(\boldsymbol{\theta}_{t-1})+\beta\cdot\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(i-1)}(\boldsymbol{\theta}_{t-2})+\cdots+\beta^{t-1}\cdot\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(i-t+1)}(\boldsymbol{\theta}_{0})\right].$$

$$\mathcal{P}_{\beta}(p\mid t)=\frac{\beta^{p}}{(1-\beta^{t})/(1-\beta)}.\quad\left(\text{Also note that }\sum_{p=0}^{t-1}\mathcal{P}_{\beta}(p\mid t)=\frac{\sum_{p=0}^{t-1}\beta^{p}}{(1-\beta^{t})/(1-\beta)}=1.\right)$$

$$\mathcal{L}^{UCB}(\boldsymbol{\theta})=\mathbb{E}_{i}\left[\mathcal{L}^{(i)}(\boldsymbol{\theta})\right]+\eta\cdot\sqrt{\text{Var}_{i}\left[\mathcal{L}^{(i)}(\boldsymbol{\theta})\right]},$$

$$\begin{aligned}\nabla_{\boldsymbol{\theta}}\mathcal{L}^{UCB}(\boldsymbol{\theta})&=\mathbb{E}_{i}\left[\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(i)}(\boldsymbol{\theta})\right]+\frac{\eta}{\sigma_{l}}\cdot\left\{\mathbb{E}_{i}\left[\mathcal{L}^{(i)}(\boldsymbol{\theta})\cdot\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(i)}(\boldsymbol{\theta})\right]-\mu_{l}\cdot\mathbb{E}_{i}\left[\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(i)}(\boldsymbol{\theta})\right]\right\}\\&=\mathbb{E}_{i}\left[\left(1+\eta\,\frac{\mathcal{L}^{(i)}(\boldsymbol{\theta})-\mu_{l}}{\sigma_{l}}\right)\cdot\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(i)}(\boldsymbol{\theta})\right],\end{aligned}$$

$$\delta\boldsymbol{\theta}_{\text{AdamUCB}}$$

$$\begin{aligned}\nabla_{\boldsymbol{\theta}}\mathbb{E}_{P(\tau)}[r(\tau)]&=\mathbb{E}_{P(\tau)}\left[r(\tau)\cdot\nabla_{\boldsymbol{\theta}}\log P(\tau)\right]-\mathbb{E}_{P(\tau)}[r(\tau)]\cdot\mathbb{E}_{P(\tau)}\left[\nabla_{\boldsymbol{\theta}}\log P(\tau)\right]\\&=\text{Cov}_{P(\tau)}\left[r(\tau),\nabla_{\boldsymbol{\theta}}\log P(\tau)\right].\end{aligned}$$

$$\mathbb{E}_{i}\left[\left(1+\eta\,\frac{\mathcal{L}^{(i)}(\boldsymbol{\theta})-\mu_{l}}{\sigma_{l}}\right)\cdot\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(i)}(\boldsymbol{\theta})\right]=\underbrace{\mathbb{E}_{i}\left[\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(i)}(\boldsymbol{\theta})\right]}_{\text{conventional momentum}}+\frac{\eta}{\sigma_{l}}\cdot\underbrace{\mathbb{E}_{i}\left[\left(\mathcal{L}^{(i)}(\boldsymbol{\theta})-\mu_{l}\right)\cdot\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(i)}(\boldsymbol{\theta})\right]}_{\text{variance-gradient}}.$$

$$\mathbb{E}_{i}\left[\left(\mathcal{L}^{(i)}(\boldsymbol{\theta})-\mu_{l}\right)\cdot\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(i)}(\boldsymbol{\theta})\right]=\text{Cov}_{i}\left[\mathcal{L}^{(i)}(\boldsymbol{\theta}),\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(i)}(\boldsymbol{\theta})\right].$$

$$\underbrace{\frac{1}{\sigma_{l}}\Big\langle\left[\sigma_{l}+\eta\cdot(l_{t}-\mu_{l})\right]\boldsymbol{g}_{t}\Big\rangle}_{\text{MomentumUCB (one-sided bound)}}\longrightarrow\underbrace{\frac{1}{\sigma_{l}}\Big\langle\left[\sigma_{l}-\left(\eta-\frac{\sigma_{l}}{|\mu_{l}|}\right)\cdot(l_{t}-\mu_{l})\right]\boldsymbol{g}_{t}\Big\rangle}_{\text{MomentumCB (two-sided bound)}}.$$

$$\delta\boldsymbol{\theta}_{\text{AdamCB}}=\frac{\Big\langle\left[\sigma_{l}-\left(\eta-\frac{\sigma_{l}}{|\mu_{l}|}\right)(l_{t}-\mu_{l})\right]\boldsymbol{g}_{t}\Big\rangle_{\beta_{1}}}{\sqrt{\Big\langle\left[\sigma_{l}-\left(\eta-\frac{\sigma_{l}}{|\mu_{l}|}\right)(l_{t}-\mu_{l})\right]^{2}\boldsymbol{g}_{t}^{2}\Big\rangle_{\beta_{2}}}}=\frac{\Big\langle\left[\sigma_{l}|\mu_{l}|-\left(\eta|\mu_{l}|-\sigma_{l}\right)(l_{t}-\mu_{l})\right]\boldsymbol{g}_{t}\Big\rangle_{\beta_{1}}}{\sqrt{\Big\langle\left[\sigma_{l}|\mu_{l}|-\left(\eta|\mu_{l}|-\sigma_{l}\right)(l_{t}-\mu_{l})\right]^{2}\boldsymbol{g}_{t}^{2}\Big\rangle_{\beta_{2}}}}.$$

$$\hat{\mathcal{L}}(\boldsymbol{\theta})=\mathbb{E}_{i}\left[\mathcal{L}^{(i)}(\boldsymbol{\theta})\right]+\hat{N}\cdot\sqrt{\text{Var}_{i}\left[\mathcal{L}^{(i)}(\boldsymbol{\theta})\right]}.$$

$$\text{when}~\hat{N}\sim\mathcal{N}(0,\eta)\implies\hat{\mathcal{L}}\sim\mathcal{N}\left(\mathbb{E}_{i}\left[\mathcal{L}^{(i)}(\boldsymbol{\theta})\right],~\eta\cdot\sqrt{\text{Var}_{i}\left[\mathcal{L}^{(i)}(\boldsymbol{\theta})\right]}\right),$$

$$\mathbb{E}_{\mathcal{N}}[\hat{\mathcal{L}}]=\mathbb{E}_{i}\left[\mathcal{L}^{(i)}(\boldsymbol{\theta})\right]+\mathbb{E}_{\mathcal{N}}[\hat{N}]\cdot\sqrt{\text{Var}_{i}\left[\mathcal{L}^{(i)}(\boldsymbol{\theta})\right]}=\mathbb{E}_{i}\left[\mathcal{L}^{(i)}(\boldsymbol{\theta})\right],$$

Code & Models

Repositories

bsvineethiitg/adams (PyTorch, official)

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Applications · Advanced Neural Network Applications

Methods: Adam · REINFORCE

Full text

Exploiting Uncertainty of Loss Landscape for Stochastic Optimization

Vineeth S. Bhaskara

Department of Computer Science

University of Toronto

[email protected]

Sneha Desai

Department of Computer Science

University of Toronto

[email protected]

Abstract

We introduce novel variants of momentum by incorporating the variance of the stochastic loss function. The variance characterizes the confidence or uncertainty of the local features of the averaged loss surface across the i.i.d. subsets of the training data defined by the mini-batches. We show two applications of the gradient of the variance of the loss function. First, as a bias to the conventional momentum update to encourage conformity of the local features of the loss function (e.g. local minima) across mini-batches to improve generalization and the cumulative training progress made per epoch. Second, as an alternative direction for "exploration" in the parameter space, especially, for non-convex objectives, that exploits both the optimistic and pessimistic views of the loss function in the face of uncertainty. We also introduce a novel data-driven stochastic regularization technique through the parameter update rule that is model-agnostic and compatible with arbitrary architectures. We further establish connections to probability distributions over loss functions and the REINFORCE policy gradient update with baseline in RL. Finally, we incorporate the new variants of momentum proposed into Adam, and empirically show that our methods improve the rate of convergence of training based on our experiments on the MNIST and CIFAR-10 datasets.

1 Introduction

Code for our optimizers and experiments is publicly available at https://github.com/bsvineethiitg/adams.

Training deep neural networks by stochastic gradient descent has been highly successful in solving several important tasks in vision (He et al., 2016), language (Child et al., 2019), and Reinforcement Learning (RL) (Silver et al., 2017). Predominantly, the training procedure for modern deep neural networks involves some variation of vanilla stochastic gradient descent (SGD), where updates to the parameters are based on the gradient computed over the current mini-batch’s loss function.

The mini-batch gradient is a noisy unbiased estimator of the full-gradient. A widely used method to stabilize the mini-batch gradient is momentum, where parameter updates are based on an exponentially weighted average of the previous mini-batch gradients, thus "smoothing" out the oscillations in the updates (Goh, 2017). This improves training speed and convergence significantly (Sutskever et al., 2013), and has remained an essential component of modern optimization algorithms such as Adam and AdaMax (Kingma & Ba, 2014). In Section 3, we present an alternative perspective on why momentum (with the bias-correction term) works by showing that it approximates the full-gradient under certain assumptions and an exponential probability distribution.

In this paper, we propose exploiting the gradient of the second moment (or the "variance-gradient") of the stochastic loss function across mini-batches to quantify the uncertainty or error of the gradient of the first moment estimate (or the momentum) in approximating the full-gradient. The variance-gradient points along directions of the loss surface where the local features either conform or disagree the most across mini-batches.

In Section 4, we introduce MomentumUCB, a biased version of the momentum method that encourages updates along regions of the loss surface that locally conform across mini-batches, in addition to the objective of minimizing the expected loss. In Section 5, we introduce MomentumCB and MomentumS, biased and unbiased versions of momentum, respectively, which exploit both the optimistic and pessimistic views of the loss surface in the face of uncertainty.

2 Related work

Previous work such as SAG (Roux et al., 2012), SAGA (Defazio et al., 2014) and SVRG (Johnson & Zhang, 2013) proposes accelerating SGD through variance reduction of the mini-batch gradient by introducing a baseline for the gradient that is computed every $m$ steps.

Unlike the above work, our paper focuses on variance reduction of the underlying stochastic loss function rather than dealing directly with the variance of the gradient. Similar to SAG and SVRG, we introduce MomentumUCB, a biased estimator that has an additional variance minimization objective. We also empirically show that instead of maximizing or minimizing the variance of the stochastic loss objective throughout the optimization, exploration in the parameter space by alternating between optimistic and pessimistic views of the loss landscape accelerates training and provides an unbiased estimate of the full-gradient.

Similar to SAGA, we introduce MomentumS that is an unbiased estimator. In contrast to SAG, SAGA and SVRG, our variants of momentum introduced in the paper are computationally similar in cost to SGD with conventional momentum.

3 Momentum as an approximation to the full-gradient

Full-gradient

Consider the "full-loss" function $\mathcal{L}(\boldsymbol{\theta})$ under the parameters $\boldsymbol{\theta}$ over the entire training dataset. Let the integer index $i\in[0,M-1]$ denote the $i^{\text{th}}$ mini-batch out of a total of $M$ mini-batches of the training dataset. Then one may write the full-batch loss function as $\mathcal{L}(\boldsymbol{\theta})=\mathbb{E}_{\mathcal{P}(i)}\left[\mathcal{L}^{(i)}(\boldsymbol{\theta})\right]$ under $\mathcal{P}(i)=\frac{1}{M}$. Therefore, the full-gradient at time step $t$ with parameters $\boldsymbol{\theta}_{t-1}$ may be explicitly written as:

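$$\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta}_{t-1})=\mathbb{E}_{\mathcal{P}(i)}\left[\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(i)}(\boldsymbol{\theta}_{t-1})\right]=\frac{1}{M}\left[\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(0)}(\boldsymbol{\theta}_{t-1})+\cdots+\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(M-1)}(\boldsymbol{\theta}_{t-1})\right]. \tag{1}$$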

SGD with momentum and bias-correction term

Consider stochastic gradient descent with momentum and bias-correction term in Algorithm (1). At time step $t$ (with the current mini-batch labeled by $i$), one may unroll the recurrence for $\boldsymbol{m}_{t}$ as

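$$\boldsymbol{m}_{t}=(1-\beta)\cdot\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(i)}(\boldsymbol{\theta}_{t-1})+\cdots+(1-\beta)\beta^{t-1}\cdot\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(i-t+1)}(\boldsymbol{\theta}_{0}).$$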

Note that the operations on index $i$ are all done modulo $M$ so that the indices still represent one of the $M$ mini-batches. In Algorithm (1), $\boldsymbol{m}_{t}$ corresponds to an exponentially weighted sum (at timescale $\beta$) of the previous gradients across the mini-batches. The exponentially weighted average, $\widehat{\boldsymbol{m}}_{t}$, is obtained by dividing $\boldsymbol{m}_{t}$ by the sum of the exponential weights $\sum_{j=0}^{t-1}(1-\beta)\beta^{j}$, which gives precisely the term $(1-\beta^{t})$ that is referred to as the "bias-correction term" in Adam (Kingma & Ba, 2014). Therefore, $\widehat{\boldsymbol{m}}_{t}=\boldsymbol{m}_{t}/(1-\beta^{t})$ is the exponentially weighted average of the gradients across the mini-batches at time step $t$. Writing $\widehat{\boldsymbol{m}}_{t}$ explicitly, we have

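$$\widehat{\boldsymbol{m}}_{t}=\frac{1}{(1-\beta^{t})/(1-\beta)}\left[\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(i)}(\boldsymbol{\theta}_{t-1})+\beta\cdot\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(i-1)}(\boldsymbol{\theta}_{t-2})+\cdots+\beta^{t-1}\cdot\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(i-t+1)}(\boldsymbol{\theta}_{0})\right]. \tag{4}$$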

Compare Eq. (4) with the full-gradient in Eq. (1). Since one typically cannot afford to evaluate the network at the current parameters $\boldsymbol{\theta}_{t-1}$ for every mini-batch to get the full-gradient at time step $t$, momentum compromises by using an approximation to the full-gradient, noting that $\boldsymbol{\theta}_{t-1}\approx\boldsymbol{\theta}_{t-2}$ when the learning rate $\alpha$ is sufficiently small. An exponential decay weight proportional to $\beta^{p}$ is applied to the gradient at parameters $\boldsymbol{\theta}_{t-1-p}$, since the approximation $\boldsymbol{\theta}_{t-1}\approx\boldsymbol{\theta}_{t-1-p}$ becomes less reasonable as $p$ gets larger. This can be seen by explicitly rewriting Eq. (4) as an expectation under an exponentially weighted probability distribution $\mathcal{P}_{\beta}(p|t)$ at a given time step $t$ (defined below) as follows:

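$$\widehat{\boldsymbol{m}}_{t}=\mathbb{E}_{p\sim\mathcal{P}_{\beta}(p|t)}\left[\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(i-p)}(\boldsymbol{\theta}_{t-1-p})\right],\qquad\mathcal{P}_{\beta}(p|t)=\frac{\beta^{p}}{(1-\beta^{t})/(1-\beta)},\quad p\in\{0,\dots,t-1\}. \tag{5}$$

(Also note that $\sum_{p=0}^{t-1}\mathcal{P}_{\beta}(p|t)=\frac{\sum_{p=0}^{t-1}\beta^{p}}{(1-\beta^{t})/(1-\beta)}=1$.)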

Thus, $\widehat{\boldsymbol{m}}_{t}$ can be viewed as an approximation to the full-gradient $\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta}_{t-1})$ at time step $t$, mini-batch index $i=(t-1) \bmod M$, and parameters $\boldsymbol{\theta}_{t-1}$.
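The equivalence between the bias-corrected momentum recurrence and an expectation under $\mathcal{P}_{\beta}(p|t)$ can be checked numerically. The following is a minimal sketch, not taken from the paper's code: the function names are hypothetical, and random vectors stand in for the mini-batch gradients.

```python
import numpy as np

def bias_corrected_momentum(grads, beta):
    """Run m_t = beta*m_{t-1} + (1-beta)*g_t and return m_t / (1 - beta**t)."""
    m = np.zeros_like(grads[0])
    for g in grads:
        m = beta * m + (1.0 - beta) * g
    t = len(grads)
    return m / (1.0 - beta ** t)

def expectation_under_p_beta(grads, beta):
    """Average the same gradients under P_beta(p | t) = beta**p / sum_q beta**q."""
    t = len(grads)
    p = np.arange(t)                        # p = 0 is the most recent gradient
    weights = beta ** p / np.sum(beta ** p)
    recent_first = np.stack(grads[::-1])    # row p holds the gradient from p steps ago
    return weights @ recent_first

rng = np.random.default_rng(0)
grads = [rng.normal(size=3) for _ in range(10)]   # stand-ins for mini-batch gradients
m_hat = bias_corrected_momentum(grads, beta=0.9)
expected = expectation_under_p_beta(grads, beta=0.9)
print(np.allclose(m_hat, expected))
```

The two routines agree because the normalizer $\sum_{p=0}^{t-1}\beta^{p}$ equals $(1-\beta^{t})/(1-\beta)$, which is what the bias-correction term divides out.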

With this analogy in place, we approximate the gradient of the variance of the mini-batch loss function ("variance-gradient") under the exponentially weighted probability distribution defined above to derive an update rule similar to momentum.

4 MomentumUCB: Biasing momentum along low variance regions of the loss landscape

Consider the loss objective that accounts for the variance of the loss function to bias the updates along the regions of the landscape that conform across the mini-batches, up to an extent determined by the "confidence hyperparameter" $\eta$, as follows:

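$$\mathcal{L}^{UCB}(\boldsymbol{\theta})=\mathbb{E}_{i}\left[\mathcal{L}^{(i)}(\boldsymbol{\theta})\right]+\eta\cdot\sqrt{\text{Var}_{i}\left[\mathcal{L}^{(i)}(\boldsymbol{\theta})\right]}, \tag{6}$$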

where the subscript $i$ corresponds to a probability distribution over the mini-batches and $\eta>0$. The above loss objective is similar in form to the Upper Confidence Bound (UCB) acquisition function in Bayesian optimization that is maximized to balance exploration with exploitation.

The additional variance-minimization objective encourages generalization by providing an incentive for performing equally well on individual i.i.d. subsets of the training data (defined by the mini-batches) separately to keep the variance of the loss lower.

Figure (1) illustrates the resultant optimization landscape in 1D, as an example, for the cases of $\eta>0$ (pessimism) and $\eta<0$ (optimism) in the face of uncertainty.

Considering the gradient of $\mathcal{L}^{UCB}({\boldsymbol{\theta}})$ in Eq. (6), one has:

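$$\begin{aligned}\nabla_{\boldsymbol{\theta}}\mathcal{L}^{UCB}(\boldsymbol{\theta})&=\mathbb{E}_{i}\left[\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(i)}(\boldsymbol{\theta})\right]+\frac{\eta}{\sigma_{l}}\cdot\left\{\mathbb{E}_{i}\left[\mathcal{L}^{(i)}(\boldsymbol{\theta})\cdot\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(i)}(\boldsymbol{\theta})\right]-\mu_{l}\cdot\mathbb{E}_{i}\left[\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(i)}(\boldsymbol{\theta})\right]\right\}\\&=\mathbb{E}_{i}\left[\left(1+\eta\,\frac{\mathcal{L}^{(i)}(\boldsymbol{\theta})-\mu_{l}}{\sigma_{l}}\right)\cdot\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(i)}(\boldsymbol{\theta})\right],\end{aligned} \tag{9}$$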

where $\sigma_{l}^{2}=\text{Var}_{i}[\mathcal{L}^{(i)}(\boldsymbol{\theta})]=\mathbb{E}_{i}\left[\left(\mathcal{L}^{(i)}(\boldsymbol{\theta})\right)^{2}\right]-\left(\mathbb{E}_{i}\left[\mathcal{L}^{(i)}(\boldsymbol{\theta})\right]\right)^{2}$, and $\mu_{l}=\mathbb{E}_{i}\left[\mathcal{L}^{(i)}(\boldsymbol{\theta})\right]$.

We infer the update rule for computing $\nabla_{\boldsymbol{\theta}}\mathcal{L}^{UCB}(\boldsymbol{\theta})$ by approximating the expectation in Eq. (9) under the exponentially weighted probability distribution $\mathcal{P}_{\beta}(p|t)$. We refer to the resultant update rule as MomentumUCB to distinguish it from the conventional momentum update.
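The weighted-gradient form in Eq. (9) can be sanity-checked against a finite-difference gradient of the objective in Eq. (6). The sketch below is illustrative only: it assumes toy least-squares losses $\mathcal{L}^{(i)}(\boldsymbol{\theta})=\tfrac{1}{2}\|A_{i}\boldsymbol{\theta}-b_{i}\|^{2}$ on $M$ synthetic mini-batches evaluated at the same $\boldsymbol{\theta}$ (the exact expectation, not the momentum approximation), and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
M, d = 8, 3                                   # M toy mini-batches, d parameters
A = rng.normal(size=(M, 4, d))
b = rng.normal(size=(M, 4))
eta = 0.5                                     # confidence hyperparameter

def losses(theta):
    """Per-mini-batch losses L_i(theta) = 0.5 * ||A_i theta - b_i||^2."""
    r = A @ theta - b                         # residuals, shape (M, 4)
    return 0.5 * np.sum(r ** 2, axis=1)

def grads(theta):
    """Per-mini-batch gradients A_i^T (A_i theta - b_i), shape (M, d)."""
    r = A @ theta - b
    return np.einsum("mkd,mk->md", A, r)

def ucb_objective(theta):
    L = losses(theta)
    return L.mean() + eta * L.std()           # population std (ddof=0)

def ucb_gradient(theta):
    """Mean over mini-batches of (1 + eta * (L_i - mu) / sigma) * grad L_i."""
    L, g = losses(theta), grads(theta)
    w = 1.0 + eta * (L - L.mean()) / L.std()  # per-mini-batch weight
    return (w[:, None] * g).mean(axis=0)

def finite_difference(f, theta, h=1e-6):
    out = np.zeros_like(theta)
    for k in range(theta.size):
        e = np.zeros_like(theta)
        e[k] = h
        out[k] = (f(theta + e) - f(theta - e)) / (2 * h)
    return out

theta = rng.normal(size=d)
analytic = ucb_gradient(theta)
numeric = finite_difference(ucb_objective, theta)
print(np.allclose(analytic, numeric, rtol=1e-4, atol=1e-5))
```

Mini-batches whose loss is above the mean receive a weight greater than one when $\eta>0$, so the update leans toward reducing the losses that deviate most.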

AdamUCB: Adam with MomentumUCB

Considering the stochastic gradient at time step $t$ to be $\boldsymbol{g}_{t}=\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(i)}(\boldsymbol{\theta}_{t-1})$, one may write the traditional Adam update concisely as $\delta\boldsymbol{\theta}_{\text{Adam}}=\frac{\langle\boldsymbol{g}_{t}\rangle_{\beta_{1}}}{\sqrt{\langle\boldsymbol{g}_{t}^{2}\rangle_{\beta_{2}}}}$, where the $\beta$s are the time scales, and $\langle\cdot\rangle_{\beta}$ denotes $\mathbb{E}_{\mathcal{P}_{\beta}}[\cdot]$ under the exponentially decaying probability distribution.

We introduce AdamUCB in Algorithm (2) that implements MomentumUCB (Eq. (9)) in the parameter update rule as follows

[TABLE]

where $l_{t}=\mathcal{L}^{(i)}(\boldsymbol{\theta}_{t-1})$ is the mini-batch loss at the current time step $t$, $\{\mu_{l},\sigma_{l}\}$ are the mean and the standard deviation of mini-batch losses up to time step $t-1$, respectively, and $\eta$ is the confidence hyperparameter.
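To make the ingredients concrete, here is a rough NumPy sketch of an Adam-style step that scales each stochastic gradient by $1+\eta\,(l_{t}-\mu_{l})/\sigma_{l}$, the per-sample weight appearing in Eq. (9). It is not the authors' implementation (see their repository for that). Several choices are assumptions made for illustration: the class name `AdamUCBSketch`, estimating $\mu_{l}$ and $\sigma_{l}$ with bias-corrected exponential moving averages at timescale $\beta_{1}$, and the `sigma_floor` fallback to a weight of one when the loss spread is too small to divide by safely.

```python
import numpy as np

class AdamUCBSketch:
    """Adam-style step with loss-weighted gradients (illustrative, not the authors' code)."""

    def __init__(self, dim, lr=1e-3, beta1=0.9, beta2=0.999, eta=0.1,
                 eps=1e-8, sigma_floor=1e-6):
        self.lr, self.beta1, self.beta2 = lr, beta1, beta2
        self.eta, self.eps, self.sigma_floor = eta, eps, sigma_floor
        self.m = np.zeros(dim)   # EMA of weighted gradients
        self.v = np.zeros(dim)   # EMA of squared weighted gradients
        self.l1 = 0.0            # EMA of mini-batch losses (assumed estimator of mu_l)
        self.l2 = 0.0            # EMA of squared mini-batch losses
        self.t = 0               # number of steps taken so far

    def _weight(self, loss):
        """Return 1 + eta * (l_t - mu_l) / sigma_l using losses seen up to step t-1."""
        if self.t == 0:
            return 1.0                                  # no loss history yet
        corr = 1.0 - self.beta1 ** self.t               # bias correction for the loss EMAs
        mu = self.l1 / corr
        sigma = np.sqrt(max(self.l2 / corr - mu ** 2, 0.0))
        if sigma <= self.sigma_floor * max(abs(mu), 1.0):
            return 1.0                                  # spread too small to divide by
        return 1.0 + self.eta * (loss - mu) / sigma

    def step(self, theta, loss, grad):
        g = self._weight(loss) * grad
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * g
        self.v = self.beta2 * self.v + (1 - self.beta2) * g ** 2
        self.l1 = self.beta1 * self.l1 + (1 - self.beta1) * loss
        self.l2 = self.beta1 * self.l2 + (1 - self.beta1) * loss ** 2
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return theta - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

# Usage on a toy objective L(theta) = ||theta||^2 with gradient 2 * theta.
opt = AdamUCBSketch(dim=2, eta=0.3)
theta = np.array([1.0, -2.0])
for _ in range(20):
    theta = opt.step(theta, float(theta @ theta), 2 * theta)
```

With `eta=0` the weight is identically one and the step reduces to plain Adam. For large $\eta$ the weight can turn negative when a mini-batch loss falls far below the running mean, which is one reason the floor and the choice of $\eta$ matter in practice.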

4.1 Connections to policy gradient with baseline in reinforcement learning

Consider a typical setting of an RL problem with $r(\tau)$ representing the cumulative reward for a roll out $\tau$. The goal is to maximize the expected return, $R=\mathbb{E}_{P(\tau)}[r(\tau)]$, where $P(\tau)$ represents the probability of the roll out $\tau$ that depends both on the policy and the dynamics of the environment.

The REINFORCE policy gradient update with baseline $b$ can be written as ${{\boldsymbol{{\nabla}}}_{\boldsymbol{\theta}}}\mathbb{E}_{P(\tau)}[r(\tau)]=\mathbb{E}_{P(\tau)}\left[(r(\tau)-b)\cdot{{\boldsymbol{{\nabla}}}_{\boldsymbol{\theta}}}\log P(\tau)\right]$ . A common choice for the baseline is the average return obtained so far, i.e., $b=\mathbb{E}_{P(\tau)}[r(\tau)]$ . Substituting into REINFORCE, we have

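$$\begin{aligned}\nabla_{\boldsymbol{\theta}}\mathbb{E}_{P(\tau)}[r(\tau)]&=\mathbb{E}_{P(\tau)}\left[r(\tau)\cdot\nabla_{\boldsymbol{\theta}}\log P(\tau)\right]-\mathbb{E}_{P(\tau)}[r(\tau)]\cdot\mathbb{E}_{P(\tau)}\left[\nabla_{\boldsymbol{\theta}}\log P(\tau)\right]\\&=\text{Cov}_{P(\tau)}\left[r(\tau),\nabla_{\boldsymbol{\theta}}\log P(\tau)\right].\end{aligned} \tag{13}$$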
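The reason a baseline can be subtracted without changing the expected gradient is that the score function has zero mean. A short worked step, under the usual assumption that the gradient and the sum (or integral) over roll outs can be exchanged:

```latex
\mathbb{E}_{P(\tau)}\left[\nabla_{\boldsymbol{\theta}}\log P(\tau)\right]
  = \sum_{\tau} P(\tau)\,\frac{\nabla_{\boldsymbol{\theta}}P(\tau)}{P(\tau)}
  = \nabla_{\boldsymbol{\theta}}\sum_{\tau}P(\tau)
  = \nabla_{\boldsymbol{\theta}}\,1 = 0,
\qquad\text{so}\qquad
\mathbb{E}_{P(\tau)}\left[(r(\tau)-b)\,\nabla_{\boldsymbol{\theta}}\log P(\tau)\right]
  = \mathbb{E}_{P(\tau)}\left[r(\tau)\,\nabla_{\boldsymbol{\theta}}\log P(\tau)\right]
\quad\text{for any constant } b.
```

With $b=\mathbb{E}_{P(\tau)}[r(\tau)]$, the subtracted term is exactly the product of means, which is why the result can be read as a covariance.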

Consider the MomentumUCB gradient ${{\boldsymbol{{\nabla}}}_{\boldsymbol{\theta}}}\mathcal{L}^{UCB}({\boldsymbol{\theta}})$ from Eq. (9) as follows:

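$$\mathbb{E}_{i}\left[\left(1+\eta\,\frac{\mathcal{L}^{(i)}(\boldsymbol{\theta})-\mu_{l}}{\sigma_{l}}\right)\cdot\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(i)}(\boldsymbol{\theta})\right]=\underbrace{\mathbb{E}_{i}\left[\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(i)}(\boldsymbol{\theta})\right]}_{\text{conventional momentum}}+\frac{\eta}{\sigma_{l}}\cdot\underbrace{\mathbb{E}_{i}\left[\left(\mathcal{L}^{(i)}(\boldsymbol{\theta})-\mu_{l}\right)\cdot\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(i)}(\boldsymbol{\theta})\right]}_{\text{variance-gradient}}. \tag{14}$$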

Simplifying the variance-gradient term, we have

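$$\mathbb{E}_{i}\left[\left(\mathcal{L}^{(i)}(\boldsymbol{\theta})-\mu_{l}\right)\cdot\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(i)}(\boldsymbol{\theta})\right]=\text{Cov}_{i}\left[\mathcal{L}^{(i)}(\boldsymbol{\theta}),\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(i)}(\boldsymbol{\theta})\right]. \tag{15}$$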

Therefore, when $r(\tau)$ and $\log P(\tau)$ in Eq. (13) are chosen to be the negative cross-entropy loss $-L_{CE}=\log P(\text{target}|\text{data})$ in a supervised learning setting, for instance, and the covariance is instead computed across the mini-batches, then the policy gradient term in Eq. (13) reduces to the variance-gradient in Eq. (15).

5 Exploiting pessimism and optimism in the face of uncertainty

In the previous section, we introduced MomentumUCB justifying the case for $\eta>0$ as taking a pessimistic view of the loss surface that biases updates along regions that conform across mini-batches. An undesirable effect of such a variance-minimization objective is that the parameters might eventually land on a plateau of the loss surface, preventing further progress and slowing down training. In this section we propose incorporating the best of both the cases of $\eta>0$ (pessimism in the face of uncertainty) and $\eta<0$ (optimism in the face of uncertainty) where the updates alternate between maximizing and minimizing the variance objective based on a criterion.

5.1 Bounding the relative standard deviation by $\eta$ on both sides

We propose a simple modification to the MomentumUCB term in Eq. (9) by replacing $\eta$ with the difference of the current relative standard deviation (defined by $\frac{\sigma_{l}}{|\mu_{l}|}$) and the required relative standard deviation (specified by a hyperparameter $\eta$ that is different from the $\eta$ in MomentumUCB) that must be maintained throughout the optimization. The intuition behind this criterion is that the variance of the loss landscape should not be too low (to avoid undesirable plateaus of the surface) or too high (since a "good" minimum likely performs comparably well across the i.i.d. mini-batches).

Replacing $\eta\rightarrow-\left(\eta-\frac{\sigma_{l}}{|\mu_{l}|}\right)$ , we have the following version of Momentum that we call MomentumCB, or Confidence-bounded Momentum, since the relative standard deviation is bounded on both sides (a two-sided bound):

[TABLE]

When the current relative std. dev. is greater than the specified hyperparameter $\eta$ , the term $\left[-\left(\eta-\frac{\sigma_{l}}{|\mu_{l}|}\right)\right]$ takes a positive sign (with a magnitude proportional to the violation) and updates along directions that reduce the variance (in addition to the usual momentum gradient), and, hence, takes a pessimistic view of the loss surface. Similarly, when the current relative std. dev. is lower than the hyperparameter $\eta$ , the updates get biased along directions that increase the variance.
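The sign behaviour described above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes $\mu_{l}$ and $\sigma_{l}$ are estimated as the plain (unweighted) mean and population standard deviation over a short window of mini-batch losses, whereas the paper computes them under the exponential probability distribution. The function name `momentum_cb_coefficient` and the loss values are hypothetical and chosen only for illustration.

```python
import statistics


def momentum_cb_coefficient(batch_losses, eta):
    """Return -(eta - sigma_l / |mu_l|) for a window of mini-batch losses.

    Assumption: mu_l and sigma_l are the unweighted mean and population
    standard deviation of the window (the paper uses exponential averages).
    """
    mu = statistics.fmean(batch_losses)
    if mu == 0:
        raise ValueError("relative standard deviation is undefined for mu_l = 0")
    sigma = statistics.pstdev(batch_losses)
    relative_std = sigma / abs(mu)
    return -(eta - relative_std)


eta = 0.1  # required relative standard deviation (illustrative value)

# Losses that disagree strongly across mini-batches: relative std > eta.
noisy = [0.9, 1.4, 0.6, 1.1]
# Losses that barely differ across mini-batches: relative std < eta.
flat = [1.00, 1.01, 0.99, 1.00]

print(momentum_cb_coefficient(noisy, eta))  # positive: reduce the variance
print(momentum_cb_coefficient(flat, eta))   # negative: increase the variance
```

A positive coefficient corresponds to the pessimistic regime (the update is biased toward lower variance) and a negative coefficient to the optimistic regime, with the magnitude growing with the size of the violation.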

AdamCB: Adam with MomentumCB

We incorporate MomentumCB into Adam under the exponential probability distribution (similar to Algorithm (2)). The parameter update rule for AdamCB is given by

[TABLE]

5.2 The reparametrization trick and stochastic momentum

In this section we introduce a stochastic regularizer based on the variance-gradient that utilizes both the directions of minimizing and maximizing the variance to randomly "explore" in the parameter space.

By "exploration", especially in the context of non-convex optimization, we refer to choosing a perturbed parameter based on the variance-gradient after a conventional momentum update step. This perturbed parameter acts as the initialization for the subsequent update and allows access to multiple regions of the loss surface over the course of training. Also, unlike dropout (Srivastava et al., 2014), our stochastic regularizer is architecture- and model-agnostic and is therefore compatible with batch normalization (Li et al., 2018).

If $\eta$ in Eq. (6) is promoted to a new random variable $\hat{N}$ such that $\hat{N}\sim\mathcal{N}(0,\eta)$ , where $\mathcal{N}(\cdot)$ is a Gaussian distribution and $\eta$ specifies the variance of the Gaussian (different from $\eta$ in MomentumUCB), then we have the loss function $\mathcal{L}^{UCB}({\boldsymbol{\theta}})$ also promoted to a random variable $\hat{\mathcal{L}}({\boldsymbol{\theta}})$ in $\hat{N}$ such that

[TABLE]

Therefore, instead of fixing $\eta$ in MomentumUCB, if one samples it from a Gaussian distribution centered around zero with a specified standard deviation (given by a new hyperparameter $\eta$ ), then one can exploit both the directions of minimizing and maximizing the variance during the optimization.

By the reparametrization trick (Kingma & Welling, 2013), sampling $\hat{N}\sim\mathcal{N}(0,\eta)$ in the above equation for $\hat{\mathcal{L}}$ is equivalent to sampling a loss function from a Gaussian distribution over mini-batch loss functions, i.e.,

[TABLE]

before computing the gradient. We call the approximation of ${{\boldsymbol{{\nabla}}}_{\boldsymbol{\theta}}}\hat{\mathcal{L}}({\boldsymbol{\theta}})$ under the exponential probability distribution $\mathcal{P}_{\beta}(p|t)$ Stochastic Momentum, or MomentumS, since $\hat{\mathcal{L}}({\boldsymbol{\theta}})$ is a stochastic variable in $\hat{N}$ .

Interestingly, since $\hat{\mathcal{L}}$ is a random variable in $\hat{N}$ , its expectation over the Gaussian $\mathcal{N}(\cdot)$ results in an unbiased estimate of the full-gradient. That is,

[TABLE]

since $\mathbb{E}_{\mathcal{N}}[\hat{N}]=0$ when $\mathcal{N}\equiv\mathcal{N}(0,\eta)$ . Therefore, MomentumS is an unbiased estimate of the full-gradient when $\hat{N}$ is sampled from $\mathcal{N}(0,\eta)$ at each step.
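The unbiasedness argument can be illustrated with a small Monte Carlo check. This is a sketch under stated assumptions rather than the paper's algorithm: it assumes the sampled gradient is linear in $\hat{N}$, i.e. a mean-loss gradient plus $\hat{N}$ times a variance-gradient direction, and it interprets the hyperparameter as the standard deviation of the Gaussian. The vectors `mean_grad` and `var_grad` and both function names are hypothetical placeholders.

```python
import random


def sampled_gradient(mean_grad, var_grad, eta_std, rng):
    """Draw N_hat ~ N(0, eta_std^2) and return mean_grad + N_hat * var_grad.

    Assumption: the stochastic gradient is linear in N_hat, which is all the
    unbiasedness argument requires.
    """
    n_hat = rng.gauss(0.0, eta_std)
    return [g + n_hat * v for g, v in zip(mean_grad, var_grad)]


def average_sampled_gradient(mean_grad, var_grad, eta_std, steps, seed=0):
    """Average the sampled gradient over many independent draws of N_hat."""
    rng = random.Random(seed)
    total = [0.0] * len(mean_grad)
    for _ in range(steps):
        sample = sampled_gradient(mean_grad, var_grad, eta_std, rng)
        total = [t + s for t, s in zip(total, sample)]
    return [t / steps for t in total]


mean_grad = [0.5, -1.0]  # stands in for the full-gradient estimate
var_grad = [2.0, 0.3]    # stands in for the variance-gradient direction

avg = average_sampled_gradient(mean_grad, var_grad, eta_std=0.1, steps=20000)
print(avg)  # close to mean_grad, since the N_hat term averages to zero
```

Each individual draw perturbs the update along the variance-gradient in a random direction, while the average over draws recovers the unperturbed gradient.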

With recent high-capacity deep neural networks such as OpenAI Five (OpenAI, 2018) being trained for several months continuously, "exploration" in the parameter space ensures that the optimization is not wastefully stuck on a plateau or bounded within a local region.

AdamS: Adam with Stochastic Momentum

We incorporate Stochastic Momentum into Adam by sampling $\eta$ in AdamUCB (see Algorithm (2)) from a zero-centered Gaussian whose standard deviation is provided as a hyperparameter.

6 Experiments

We empirically evaluate the three variants of Adam, namely AdamUCB, AdamCB and AdamS, and compare their performance with the original Adam optimizer. We train several neural network architectures, namely logistic regression (see Figure 2), MLPs (see Figure 3), and CNNs (see Figure 4), on the MNIST and CIFAR-10 datasets using a single NVIDIA Tesla P4 GPU. The architectures of the networks are chosen to closely resemble the experiments published by Kingma & Ba (2014).

Since our comparison is only among Adam-like optimizers, we use a fixed learning rate of $\alpha=0.001$ (without any scheduling) and do not search over different values of $\alpha$ . This is because the RMSprop-like denominator in all of our proposed variants keeps the step size roughly the same as in Adam. We also use an $L2$ weight-decay of $10^{-4}$ , a batch size of 128, and keep the values of $\beta_{1}$ and $\beta_{2}$ fixed to the Adam defaults of 0.9 and 0.999, respectively, across all our experiments. The input images are pre-processed by normalizing with the mean and the standard deviation of the pixel values. For each variant of the optimizer, the best hyperparameter $\eta$ found by a grid search is used in the comparison. We present detailed results for additional training configurations (such as different batch sizes) in the appendices.
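To make the remark about the RMSprop-like denominator concrete, the sketch below runs one step of the baseline Adam update (Kingma & Ba, 2014) with the settings stated above. It is an illustration only: the value of `EPS` is not given in this section and is assumed to be the usual Adam default, the $L2$ weight-decay is assumed to be added to the gradient, and the parameter and gradient values are hypothetical.

```python
import math

ALPHA = 0.001        # fixed learning rate
BETA1 = 0.9          # Adam default
BETA2 = 0.999        # Adam default
WEIGHT_DECAY = 1e-4  # L2 weight-decay, assumed to be added to the gradient
EPS = 1e-8           # assumption: not stated in the text


def adam_step(theta, grad, m, v, t):
    """One baseline Adam update on lists of floats; t is the 1-based step."""
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grad, m, v):
        g = g + WEIGHT_DECAY * p                  # L2 penalty gradient
        m_i = BETA1 * m_i + (1 - BETA1) * g       # first-moment average
        v_i = BETA2 * v_i + (1 - BETA2) * g * g   # second-moment average
        m_hat = m_i / (1 - BETA1 ** t)            # bias correction
        v_hat = v_i / (1 - BETA2 ** t)
        new_theta.append(p - ALPHA * m_hat / (math.sqrt(v_hat) + EPS))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v


theta = [1.0, -2.0]
theta1, m, v = adam_step(theta, [0.5, -40.0], [0.0, 0.0], [0.0, 0.0], t=1)

# Both coordinates move by roughly ALPHA despite an 80x gap in gradient size.
print([abs(a - b) for a, b in zip(theta, theta1)])
```

Because the update is normalized by the square root of the second-moment estimate, the per-coordinate step is on the order of $\alpha$ regardless of the gradient scale, which is why a single fixed $\alpha$ is a reasonable choice across Adam-like variants.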

7 Discussion

For the simple case of logistic regression on MNIST, the original Adam algorithm performs the best across the training metrics (Figure 2). Since convex objectives have a unique solution, the advantage of "exploration" along the variance-gradient direction diminishes.

From Figure 3, for the case of MLPs trained on MNIST, AdamS and AdamUCB achieve a lower training and validation loss on average than Adam. For instance, at epochs 20 and 45, AdamS achieves half and one-third of the training error of Adam, respectively.

For CNNs trained on CIFAR-10 (Figure 4), AdamUCB, AdamCB and AdamS perform significantly better than the original Adam optimizer when no dropout is used ( $\approx$ 6% improvement in val. acc. at epoch = 3). Our variants of Adam improve both the validation loss and the rate of convergence of training. We notice that AdamCB and AdamS consistently perform better than AdamUCB in this case, which suggests that exploiting both directions of the variance-gradient indeed helps. For the case of CNNs with dropout, AdamCB and AdamS still outperform Adam but with a reduced margin of improvement ( $\approx$ 2% improvement in val. acc. at epoch = 3).

Figure 5 compares the effect of adding dropout to Adam and AdamS for the case of CNNs trained on the CIFAR-10 dataset. The stochastic regularization implemented by AdamS leads to faster training convergence and better validation error when compared to dropout regularization.

8 Conclusion and future work

In this paper, we introduced novel ways of incorporating the variance information of the loss landscape across mini-batches for stochastic optimization. Based on our experiments with CIFAR-10, we recommend AdamS for optimizing general non-convex objectives (a good default for $\eta$ is $0.0001$ ).

Our work opens up directions for incorporating existing research on the exploration–exploitation trade-off in Bayesian optimization into gradient-based stochastic optimization algorithms. Interesting directions for future research include exploiting other acquisition functions, such as Probability of Improvement (PI) and Expected Improvement (EI), to design loss objectives that efficiently utilize the uncertainty information of the loss landscape to accelerate training.

Investigating SGD with variations of momentum proposed in this paper could also prove to be interesting as our formulation naturally gives a schedule for the learning rate through the variance of the loss. Finally, analyzing the effect of decay schedules for $\eta$ , AdaMax-like modifications to the proposed variants of Adam, and incorporating Nesterov’s momentum-like update rule (Nesterov, 1983; Dozat, 2016) would be other interesting directions to pursue.

Contributions

V.S.B. contributed to the theory, derivations, and the experiments on the CIFAR-10 dataset using CNN architectures. S.D. verified the derivations and contributed to the experiments on MNIST using Logistic Regression and MLPs.

Acknowledgments

We gratefully acknowledge the supervision of our project by Prof. Jimmy Ba and Prof. Roger Grosse at the University of Toronto. We also acknowledge the Vector Institute Scholarship in Artificial Intelligence (VSAI) for supporting our graduate studies towards the MSc in Applied Computing (MScAC).

Appendix A Additional details on our experiments and results

We report the mean and the standard deviation for various performance metrics such as loss and accuracy across three random runs of our experiments. The hyperparameter $\eta$ is tuned for each variant of the optimizer, and the best values found are listed in Table 1.

A.1 Experiment: Logistic Regression

Table 2 summarizes the training and validation scores at different stages of the training under the four optimizers (Adam, AdamUCB, AdamCB and AdamS) for the best values of $\eta$ given in Table 1.

A.2 Experiment: Multi-layer Neural Networks

We additionally experiment by adding a dropout noise layer ( $p_{drop}=0.5$ ) over the output activations of the first hidden layer, and study the effect of different batch sizes.

Table 3 and Figure 6 summarize the training and validation performance for batch sizes 16 and 128 at different stages of the training, for the best values of $\eta$ given in Table 1.

A.3 Experiment: Convolutional Neural Networks

We include results of our experiments with batch size = 16. Figure 7 shows the improvement in the rate of convergence for the case of batch size = 16. Figure 8 summarizes the results of our experiments without dropout on a mini-batch size of $128$ across different $\eta$ . Table 4 summarizes the training and validation performance for batch sizes 16 and 128 at different stages of the training, for the best values of $\eta$ given in Table 1.

Appendix B Discussion

For the case of MLPs with dropout, AdamS outperforms Adam but with a reduced margin of improvement (see Table 3) when compared to the case without dropout. When a batch size of 16 is used instead, AdamS and Adam perform fairly similarly.

For the case of CNNs trained on CIFAR-10, Figure 8 shows that AdamUCB, AdamCB and AdamS perform significantly better than the original Adam optimizer for appropriate values of $\eta$ . Figure 7 shows how the proposed versions of Adam accelerate training for batch size = 16. We also notice that the improvement in performance for lower mini-batch sizes is more significant for CIFAR-10.

Table 4 shows how our variants of Adam accelerate training on the CIFAR-10 dataset, especially in the initial few epochs, for both mini-batch sizes of 16 and 128. When dropout is used, the improvement in validation accuracy reduces by some margin.

Bibliography

Child et al. (2019) Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv e-prints, art. arXiv:1904.10509, Apr 2019.
Defazio et al. (2014) Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pp. 1646–1654, 2014.
Dozat (2016) Timothy Dozat. Incorporating Nesterov momentum into Adam. 2016.
Goh (2017) Gabriel Goh. Why momentum really works. Distill, 2(4):e6, 2017.
He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
Johnson & Zhang (2013) Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp. 315–323, 2013.
Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Kingma & Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
