Adam Induces Implicit Weight Sparsity in Rectifier Neural Networks

Atsushi Yaguchi; Taiji Suzuki; Wataru Asano; Shuhei Nitta; Yukinobu Sakata; Akiyuki Tanizawa

arXiv:1812.08119·cs.LG·December 20, 2018


TL;DR

This paper reveals that training deep neural networks with ReLU, L2 regularization, and Adam optimizer naturally induces weight sparsity, enabling effective model reduction without additional regularizers.

Contribution

The study identifies conditions under which Adam training induces implicit weight sparsity and proposes a simple method to reduce model size by removing zero weights.

Findings

01

Adam training induces group sparsity in weights.

02

The proposed reduction method maintains performance.

03

Effective size reduction demonstrated on MNIST and CIFAR-10.


Tables (6)

TABLE I: Baseline setup of experiments on MNIST dataset.

Preprocess	divide each pixel value by 255
Data augmentation	No
Batch size / #epochs	64 / 100
Learning rate schedule	multiplied by 0.5 every 25 epochs
$L_{2}$ -norm threshold	$ξ = 1.0 \times 10^{- 15}$
$L_{2}$ regularization	Yes ( $λ = 5.0 \times 10^{- 4}$ )
# hidden layers	1
# nodes per layer	1000
Activation function	ReLU
Batch normalization	Yes (before activation)
Initializer	Xavier [27] for weights, 0 for biases
Optimizer	Adam ( $α = 0.001, β_{1} = 0.9, β_{2} = 0.999, ϵ = 10^{- 8}$ )

TABLE II: Comparisons with different optimizers, activation functions, and other components on MNIST dataset.

Optimizer	Adam (baseline)	mSGD	Adagrad	RMSprop
Acc. [%]	98.43	98.35	98.31	98.31
Sparsity [%]	70.00	0.00	0.00	65.20
Activation	ReLU (baseline)	$\tanh$	ELU	$\mathrm{sigmoid}$
Acc. [%]	98.43	98.39	98.27	98.42
Sparsity [%]	70.00	1.70	77.00	0.00
Others	baseline	He init.	w/o BN	w/o $L_{2}$ reg.
Acc. [%]	98.43	98.42	98.26	98.45
Sparsity [%]	70.00	68.70	82.10	0.00

TABLE III: Comparisons with different numbers of hidden nodes on MNIST dataset.

# of nodes	10	50	100	500	1000 (baseline)	2000
Acc. [%]	93.92	97.46	98.17	98.48	98.43	98.45
# of remaining weight vectors (nodes)	10	50	100	291	300	288

TABLE IV: Comparisons with different numbers of hidden layers on MNIST dataset.

# of hidden layers	1 (baseline)	2	3	4	5
Acc. [%]	98.43	98.73	98.78	98.88	98.68
# of remaining weight vectors (nodes) in each hidden layer	300	128-227	153-75-186	177-78-76-160	168-87-73-75-160

TABLE V: Comparisons with different optimizers on CIFAR-10 dataset.

Optimizer	Acc. [%]	Reduced [%]
mSGD (as in [31])	92.45	-
mSGD (our implementation)	93.13	0.00
AMSGRAD	92.64	0.00
AdamW	91.97	0.00
Adam	92.61	53.23

TABLE VI: Comparison with another model reduction method on MNIST dataset.

Method	Error [%]	Reduced [%]	#Neurons
Baseline (cited from [11])	1.43	-	784-500-300-10
Wen et al. (cited from [11])	1.53	83.5	434-174-78-10
ours	1.45	83.7	784-91-174-10

Equations

$$\min_{{\bf\Theta}}\ \frac{1}{N}\sum_{n=1}^{N}{\mathcal{L}}\left({\bf u}_{n},{\bf v}_{n},{\bf\Theta}\right)+{\mathcal{R}}\left({\bf\Theta}\right),$$

$${\mathcal{R}}\left({\bf\Theta}\right)=\frac{\lambda}{2}\sum_{l=2}^{L}\left\|{\bf W}^{\left(l\right)}\right\|_{F}^{2}\quad\left(\lambda>0\right),$$

$$\left\{\begin{array}{ll}m_{t}=\beta_{1}m_{t-1}+\left(1-\beta_{1}\right)g_{t}&\left(0<\beta_{1}<1\right)\\ v_{t}=\beta_{2}v_{t-1}+\left(1-\beta_{2}\right)g_{t}^{2}&\left(0<\beta_{2}<1\right)\\ \theta_{t}=\theta_{t-1}-\alpha\frac{m_{t}/\left(1-\beta_{1}^{t}\right)}{\sqrt{v_{t}/\left(1-\beta_{2}^{t}\right)+\epsilon}}&\left(\epsilon>0\right)\end{array}\right.,$$

$${\bf g}_{k}^{\left(l\right)}=\frac{1}{M}\sum_{i=1}^{M}\frac{\partial{\mathcal{L}}\left({\bf u}_{i},{\bf v}_{i},{\bf\Theta}\right)}{\partial x_{ik}^{\left(l\right)}}\frac{\partial x_{ik}^{\left(l\right)}}{\partial{\bf w}_{k}^{\left(l\right)}}+\lambda{\bf w}_{k}^{\left(l\right)}.$$

$$\frac{\partial x_{ik}^{\left(l\right)}}{\partial{\bf w}_{k}^{\left(l\right)}}=\left\{\begin{array}{ll}{\bf x}_{i}^{\left(l-1\right)}&({{\bf w}_{k}^{\left(l\right)}}^{\top}{\bf x}_{i}^{\left(l-1\right)}+b_{k}^{\left(l\right)}>0)\\ {\bf 0}&({{\bf w}_{k}^{\left(l\right)}}^{\top}{\bf x}_{i}^{\left(l-1\right)}+b_{k}^{\left(l\right)}\leq 0)\end{array}\right.,$$

$${\bf g}_{k}^{\left(l\right)}=\frac{1}{M}\sum_{j:\,{{\bf w}_{k}^{\left(l\right)}}^{\top}{\bf x}_{j}^{\left(l-1\right)}+b_{k}^{\left(l\right)}>0}\frac{\partial{\mathcal{L}}\left({\bf u}_{j},{\bf v}_{j},{\bf\Theta}\right)}{\partial x_{jk}^{\left(l\right)}}{\bf x}_{j}^{\left(l-1\right)}+\lambda{\bf w}_{k}^{\left(l\right)}.$$

$$\left|w_{t}-w_{t-1}\right|=\left|-\alpha\frac{1-\beta_{1}}{\sqrt{1-\beta_{2}}}\frac{\sum_{i=0}^{t-1}\beta_{1}^{t-1-i}w_{i}}{\sqrt{\sum_{i=0}^{t-1}\beta_{2}^{t-1-i}w_{i}^{2}+\epsilon}}\right|.$$

$$\left|w_{t+1}-w_{t}\right|=\cdots=\cdots\ \xrightarrow{\,t\to\infty\,}\ \cdots$$

$$\eta(x)=\left\{\begin{array}{ll}x&\left(x>0\right)\\ \rho x&\left(x\leq 0\right)\end{array}\right.,$$

$$\left\{\begin{array}{ll}v_{t+1}=\mu v_{t}-\alpha\lambda w_{t}&\left(0<\mu<1\right)\\ w_{t+1}=w_{t}+v_{t+1}\end{array}\right..$$

$$\mu^{2}+2\left(1-\alpha\lambda-2\right)\mu+\left(1-\alpha\lambda\right)^{2}\geq 0$$

$$\begin{aligned}w_{t}&=\left\{\left(1-\gamma\right)\beta^{t}\frac{1-\left(\gamma/\beta\right)^{t}}{1-\gamma/\beta}+\gamma^{t}\right\}w_{0}\\ &=\left\{\frac{1-\gamma}{\beta-\gamma}\beta^{t+1}\left[1-\left(\gamma/\beta\right)^{t}\right]+\gamma^{t}\right\}w_{0}.\end{aligned}$$

$$\begin{aligned}\frac{1-\gamma}{\beta-\gamma}\beta^{t+1}\left[1-\left(\gamma/\beta\right)^{t}\right]+\gamma^{t}&\leq\left[\frac{q+\sqrt{q^{2}-4}}{2\sqrt{q^{2}-4}}+1\right]\beta^{t}\\ &=\left[\frac{q+\sqrt{q^{2}-4}}{2\sqrt{q^{2}-4}}+1\right]\left(1-\frac{q-\sqrt{q^{2}-4}}{2}\sqrt{\alpha\lambda}\right)^{t}\\ &=O\left(\left(1-\frac{q-\sqrt{q^{2}-4}}{2}\sqrt{\alpha\lambda}\right)^{t}\right).\end{aligned}$$

$$w_{t}=w_{t-1}-\nu\frac{m_{t}}{\sqrt{v_{t}+\epsilon}},$$

$$\begin{aligned}\nu\frac{w_{t-1}+\beta_{1}w_{t-2}+\dots+\beta_{1}^{t}w_{0}}{\sqrt{v_{t}+\epsilon}}&\leq\nu\frac{\left|w_{t-1}\right|+\beta_{1}\left|w_{t-2}\right|+\dots+\beta_{1}^{t}\left|w_{0}\right|}{\sqrt{v_{t}+\epsilon}}\\ &\leq\nu\frac{\sqrt{w_{t-1}^{2}+\beta_{1}w_{t-2}^{2}+\dots+\beta_{1}^{t}w_{0}^{2}}}{\sqrt{v_{t}+\epsilon}}\sqrt{1+\beta_{1}+\dots+\beta_{1}^{t}}\\ &\leq\nu\frac{\sqrt{v_{t}}}{\sqrt{v_{t}+\epsilon}}\frac{1}{\sqrt{1-\beta_{1}}}\leq\frac{\alpha}{\sqrt{1-\beta_{2}}}=:\nu^{\prime}.\end{aligned}$$

$$v_{\tau+1}\leq w_{0}^{2}\left(1+\beta_{2}+\beta_{2}^{2}+\dots+\beta_{2}^{\tau}\right)\leq\frac{w_{0}^{2}}{1-\beta_{2}}.$$

$$\begin{aligned}v_{\tau+1}&\leq\left(\xi^{\tau}+\beta_{2}\xi^{\tau-1}+\dots+\beta_{2}^{\tau}\right)w_{0}^{2}\\ &\leq\left(\tau+1\right)\max\left\{\xi,\beta_{2}\right\}^{\tau}w_{0}^{2}.\end{aligned}$$

$$v_{\tau+1}\leq\left(1/2\right)^{k}w_{0}^{2}.$$

$$w_{\tau+1}\leq w_{\tau}\left(1-\frac{\nu}{\sqrt{\left(1/2\right)^{k}w_{0}^{2}+\epsilon}}\right).$$

$$\begin{aligned}w_{\tau+1}&\leq\exp\left(-\tau^{*}\sum_{\kappa=1}^{k-1}\frac{\nu}{w_{0}}2^{\left(\kappa-1\right)/2}\right)w_{0}\\ &=\exp\left(-\tau^{*}\frac{\nu}{w_{0}}\left(2^{k-1}-1\right)\right)w_{0}\\ &\leq\exp\left(-\tau^{*}\frac{\nu}{w_{0}}\left(2^{\left(\tau/\tau^{*}\right)/2-1}-1\right)\right)w_{0},\end{aligned}$$

$$\tau\geq\left(\frac{\tau^{*}}{\log(2)}+1\right)\log\left[w_{0}\frac{\log\left(w_{0}/\delta\right)}{\nu\tau^{*}}+1\right]=\Omega\left(\log\log\left(1/\delta\right)\right).$$

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques

Full text

Adam Induces Implicit Weight Sparsity in Rectifier Neural Networks

Atsushi Yaguchi (1), Taiji Suzuki (2, 3), Wataru Asano (1), Shuhei Nitta (1), Yukinobu Sakata (1), Akiyuki Tanizawa (1)

(1) Corporate Research and Development Center, Toshiba Corporation, Kawasaki, Japan

(2) Graduate School of Information Science and Technology, The University of Tokyo, Japan

(3) Center for Advanced Integrated Intelligence Research, RIKEN, Tokyo, Japan

Email (1): {atsushi.yaguchi, wataru.asano, shuhei.nitta, yuki.sakata, akiyuki.tanizawa}@toshiba.co.jp

Abstract

In recent years, deep neural networks (DNNs) have been applied to various machine learning tasks, including image recognition, speech recognition, and machine translation. However, large DNN models are needed to achieve state-of-the-art performance, exceeding the capabilities of edge devices. Model reduction is thus needed for practical use. In this paper, we point out that deep learning automatically induces group sparsity of weights, in which all weights connected to an output channel (node) are zero, when training DNNs under the following three conditions: (1) rectified-linear-unit (ReLU) activations, (2) an $L_{2}$ -regularized objective function, and (3) the Adam optimizer. Next, we analyze this behavior both theoretically and experimentally, and propose a simple model reduction method: eliminate the zero weights after training the DNN. In experiments on MNIST and CIFAR-10 datasets, we demonstrate the sparsity with various training setups. Finally, we show that our method can efficiently reduce the model size and performs well relative to methods that use a sparsity-inducing regularizer.

Index Terms:

deep neural networks, model reduction, group sparse, Adam

I Introduction

Recently, deep learning has been successfully applied in various machine learning tasks [1]. For example, it surpassed human performance at an image recognition task on the ImageNet dataset [2]. These successes are supported by the development of activation functions such as ReLU [3], regularization and normalization methods such as dropout [4] and batch normalization (BN) [5], and network architectures such as ResNet [6]. Recent successes are also enabled by the growth of training datasets and increases in computing power. Because wider or deeper network models have become necessary to achieve high performance, hardware with limited memory and computational power is unable to match this performance. Reducing model size is necessary if applications such as video recognition for automated driving systems and surveillance systems are to run on edge devices, but reducing the model size tends to lower accuracy.

Various methods have been proposed for model reduction [7, 8, 9, 10, 11, 12, 13]. Shi et al. [7] and Han et al. [8] reduced the amount of computation and data without changing the original network architecture. Shi et al. [7] did this by exploiting the sparsity of ReLU activations to reduce the computational load of convolution. Han et al. [8], in contrast, compressed the amount of data by quantizing weight coefficients. Other model reduction methods that slim the network architecture have been proposed [9, 10, 11, 12]. After training redundant models, less-important channels or nodes are pruned based on the $L_{1}$ norm [9] or statistics of the activations [10]. These methods require training redundant models before pruning. To address this, Wen et al. [11], Scardapane et al. [12], and Yoon et al. [13] proposed methods to directly train compact models. Their methods regularize the weights so that only some weights have nonzero coefficients, and eliminate zero weights after training to produce a sparse model. They reported that group sparse regularization, which induces all weights connected to an output channel (node) to be zero, greatly reduces the size of models because it can directly eliminate channels or nodes.

In this paper, we first point out that group sparsity of weights emerges during the training of DNNs under the following three conditions: (1) ReLU activations, (2) an $L_{2}$ -regularized objective function, and (3) the Adam optimizer [14]. As shown in Fig. 1, though $L_{2}$ regularization of weights does not intrinsically induce sparsity, some weight vectors converge toward the origin (i.e., zero) if these three conditions are met. Interestingly, this behavior does not appear with the momentum-SGD (mSGD) optimizer [15], as shown in Fig. 2. Next, we analyze this behavior both theoretically and experimentally, and propose a simple model reduction method: eliminate the zero weights after training the DNN. Our method is easy to optimize since it does not rely on sparsity-inducing (non-smooth) regularizers such as the $L_{2,1}$ norm, and it can efficiently control the model size by calibrating the $L_{2}$ regularization parameter. In experiments, we demonstrate the sparsity on the MNIST and CIFAR-10 datasets with various training setups, and show that our method can effectively reduce the model size relative to methods that use a sparsity-inducing regularizer.

In the following, we first give our problem setting and propose a method for model reduction (Section II), then describe its theoretical analysis (Section III). After we briefly review related works (Section IV), we give some experimental results (Section V) and conclude the paper (Section VI).

II Proposed Method

II-A Problem Setting

We consider an $L$ -layer network with parameters ${\bf\Theta}=\left\{{\bf W}^{\left(l\right)},{\bf b}^{\left(l\right)}\right\}_{l=2}^{L}$ , where ${\bf W}^{\left(l\right)}=\left({\bf w}_{1}^{\left(l\right)},{\bf w}_{2}^{\left(l\right)},\dots,{\bf w}_{C^{\left(l\right)}}^{\left(l\right)}\right)\in\mathbb{R}^{W^{\left(l-1\right)}H^{\left(l-1\right)}C^{\left(l-1\right)}\times C^{\left(l\right)}}$ is the weight matrix between layer $l-1$ and layer $l$ whose columns are the per-channel weight vectors. The corresponding bias is ${\bf b}^{\left(l\right)}=\left(b_{1}^{\left(l\right)},b_{2}^{\left(l\right)},\dots,b_{C^{\left(l\right)}}^{\left(l\right)}\right)^{\top}\in\mathbb{R}^{C^{\left(l\right)}}$ . The kernel width is $W^{\left(l-1\right)}$ , the kernel height is $H^{\left(l-1\right)}$ , and the numbers of input and output channels are $C^{\left(l-1\right)}$ and $C^{\left(l\right)}$ , respectively. Fig. 3 illustrates the weight vectors in fully connected and convolution layers. Note that the number of channels equals the number of nodes in the fully connected layer, which corresponds to the case of $W^{\left(l-1\right)}=H^{\left(l-1\right)}=1$ .

Let $\eta(\cdot)$ be an activation function. An input vector to a layer $l$ is then computed as ${\bf x}^{\left(l\right)}=\eta\left({{\bf W}^{\left(l\right)}}^{\top}{\bf x}^{\left(l-1\right)}+{\bf b}^{\left(l\right)}\right)$ for the fully connected layer. ${\bf x}^{\left(l\right)}$ should be defined at each spatial location for the convolution layer, but we use the notation of the fully connected layer for simplicity. The parameters ${\bf\Theta}$ are learned by the optimization problem

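$$\min_{{\bf\Theta}}\ \frac{1}{N}\sum_{n=1}^{N}{\mathcal{L}}\left({\bf u}_{n},{\bf v}_{n},{\bf\Theta}\right)+{\mathcal{R}}\left({\bf\Theta}\right),$$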

where ${\mathcal{L}}(\cdot)$ and ${\mathcal{R}}(\cdot)$ represent a loss function and a regularization function, respectively. ${\bf u}\in\mathbb{R}^{D_{in}}$ and ${\bf v}\in\mathbb{R}^{D_{out}}$ are an input and a target vector, respectively, for the network, and $N$ is the number of samples in the training set. Instead of using all samples in each timestep of the optimization, we utilize a mini-batch of size $M$ , in which samples are drawn independently and uniformly from the training set.

II-B Model Reduction

To reduce the model size, we utilize group sparsity by pruning weights under a threshold, that is, we prune column vectors ${\bf w}_{k}^{\left(l\right)}\left(k=1,\dots,C^{\left(l\right)}\right)$ in ${\bf W}^{\left(l\right)}$ , satisfying $\|{\bf w}_{k}^{\left(l\right)}\|_{2}<\xi$ for a small positive constant $\xi$ . We induce sparsity by imposing the $L_{2}$ regularization on each weight vector. We therefore define the regularization function as

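$${\mathcal{R}}\left({\bf\Theta}\right)=\frac{\lambda}{2}\sum_{l=2}^{L}\left\|{\bf W}^{\left(l\right)}\right\|_{F}^{2}\quad\left(\lambda>0\right),$$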

where $\lambda$ is a regularization parameter, and $\|{\cdot}\|_{F}$ indicates the Frobenius norm. We use ReLU [3], given as $\eta(x)=\max(x,0)$ , as the activation function for each non-output layer, and optimize the objective function by using Adam [14]. Its update rule for a parameter $\theta$ is given by

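$$\left\{\begin{array}{ll}m_{t}=\beta_{1}m_{t-1}+\left(1-\beta_{1}\right)g_{t}&\left(0<\beta_{1}<1\right)\\ v_{t}=\beta_{2}v_{t-1}+\left(1-\beta_{2}\right)g_{t}^{2}&\left(0<\beta_{2}<1\right)\\ \theta_{t}=\theta_{t-1}-\alpha\frac{m_{t}/\left(1-\beta_{1}^{t}\right)}{\sqrt{v_{t}/\left(1-\beta_{2}^{t}\right)+\epsilon}}&\left(\epsilon>0\right)\end{array}\right.,$$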

where $g_{t}$ is the gradient at timestep $t$ ; $\beta_{1}$ , $\beta_{2}$ , and $\epsilon$ are positive constants.
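As a minimal sketch of this update for a single scalar parameter, the following Python function applies one Adam step with the moving averages $m_{t}$ and $v_{t}$ and their bias corrections. The function name `adam_step` and its argument layout are illustrative assumptions, not part of the paper. The default hyperparameters follow Table I, and $\epsilon$ is placed inside the square root as in the update rule given in this paper, which differs from implementations that add $\epsilon$ after taking the root.

```python
import math


def adam_step(theta, grad, m, v, t,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter.

    t is the 1-based timestep; m and v are the moving averages carried
    over from the previous step (both start at 0.0).
    """
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment moving average
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment moving average
    m_hat = m / (1.0 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)                # bias-corrected second moment
    # eps sits inside the square root, following the rule as written here.
    theta = theta - alpha * m_hat / math.sqrt(v_hat + eps)
    return theta, m, v


# Usage: the first step from theta = 1.0 with gradient 0.5.
theta, m, v = adam_step(1.0, 0.5, 0.0, 0.0, t=1)
```

On the first step the bias correction makes the update magnitude close to $\alpha$ regardless of the gradient scale, as long as $g_{t}^{2}$ is large relative to $\epsilon$.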

The flow of our method is shown in Algorithm 1 as pseudocode. After training the network, the size is reduced by eliminating all weight vectors having $L_{2}$ norm smaller than a threshold $\xi$ . Since the weights converge to near the origin, it is easy to select the threshold. Our method can also effectively reduce the model size by directly eliminating channels or nodes.
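Algorithm 1 is not reproduced in this text. The reduction step it describes might be sketched as follows for one fully connected layer: compute the $L_{2}$ norm of each column vector ${\bf w}_{k}^{(l)}$ and keep only the columns whose norm is at least $\xi$. The function name `prune_nodes`, the nested-list layout, and the choice to drop the matching bias entry and the matching row of the next layer's weight matrix are assumptions made for illustration; the text does not spell out how the bias of a removed node is handled.

```python
import math


def prune_nodes(W, b, W_next, xi=1e-15):
    """Remove output nodes whose weight vector has L2 norm below xi.

    W:      list of rows; W[i][k] connects input i to output node k.
    b:      list of biases, one per output node.
    W_next: next layer's weights; W_next[k] is the row fed by node k.
    Returns the reduced (W, b, W_next) and the indices of kept nodes.
    """
    n_in = len(W)
    keep = []
    for k in range(len(b)):
        norm = math.sqrt(sum(W[i][k] ** 2 for i in range(n_in)))
        if norm >= xi:          # the text prunes vectors with norm < xi
            keep.append(k)
    W_kept = [[row[k] for k in keep] for row in W]
    b_kept = [b[k] for k in keep]
    # Assumption: a removed node's constant output is dropped with it.
    W_next_kept = [W_next[k] for k in keep]
    return W_kept, b_kept, W_next_kept, keep


# Usage: the middle node has an all-zero weight vector and is removed.
W = [[0.5, 0.0, -0.2],
     [0.1, 0.0, 0.3]]
b = [0.1, 0.2, 0.3]
W_next = [[1.0], [2.0], [3.0]]
W_r, b_r, W_next_r, kept = prune_nodes(W, b, W_next)
```

Because whole columns are removed, the layer becomes narrower rather than merely sparse, which is what allows the node or channel itself to be eliminated.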

III Theoretical Analysis

In this section, we mathematically describe the mechanism of the sparsity and analyze convergence under certain conditions.

III-A Preliminaries

The gradient with respect to ${\bf w}_{k}^{\left(l\right)}$ , the $k$ th weight vector in layer $l$ , is given by

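$${\bf g}_{k}^{\left(l\right)}=\frac{1}{M}\sum_{i=1}^{M}\frac{\partial{\mathcal{L}}\left({\bf u}_{i},{\bf v}_{i},{\bf\Theta}\right)}{\partial x_{ik}^{\left(l\right)}}\frac{\partial x_{ik}^{\left(l\right)}}{\partial{\bf w}_{k}^{\left(l\right)}}+\lambda{\bf w}_{k}^{\left(l\right)}.$$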

If the activation function is ReLU,

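$$\frac{\partial x_{ik}^{\left(l\right)}}{\partial{\bf w}_{k}^{\left(l\right)}}=\left\{\begin{array}{ll}{\bf x}_{i}^{\left(l-1\right)}&({{\bf w}_{k}^{\left(l\right)}}^{\top}{\bf x}_{i}^{\left(l-1\right)}+b_{k}^{\left(l\right)}>0)\\ {\bf 0}&({{\bf w}_{k}^{\left(l\right)}}^{\top}{\bf x}_{i}^{\left(l-1\right)}+b_{k}^{\left(l\right)}\leq 0)\end{array}\right.,$$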

then the gradient becomes

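$${\bf g}_{k}^{\left(l\right)}=\frac{1}{M}\sum_{j:\,{{\bf w}_{k}^{\left(l\right)}}^{\top}{\bf x}_{j}^{\left(l-1\right)}+b_{k}^{\left(l\right)}>0}\frac{\partial{\mathcal{L}}\left({\bf u}_{j},{\bf v}_{j},{\bf\Theta}\right)}{\partial x_{jk}^{\left(l\right)}}{\bf x}_{j}^{\left(l-1\right)}+\lambda{\bf w}_{k}^{\left(l\right)}.$$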

Thus, if only a few samples satisfy ${{\bf w}_{k}^{\left(l\right)}}^{\top}{\bf x}_{j}^{\left(l-1\right)}+b_{k}^{\left(l\right)}>0$ (i.e., few samples are activated), the regularization term dominates and ${\bf g}_{k}^{\left(l\right)}\approx\lambda{\bf w}_{k}^{\left(l\right)}$ . If vanilla SGD with step size $\alpha$ is then used as the optimizer, the weight is updated as ${\bf w}_{k}^{\left(l\right)}\leftarrow{\bf w}_{k}^{\left(l\right)}-\alpha\lambda{\bf w}_{k}^{\left(l\right)}=\left(1-\alpha\lambda\right){\bf w}_{k}^{\left(l\right)}$ , which means the weight decays to zero if $0<\alpha\lambda<1$ . From the observation that 50 – 90% of ReLUs are not activated [3, 7], we assume that the gradient of the regularization is dominant for weights connected to such less-activated ReLUs, which induces the convergence to zero. As shown in Fig. 2, however, this behavior does not appear with mSGD. We believe this is due to the difference in convergence rate between the optimizers.
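To give a rough sense of how slow this geometric decay is, consider a scalar weight under the pure regularization update $w\leftarrow\left(1-\alpha\lambda\right)w$ . The values $\alpha=0.1$ , $w_{0}=0.2$ , and $\delta=10^{-15}$ below are illustrative assumptions; only $\lambda=5.0\times 10^{-4}$ is taken from Table I. For $0<\alpha\lambda<1$ and $0<\delta<w_{0}$ ,

$$\left(1-\alpha\lambda\right)^{t}w_{0}\leq\delta\quad\Longleftrightarrow\quad t\geq\frac{\ln\left(w_{0}/\delta\right)}{-\ln\left(1-\alpha\lambda\right)}\approx\frac{\ln\left(w_{0}/\delta\right)}{\alpha\lambda}.$$

With the assumed values, $\alpha\lambda=5\times 10^{-5}$ and $\ln\left(w_{0}/\delta\right)=\ln\left(2\times 10^{14}\right)\approx 32.9$ , so

$$t\gtrsim\frac{32.9}{5\times 10^{-5}}\approx 6.6\times 10^{5}$$

updates would be needed before the weight falls below $\delta$ . Under these assumptions the weight does shrink toward zero with SGD, but only over a very large number of steps, which is consistent with attributing the difference between optimizers to their convergence rates.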

We conducted a preliminary experiment with the Adam optimizer under the condition that ${\bf g}_{k}^{\left(l\right)}\approx\lambda{\bf w}_{k}^{\left(l\right)}$ , and found that after the weight decays to a certain level, it oscillates near the origin, as shown in Fig. 4. For example, a weight decreased from an initial value of 0.2 and then began to oscillate around $10^{-10}$ . This oscillation is not observed with mSGD. In Theorem III.2 in the following subsection, we give the decay rate for Adam under the pre-oscillation condition.

III-B Convergence Analysis

Here, for simplicity of notation, we omit the indices $k$ and $l$ , consider a scalar weight, and write $w_{t}$ for its value at timestep $t$ . We suppose a non-activation situation in which the gradient of the loss is zero; thus, $g_{t}=\lambda w_{t}$ . Detailed proofs of the following theorems are given in the appendices.

Proposition III.1.

Let $\mu=1+\alpha\lambda-q\sqrt{\alpha\lambda}$ for $q>2$ . Under conditions $(1-\alpha\lambda+\mu)^{2}-4\mu\geq 0$ and $0<\alpha\lambda<1$ , $w_{t}$ decays at rate $O\left(\left(1-\frac{q-\sqrt{q^{2}-4}}{2}\sqrt{\alpha\lambda}\right)^{t}\right)$ for mSGD.

Therefore, if $\frac{q-\sqrt{q^{2}-4}}{2}\sqrt{\alpha\lambda}\geq\alpha\lambda$ , then we obtain a convergence rate faster than the $O\left(\left(1-\alpha\lambda\right)^{t}\right)$ given by SGD. In particular, we obtain a convergence rate similar to Nesterov’s acceleration [16].
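The quantities in Proposition III.1 can be evaluated numerically. The sketch below computes the momentum coefficient $\mu=1+\alpha\lambda-q\sqrt{\alpha\lambda}$ , the discriminant $(1-\alpha\lambda+\mu)^{2}-4\mu$ , and the per-step decay bases of mSGD and SGD. The function name `msgd_decay_bases` and the example values $q=3$ and $\alpha=0.1$ are illustrative assumptions; $\lambda$ is taken from Table I.

```python
import math


def msgd_decay_bases(q, alpha, lam):
    """Evaluate the quantities in Proposition III.1 for given q > 2."""
    s = math.sqrt(alpha * lam)
    mu = 1.0 + alpha * lam - q * s                      # momentum coefficient
    disc = (1.0 - alpha * lam + mu) ** 2 - 4.0 * mu     # must be >= 0
    base_msgd = 1.0 - 0.5 * (q - math.sqrt(q * q - 4.0)) * s
    base_sgd = 1.0 - alpha * lam
    return mu, disc, base_msgd, base_sgd


# Usage with assumed q and alpha; lam = 5.0e-4 as in Table I.
mu, disc, base_msgd, base_sgd = msgd_decay_bases(q=3.0, alpha=0.1, lam=5.0e-4)
print(mu, disc, base_msgd, base_sgd)
```

Expanding the discriminant with this choice of $\mu$ gives $(1-\alpha\lambda+\mu)^{2}-4\mu=(q^{2}-4)\alpha\lambda$ , so the first condition of the proposition holds whenever $q>2$ . A smaller decay base means faster decay, so `base_msgd < base_sgd` corresponds to the case in which mSGD converges faster than SGD.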

Theorem III.2.

Without loss of generality, we may suppose $w_{0}>0$ . Then, as long as $w_{\tau}>0$ holds for $1\leq\tau\leq t$ , $w_{t}$ decays as $w_{t}=O\left(\exp\left(-2^{t}\right)\right)$ for Adam.

The above convergence rate is doubly exponential. Therefore, to achieve $w_{t}\leq\delta,$ we require only $t=\Omega(\log\log(1/\delta))$ steps. We recall that SGD and mSGD require $t=\Omega(\log(1/\delta))$ steps to achieve $w_{t}\leq\delta$ . Thus, with Adam the weight falls below a sufficiently small $\delta$ in far fewer steps than with SGD or mSGD. We believe this rapid decay before oscillation contributes to the sparsity with Adam.
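The gap between the optimizers can be illustrated with a small scalar simulation of the non-activation setting $g_{t}=\lambda w_{t}$ . This is a sketch under stated assumptions rather than a reproduction of the paper's experiments: the function name `simulate_decay`, the shared step size $\alpha=0.001$ for all three optimizers, the momentum $\mu=0.9$ , and the number of steps are illustrative choices. The Adam branch uses the update rule as written in this paper, with $\epsilon$ inside the square root, so it does not attempt to reproduce the oscillation shown in Fig. 4.

```python
import math


def simulate_decay(steps=5000, w0=0.2, alpha=0.001, lam=5e-4,
                   mu=0.9, beta1=0.9, beta2=0.999, eps=1e-8):
    """Track a scalar weight whose only gradient is lam * w."""
    w_sgd = w_msgd = w_adam = w0
    vel = 0.0          # mSGD velocity
    m = v = 0.0        # Adam moving averages
    for t in range(1, steps + 1):
        # Vanilla SGD: w <- (1 - alpha * lam) * w
        w_sgd -= alpha * lam * w_sgd
        # mSGD: v <- mu * v - alpha * lam * w;  w <- w + v
        vel = mu * vel - alpha * lam * w_msgd
        w_msgd += vel
        # Adam with bias correction, eps inside the square root
        g = lam * w_adam
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        w_adam -= alpha * m_hat / math.sqrt(v_hat + eps)
    return w_sgd, w_msgd, w_adam


w_sgd, w_msgd, w_adam = simulate_decay()
print(w_sgd, w_msgd, w_adam)
```

With these settings the SGD and mSGD weights remain close to their initial value of 0.2 after 5000 steps, while the Adam weight has shrunk by many orders of magnitude. The exact Adam trajectory depends on the hyperparameters and on where $\epsilon$ enters the denominator.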

Although our theorem for Adam does not guarantee convergence to the origin, we have the following proposition.

Proposition III.3.

If $w_{t}$ converges to $w_{*}$ as $t\to\infty$ , then the limit point must satisfy $w_{*}=0$ .

Substituting the condition $g_{t}=\lambda w_{t}$ into Eq. (6), the step size of Adam for a sufficiently large $t$ can be represented as

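$$\left|w_{t}-w_{t-1}\right|=\left|-\alpha\frac{1-\beta_{1}}{\sqrt{1-\beta_{2}}}\frac{\sum_{i=0}^{t-1}\beta_{1}^{t-1-i}w_{i}}{\sqrt{\sum_{i=0}^{t-1}\beta_{2}^{t-1-i}w_{i}^{2}+\epsilon}}\right|.$$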

If $w_{t}$ converges to $w_{*}$ , then, since $\sum_{\tau=t}^{\infty}\beta_{1}^{\tau}\to 0$ and $\sum_{\tau=t}^{\infty}\beta_{2}^{\tau}\to 0$ as $t\to\infty$ , letting $\tilde{\epsilon}=\frac{\epsilon}{\sum_{i=0}^{t}\beta_{2}^{i}}$ , the step size becomes

$$\left|w_{t+1}-w_{t}\right|=\cdots=\cdots\ \xrightarrow{\,t\to\infty\,}\ \cdots$$

If $w_{t}$ converges, it must be a Cauchy sequence; that is, the step size must tend to 0, which is satisfied with $w_{*}=0$ .

IV Related Work

In this section, we briefly review related works, including analyses on training DNNs from the viewpoints of activation functions and optimizers, and methods of model reduction. Group sparsity as discussed in this paper relates to the vanishing gradient problem [17], which is generally caused by saturation nonlinearities such as $\mathrm{sigmoid}$ and $\tanh$ . These functions have certain regimes in which the gradient vanishes, and ReLU also possesses such a regime (specifically, $x<0$ ). Li et al. [18] showed that unsupervised pretraining encourages sparse activations with $\mathrm{sigmoid}$ and ReLU in the resulting DNNs. Though it has been known that some units never activate across the entire training dataset, the so-called dying-ReLU [19], detailed analysis of this has not been reported. The remedy for this problem is leaky-ReLU [20] given by

$$\eta(x)=\left\{\begin{array}{ll}x&\left(x>0\right)\\ \rho x&\left(x\leq 0\right)\end{array}\right.,$$

which parameterizes the negative slope $(\rho)$ to prevent its gradients from vanishing in the negative regime. In the next section, we compare various activation functions, including leaky-ReLU and saturation nonlinearities, in terms of sparsity.
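A minimal sketch of leaky-ReLU, $\eta(x)=x$ for $x>0$ and $\eta(x)=\rho x$ otherwise, together with its derivative is given below. The function names and the default slope $\rho=0.01$ are illustrative assumptions. Setting $\rho=0$ recovers ReLU, whose derivative is zero for non-positive inputs.

```python
def leaky_relu(x, rho=0.01):
    """Leaky-ReLU: identity for x > 0, slope rho for x <= 0."""
    return x if x > 0 else rho * x


def leaky_relu_grad(x, rho=0.01):
    """Derivative of leaky-ReLU with respect to its input."""
    return 1.0 if x > 0 else rho


# With rho > 0 a negative pre-activation still passes a scaled gradient;
# with rho = 0 (ReLU) the gradient on the negative side is zero.
print(leaky_relu_grad(-1.0, rho=0.1), leaky_relu_grad(-1.0, rho=0.0))
```

Because the derivative on the negative side is $\rho$ rather than zero, the loss term of the weight gradient does not vanish for non-activated samples, so the regularization term is less likely to dominate.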

SGD is a common optimizer for training DNNs. It is a simple method but achieves good generalization performance by careful tuning of step-size and acceleration via the momentum term [15]. In contrast, adaptive gradient methods, including Adagrad [21], RMSprop [22], Adam [14], and AdaDelta [23], commonly attain faster convergence than the SGD by automatically adjusting the step size. Compared with the SGD, however, their generalization performances are worse in many cases. Theoretical analysis has been conducted, and improvements have been reported [24, 25, 26]. Wilson et al. [24] proved that the adaptive methods do not converge to an optimum in a simple convex optimization problem. Reddi et al. [25] proved that the non-convergence of Adam is due to the step-size scaling according to the moving average of the squared gradient, and proposed an improved method, AMSGRAD. Loshchilov et al. [26] proposed AdamW, which decouples the weight-decay term from the update rule of Adam, and showed it achieves performance similar to that of mSGD. Despite these analyses of Adam, it has not been reported that Adam induces implicit weight sparsity in training DNNs. In the next section, we compare Adam against various optimizers including mSGD, adaptive gradient methods, and the improved methods of Adam [25, 26].

There are two broad approaches to model reduction: pruning redundant weights after learning and learning with sparsity-inducing regularizers on weights. In an example of the former approach, Li et al. [9] exploit the $L_{1}$ norm of weight vectors to prune less important channels after learning the DNN, and Polyak et al. [10] do the same based on the statistics of ReLU activations. However, selecting the pruning threshold is difficult, and fine-tuning after pruning is necessary for recovering the accuracy. In our method, since the weights converge to near the origin, it is easy to select the threshold, which only needs to be small, such as $10^{-15}$ , and the fine-tuning is optional. In the latter approach, Wen et al. [11], Scardapane et al. [12], and Yoon et al. [13] minimize the $L_{2,1}$ norm of weight vectors with the objective function so that channel-level group sparsity is obtained. Our method yields sparsity based on the $L_{2}$ norm, which is a smooth function, making it easy to optimize. We show the effectiveness of our method relative to the other methods [11, 12] below.

V Experiments

In this section, we demonstrate the sparsity under various training setups and compare our model reduction method with others that use sparsity-inducing regularizers. The experiments were on image classification tasks using the MNIST [28] and CIFAR-10 [29] datasets. In all experiments on both datasets, we used a softmax function in the output layer and the cross-entropy loss for training networks.

V-A Demonstration of Sparsity

On the MNIST dataset, we used fully connected networks with the experimental setup summarized in Table I as the baseline, and evaluated different activation functions, optimizers, and other components. Table II shows the validation accuracy and sparsity of the weight vectors produced by the different setups. There is not much difference in terms of accuracy, but the sparsity varies for each component. In the top block of Table II, RMSprop yields sparsity comparable to that of Adam. RMSprop is essentially a special case of Adam with $\beta_{1}=0$ , and we see it has similar convergence. The middle block of Table II shows the results for different activation functions. Sparsity is obtained by ReLU and by exponential linear units (ELU) [30]. Since the gradients of ELU mostly vanish in the negative regime except near the origin, it behaves like ReLU; that is, the gradient of the objective function is dominated by that of $L_{2}$ regularization, and it yields similar sparsity. Although $\tanh$ obtains only limited sparsity, this indicates that saturation nonlinearities have the potential for sparsity. We also investigated the behavior of leaky-ReLU [20]. Fig. 5 plots the distributions of the $L_{2}$ norm of the weight vectors for different values of the negative slope $\rho$ . In the case of $\rho=0$ , which corresponds to ReLU, there are two modes: the lower mode for (near-) zero weights and the other mode for active weights. Observing that the lower mode shifts as the negative slope increases, we see that leaky-ReLU can alleviate the decay of weights by propagating gradients for negative inputs.

Additionally, we evaluated the behavior with the initializer proposed by He et al. [2] and with or without the BN layer and $L_{2}$ regularization. The bottom block of Table II shows that the initializer and BN layer make no appreciable difference, but $L_{2}$ regularization is needed for sparsity, as expected. Next, we investigated the effects of the number of nodes in the hidden layer and the number of layers in the network. Table III shows the validation accuracy and number of retained weight vectors (i.e., nodes) after training for different numbers of nodes. Networks with a small number of nodes yield low accuracy, and their weights are fully retained. We attribute this to such networks not being redundant, meaning that each node in the networks contributes to decreasing the loss (i.e., each node is well activated). In contrast, the numbers of retained weights (and accuracies) in networks with 500 or more nodes were almost identical. This result indicates that wide networks can be reduced to a certain narrow model without sacrificing accuracy. The results with different numbers of layers are shown in Table IV, where each network was trained with 1000 nodes in each layer. The results indicate that the number of retained weights becomes small in the deeper layers (except for the last hidden layer).

On the CIFAR-10 dataset, we trained a VGG-style convolutional neural network with almost the same setup as in [31]. During a preprocessing step, each image was normalized to have mean 0 and standard deviation 1 over three channels, and horizontal flipping was applied as data augmentation. We again used the initializer proposed by He et al. [2] for weights and applied dropout [4] to convolution and fully connected layers, as in [31]. With the same parameter of $\lambda=5.0\times 10^{-4}$ , we compared against other optimizers: mSGD, AMSGRAD [25], and AdamW [26]. An initial learning rate of 0.1 was used for mSGD and 0.0005 for the others. The batch size, learning rate scheduling, and $L_{2}$ -norm threshold for model reduction are shown in Table I. The maximum validation accuracy during the training for 400 epochs and the ratio of reduced parameters at that time are reported in Table V. We can see that parameters are not reduced by AMSGRAD and AdamW; that is, these optimizers do not induce sparsity. Instead of using the moving average of the squared gradient, AMSGRAD uses its maximum up to each timestep. Thus, AMSGRAD does not induce sparsity because the step size tends to decrease as the training progresses, and so weights decay more slowly. For AdamW, the lack of sparsity arises because the rate of decay in the weights is similar to that of SGD, since the weight-decay term is decoupled from the step-size computation of Adam. We can see that Adam achieves accuracy similar to that of AMSGRAD but reduces more than 50% of parameters. However, there is still a gap in accuracy between Adam and mSGD. We suspect that this gap may result from the implicit weight sparsity of Adam.

Next, we observed the relationship between the activation rate of channels (ReLUs) and the $L_{2}$ norm of their weight vectors during training. We again trained a VGG-style network by Adam and mSGD with the same setups as above. We selected the 23 of 64 filters in the first convolution layer whose weights converged to zero with Adam. The progress of their activation rate and $L_{2}$ norm with both optimizers is illustrated in Fig. 6. The activation rate of a filter was computed as follows: we counted the nonzero pixels in the feature map after the ReLU layer, divided the count by the feature map size, and averaged the result over all training images. In Fig. 6, we can see that the $L_{2}$ norm and activation rate with Adam jointly decrease to zero, while those with mSGD converge to some nonzero values as the training progresses. This observation indicates that the activation rate of a ReLU decreases in conjunction with the $L_{2}$ norm of its input weight vector. Thus, we see that the weight decays rapidly with Adam even when the ReLU is not completely inactive (i.e., the condition $g_{t}=\lambda w_{t}$ is not exactly satisfied).
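The activation-rate computation described above can be sketched for a single filter as follows: count the nonzero entries of each post-ReLU feature map, divide by the map size, and average over images. The function name `activation_rate` and the nested-list representation of feature maps are assumptions made for illustration.

```python
def activation_rate(feature_maps):
    """Average fraction of nonzero pixels in one filter's post-ReLU maps.

    feature_maps: list over images; each item is a 2-D map given as a
    list of rows of activation values.
    """
    rates = []
    for fmap in feature_maps:
        size = sum(len(row) for row in fmap)                   # map size
        nonzero = sum(1 for row in fmap for a in row if a != 0)
        rates.append(nonzero / size)
    return sum(rates) / len(rates)                             # mean over images


# Usage: two 2x2 maps with 1 and 2 active pixels, respectively.
maps = [
    [[0.0, 1.5], [0.0, 0.0]],
    [[2.0, 0.3], [0.0, 0.0]],
]
rate = activation_rate(maps)
```

A rate near zero means the filter's ReLU is rarely active on the training set, which is the regime in which the regularization gradient dominates the update of its weight vector.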

V-B Comparisons on Model Reduction

We compared our method with other model reduction methods in which the group sparsity of weights is explicitly induced by a regularizer. On the MNIST dataset, we compared with the method proposed by Wen et al. [11]. We used the same network model as in [11], namely, a 4-layer fully connected network without BN layers, having 500 and 300 nodes in its hidden layers. We trained it with He’s initializer [2] and the same setup as in Table I. As shown in Table VI, our method is slightly better than [11] in accuracy and comparable in reduction rate.

Moreover, we compared with the method proposed by Scardapane et al. [12] on the CIFAR-10 dataset, using the same model and setup as in the experiment of the previous subsection. We implemented their method and trained the model by mSGD with an initial learning rate of 0.1. For their method, the threshold for model reduction was set to $1.0\times 10^{-6}$. To investigate the trade-off between accuracy and reduction rate, we used several values of the regularization parameter $\lambda$. The network was trained 5 times for each $\lambda$ with different initial weights, and the accuracy and ratio of reduced parameters are plotted in Fig. 7. Our method generally achieves higher accuracy than the other method at the same reduction rate. The other method sometimes yielded low accuracy, whereas ours was more stable. Therefore, our method can efficiently control the model size by tuning the $L_{2}$ regularization parameter.
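The threshold-based reduction used in these comparisons can be sketched for a pair of fully connected layers without BN. This is an illustration under assumptions rather than the authors' implementation: the function name `reduce_fc_pair`, the weight layout (one row per output node), and the way a removed node's constant output is folded into the next layer's bias are choices made for this example, and the threshold value in the usage is only a placeholder.

```python
import numpy as np

def reduce_fc_pair(W1, b1, W2, b2, threshold):
    """Remove hidden nodes whose incoming weight vector has L2 norm <= threshold.

    Computes y = W2 @ relu(W1 @ x + b1) + b2 with fewer hidden nodes.
    W1: (hidden, in), b1: (hidden,), W2: (out, hidden), b2: (out,).
    """
    norms = np.linalg.norm(W1, axis=1)
    keep = norms > threshold
    # A removed node ignores its input, so it outputs the constant relu(b1[j]).
    const_out = np.maximum(b1[~keep], 0.0)
    # Fold that constant contribution into the next layer's bias.
    b2_new = b2 + W2[:, ~keep] @ const_out
    return W1[keep], b1[keep], W2[:, keep], b2_new, keep

# Usage with random weights in which two hidden nodes have zero incoming weights.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 4))
W1[[1, 3]] = 0.0
b1 = rng.normal(size=5)
W2 = rng.normal(size=(3, 5))
b2 = rng.normal(size=3)

W1r, b1r, W2r, b2r, keep = reduce_fc_pair(W1, b1, W2, b2, threshold=1e-6)
before = W1.size + b1.size + W2.size + b2.size
after = W1r.size + b1r.size + W2r.size + b2r.size
print("ratio of reduced parameters:", 1.0 - after / before)
```

When the removed rows are exactly zero, the reduced pair computes the same output as the original; with a nonzero threshold the match is approximate.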

VI Conclusions

In this paper, we have analyzed, both theoretically and experimentally, the implicit weight sparsity induced by the combination of Adam, ReLU, and $L_{2}$ regularization. Assuming that weight decay by $L_{2}$ regularization becomes dominant in the presence of less-activated ReLUs, we have shown mathematically that Adam requires $\Omega(\log\log(1/\delta))$ steps to achieve a sufficiently small weight of $\delta$, which is faster than the $\Omega(\log(1/\delta))$ of mSGD. We believe this difference leads to sparsity of weights with Adam. Additionally, we proposed a method for model reduction that simply eliminates the zero weights after training the DNN. In experiments on the MNIST and CIFAR-10 datasets, we demonstrated the sparsity with various training setups and found that other activation functions and optimizers having properties similar to ReLU and Adam, respectively, also achieve sparsity. Finally, we confirmed that our method can efficiently reduce the model size and performs favorably relative to other methods that use a sparsity-inducing regularizer.

Acknowledgment

TS was partially supported by MEXT Kakenhi (25730013, 25120012, 26280009, 15H05707 and 18H03201), Japan Digital Design, JST-PRESTO and JST-CREST.

Appendices

VI-A Convergence of mSGD (Proposition III.1)

The update rule of mSGD for a weight $w_{t}$ is given by

[TABLE]

The condition $(1-\alpha\lambda+\mu)^{2}-4\mu\geq 0$ implies that

[TABLE]

Since $0<\mu<1$ , $\mu$ must satisfy $\mu\leq 1+\lambda\alpha-2\sqrt{\alpha\lambda}$ under the assumption $0<\lambda\alpha<1$ . Here, letting $\mu=1+\alpha\lambda-q\sqrt{\alpha\lambda}$ for $q>2$ and defining $\beta=\frac{2-q\sqrt{\alpha\lambda}+\sqrt{(q^{2}-4)\alpha\lambda}}{2}$ and $\gamma=\frac{2-q\sqrt{\alpha\lambda}-\sqrt{(q^{2}-4)\alpha\lambda}}{2}$ , we have that

[TABLE]

The right-hand side can be evaluated as

[TABLE]
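The quantities in this appendix can be tied together with a short worked derivation. It is a sketch reconstructed from the surrounding definitions, assuming the heavy-ball form of mSGD and a gradient consisting only of the regularization term, $g_{t}=\lambda w_{t}$; the constants $c_{1}$ and $c_{2}$ are hypothetical symbols for the coefficients fixed by the initial conditions, and the steps may differ from the authors' displayed equations.

```latex
\begin{align*}
w_{t+1} &= w_{t} - \alpha\lambda w_{t} + \mu\,(w_{t}-w_{t-1})
         = (1-\alpha\lambda+\mu)\,w_{t} - \mu\,w_{t-1}, \\
0 &= x^{2} - (1-\alpha\lambda+\mu)\,x + \mu
   \quad\text{(characteristic equation)}, \\
(1-\alpha\lambda+\mu)^{2}-4\mu \geq 0
  &\iff \mu^{2} - 2(1+\alpha\lambda)\,\mu + (1-\alpha\lambda)^{2} \geq 0 \\
  &\iff \mu \leq 1+\alpha\lambda-2\sqrt{\alpha\lambda}
   \ \text{ or }\ \mu \geq 1+\alpha\lambda+2\sqrt{\alpha\lambda}, \\
\mu = 1+\alpha\lambda-q\sqrt{\alpha\lambda}
  &\implies 1-\alpha\lambda+\mu = 2-q\sqrt{\alpha\lambda},
  \quad (1-\alpha\lambda+\mu)^{2}-4\mu = (q^{2}-4)\,\alpha\lambda, \\
x &= \frac{2-q\sqrt{\alpha\lambda}\pm\sqrt{(q^{2}-4)\,\alpha\lambda}}{2}
   \in \{\beta,\gamma\},
   \qquad \beta+\gamma = 1-\alpha\lambda+\mu, \quad \beta\gamma = \mu, \\
w_{t} &= c_{1}\beta^{t} + c_{2}\gamma^{t},
   \qquad 0<\gamma<\beta<1 .
\end{align*}
```

Under these assumptions the weight decays geometrically at the rate of the dominant root $\beta$, so the number of steps needed to reach $\delta$ grows like $\log(1/\delta)$, in line with the rate stated for mSGD in the conclusions.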

VI-B Convergence of Adam (Theorem III.2)

Let $\nu=\frac{\alpha(1-\beta_{1})}{\sqrt{1-\beta_{2}}}$ . Then, by the update rule, $w_{t}$ is recursively given by

[TABLE]

for $m_{t}=w_{t-1}+\beta_{1}w_{t-2}+\dots+\beta_{1}^{t}w_{0}$ and $v_{t}=w_{t-1}^{2}+\beta_{2}w_{t-2}^{2}+\dots+\beta_{2}^{t}w_{0}^{2}$ .

First, notice that if $\beta_{1}\leq\beta_{2}$ , then the difference between $w_{t}$ and $w_{t-1}$ is bounded by

[TABLE]

Hence, we see that, as long as $w_{t}>\nu^{\prime}$ , positivity of $w_{t+1}$ is ensured: $w_{t+1}>0$ .

Suppose that $w_{0}>0$ and $t$ is an integer such that, for $1\leq t^{\prime}\leq t$ , $w_{t^{\prime}}>\delta\geq\nu^{\prime}$ . In this setting,

[TABLE]

Therefore, we have $w_{t}<w_{t-1}<\dots<w_{0}$ , and

[TABLE]

Here, letting $\xi=1-\frac{\nu}{\sqrt{w_{0}^{2}/(1-\beta_{2})+\epsilon}}$ , we have $w_{\tau}\leq\xi^{\tau}w_{0}\ (1\leq\tau\leq t)$ . From this observation, we again evaluate $v_{\tau}$ as

[TABLE]

Now, let $\tau^{*}$ be the smallest $\tau$ such that $(\tau+1)\max\{\xi,\beta_{2}\}^{\tau}\leq 1/2$ . By this definition, if $k\tau^{*}\leq\tau\leq(k+1)\tau^{*}$ , then

[TABLE]

Therefore, for $k\tau^{*}\leq\tau\leq(k+1)\tau^{*}$ ,

[TABLE]

Now, supposing that $k$ satisfies $\epsilon\leq(1/2)^{k}w_{0}^{2}$ (otherwise, we have $w_{\tau}^{2}\leq\epsilon$ ), the above inequality yields

[TABLE]

which is doubly exponential. Therefore, to achieve $w_{\tau+1}\leq\delta$, we require only

[TABLE]

Bibliography

[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, 2015.
[2] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proc. IEEE International Conference on Computer Vision (ICCV), 2015.
[3] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), 2011, pp. 315–323.
[4] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
[5] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. International Conference on Machine Learning (ICML), 2015, pp. 448–456.
[6] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[7] S. Shi and X. Chu, “Speeding up convolutional neural networks by exploiting the sparsity of rectifier units,” arXiv preprint arXiv:1704.07724, 2017.
[8] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural network with pruning, trained quantization and Huffman coding,” in Proc. International Conference on Learning Representations (ICLR), 2016.
