Memorized Sparse Backpropagation

Zhiyuan Zhang; Pengcheng Yang; Xuancheng Ren; Qi Su; Xu Sun

arXiv:1905.10194·cs.LG·October 28, 2020


TL;DR

This paper introduces a unified sparse backpropagation framework and a novel algorithm called memorized sparse backpropagation (MSBP), which preserves information and accelerates neural network training.

Contribution

It provides a theoretical analysis of sparse backpropagation and proposes MSBP to mitigate information loss while maintaining efficiency.

Findings

1. MSBP effectively reduces information loss in sparse backpropagation.

2. MSBP achieves acceleration comparable to traditional sparse backpropagation.

3. Theoretical analysis shows that sparse backpropagation converges in probability under certain conditions.


Tables

Table 1: The time and memory complexity of backpropagation for a linear layer with input size $n$ and output size $m$. We adopt $top_{k}$ as the sparsifying function.

Method	Time	Memory
Linear	$O (m n)$	$O (m n)$
+ SBP	$O (m k + n \log k)$	$O (m n)$
+ MSBP	$O (m k + n \log k)$	$O (m n)$

Table 2: Results of time cost and evaluation scores. $h$, $r$ and $\gamma$ refer to the hidden size, sparse ratio and memory ratio of our models (SBP and MSBP share the same $h$ with the baseline, and MSBP shares the same $r$ with SBP). BP (s) and Total (s) refer to the backpropagation time cost and the total time cost on CPU in seconds ($a\times$ is the speedup relative to the baseline). Acc (%) and UAS (%) refer to the averaged accuracy and unlabeled attachment score, respectively ($\pm a$ is the difference from the baseline).

MNIST	BP (s)	Total (s)	Acc (%)
MLP ( $h$ =500)	67.2 (1.00 $\times$ )	116.6 (1.00 $\times$ )	97.86 (+0.00)
+ SBP ( $r$ =0.04)	6.6 (10.18 $\times$ )	54.5 (2.14 $\times$ )	97.84 (-0.02)
+ MSBP ( $γ$ =0.8)	6.9 (9.74 $\times$ )	55.4 (2.10 $\times$ )	98.23 (+0.37)
Parsing	BP (s)	Total (s)	UAS (%)
MLP ( $h$ =500)	6447 (1.00 $\times$ )	9016 (1.00 $\times$ )	88.38 (+0.00)
+ SBP ( $r$ =0.04)	682 (9.46 $\times$ )	2886 (3.12 $\times$ )	88.59 (+0.21)
+ MSBP ( $γ$ =0.7)	684 (9.43 $\times$ )	2898 (3.11 $\times$ )	89.03 (+0.65)
POS-Tag	BP (s)	Total (s)	Acc (%)
LSTM ( $h$ =500)	11965 (1.00 $\times$ )	16052 (1.00 $\times$ )	97.27 (+0.00)
+ SBP ( $r$ =0.04)	1763 (6.79 $\times$ )	5738 (2.80 $\times$ )	97.34 (+0.07)
+ MSBP ( $γ$ =0.8)	1842 (6.50 $\times$ )	5849 (2.74 $\times$ )	97.50 (+0.23)

Table 3: Results of different approaches on TextCNN. Acc denotes the averaged accuracy.

Subjectivity	Acc (%)
TextCNN	93.66 (+0.00)
+ SBP ( $r$ =0.05)	93.77 (+0.11)
+ MSBP ( $r$ =0.05, $γ$ =0.6)	93.80 (+0.14)
Polarity	Acc (%)
TextCNN	80.89 (+0.00)
+ SBP ( $r$ =0.05)	81.12 (+0.23)
+ MSBP ( $r$ =0.05, $γ$ =0.3)	81.48 (+0.58)

Table 4: Results of different approaches on sequence-to-sequence models. Dev BLEU and Test BLEU refer to the BLEU scores on the development set and the test set.

En-Vi	Dev BLEU	Test BLEU
Seq2seq ( $h$ =512)	25.87	28.45 (+0.00)
+ SBP ( $r$ =1/16)	26.64	28.94 (+0.39)
+ MSBP ( $r$ =1/16, $γ$ =0.05)	26.53	29.10 (+0.65)
Chs-En	Dev BLEU	Test BLEU
Seq2seq ( $h$ =512)	38.54	35.86 (+0.00)
+ SBP ( $r$ =1/16)	38.74	35.94 (+0.08)
+ MSBP ( $r$ =1/16, $γ$ =0.05)	38.28	35.98 (+0.12)

Table 5: Results of time cost and evaluation scores in extremely sparse scenarios.

MNIST	BP (s)	Total (s)	Acc (%)
MLP( $h$ =5000)	3667.5 (1.00 $\times$ )	5744.1 (1.00 $\times$ )	98.10(+0.00)
+ SBP ( $r$ =0.001)	57.0 (64.34 $\times$ )	2131.8 (2.71 $\times$ )	96.19(-1.93)
+ MSBP ( $r$ =0.001, $γ$ =0.8)	58.8 (62.37 $\times$ )	2139.0 (2.70 $\times$ )	97.71(-0.39)
+ SBP ( $r$ =0.002)	102.1 (35.92 $\times$ )	2196.4 (2.63 $\times$ )	96.22 (-1.90)
+ MSBP ( $r$ =0.002, $γ$ =0.8)	104.7 (35.03 $\times$ )	2206.8 (2.62 $\times$ )	98.16 (+0.06)

Table 6: Results of $t$-values under different settings.

	$0.1$	$0.2$	$0.3$	$0.4$	$0.5$	$0.6$	$0.7$	$0.8$	$0.9$
$5$	$16.2$	$14.6$	$18.0$	$17.0$	$19.3$	$18.8$	$18.2$	$18.5$	$16.4$
$10$	$11.1$	$12.0$	$11.2$	$12.6$	$10.7$	$11.2$	$11.8$	$11.6$	$11.1$
$20$	$6.3$	$5.2$	$5.8$	$5.8$	$5.1$	$6.2$	$5.1$	$6.4$	$5.8$

Table 7: Results of $F$-values under different settings.

	$0.1$	$0.2$	$0.3$	$0.4$	$0.5$	$0.6$	$0.7$	$0.8$	$0.9$
$5$	$4.9$	$1.9$	$4.0$	$2.6$	$7.3$	$3.8$	$4.4$	$3.4$	$4.0$
$10$	$2.9$	$5.5$	$2.2$	$2.1$	$2.2$	$2.8$	$3.5$	$3.2$	$3.6$
$20$	$6.2$	$2.9$	$5.2$	$3.6$	$4.3$	$5.7$	$2.7$	$3.0$	$7.0$

Equations

$$\ell(\mathbf{w};\mathcal{D})=\frac{1}{|\mathcal{D}|}\sum_{(\mathbf{x},\mathbf{y})\in\mathcal{D}}\ell\big(\mathbf{w};(\mathbf{x},\mathbf{y})\big)$$

$$\angle\langle\mathbf{a},\mathbf{b}\rangle=\arccos\frac{\mathbf{a}\cdot\mathbf{b}}{\|\mathbf{a}\|\|\mathbf{b}\|}\in[0,\pi]$$

$$\phi(\ell)=\arccos\frac{\mu}{L}\in\Big(0,\frac{\pi}{2}\Big)$$

$$\delta(\mathbf{v})=\angle\Big\langle\mathbf{g^{v}},\frac{\partial\ell}{\partial\mathbf{v}}\Big\rangle$$

$$\angle\langle S_{\mathbb{I}_{k}}(\mathbf{v}),\mathbf{v}\rangle\leq\arccos\frac{k}{n}$$

$$top_{k}(\mathbf{v})=\mathbb{I}_{k}(\mathbf{v})\odot\mathbf{v}$$

$$\mathbf{w}_{t+1}=\mathbf{w}_{t}-\eta_{t}\mathbf{g}_{t}^{\mathbf{w}}$$

$$\|\mathbf{w}_{t+1}-\mathbf{w}^{*}\|\leq\sin\theta\,\|\mathbf{w}_{t}-\mathbf{w}^{*}\|$$

$$\|\mathbf{w}_{T}-\mathbf{w}^{*}\|\leq\epsilon$$

Symbols appearing in the layer equations: $\mathbf{h}$, $\mathbf{z}$, $\frac{\partial\ell}{\partial\mathbf{h}}$, $\frac{\partial\ell}{\partial\mathbf{W}}$, $\frac{\partial\ell}{\partial\mathbf{x}}$

$$\lim_{|\mathcal{D}|\to\infty}P\big(\delta(\mathbf{w})<\theta\big)=1$$

$$\mathbf{m}\leftarrow\gamma\Big(\frac{\partial\ell}{\partial\mathbf{h}}+\mathbf{m}-S_{\mathbb{I}_{k}}\Big(\frac{\partial\ell}{\partial\mathbf{h}}+\mathbf{m}\Big)\Big)$$

$$\angle\langle\mathbf{a},\mathbf{b}\rangle\leq\angle\langle\mathbf{a},\mathbf{c}\rangle+\angle\langle\mathbf{b},\mathbf{c}\rangle$$

$$\cos\angle\langle\mathbf{A}^{\rm T}\mathbf{u},\mathbf{A}^{\rm T}\mathbf{v}\rangle\geq\rho\cos\angle\langle\mathbf{u},\mathbf{v}\rangle+1-\rho$$


Full text


MOE Key Laboratory of Computational Linguistics, School of EECS, Peking University, Beijing 100871, China.

School of Foreign Languages, Peking University, Beijing 100871, China.

Abstract

Neural network learning is usually time-consuming since backpropagation needs to compute full gradients and backpropagate them across multiple layers. Despite the success of existing work in accelerating propagation through sparseness, the relevant theoretical characteristics remain under-researched, and empirical studies have found that these methods suffer from the loss of information contained in unpropagated gradients. To tackle these problems, this paper presents a unified sparse backpropagation framework and provides a detailed analysis of its theoretical characteristics. The analysis reveals that, when applied to a multilayer perceptron, our framework essentially performs gradient descent using an estimated gradient similar enough to the true gradient, resulting in convergence in probability under certain conditions. Furthermore, a simple yet effective algorithm named memorized sparse backpropagation (MSBP) is proposed to remedy the problem of information loss by storing unpropagated gradients in memory for learning in the next steps. Experimental results demonstrate that the proposed MSBP is effective in alleviating the information loss in traditional sparse backpropagation while achieving comparable acceleration.

Keywords: Neural Networks, Backpropagation, Sparse Gradient, Acceleration.

Journal: Neurocomputing

1 Introduction

Training neural networks tends to be time-consuming [1, 2, 3], especially for architectures with a large number of learnable model parameters. An important reason why neural network learning is typically slow is that backpropagation requires the calculation of full gradients and updates all parameters in each learning step [4]. As deep networks with massive parameters become more prevalent, more and more efforts are devoted to accelerating the process of backpropagation. Among existing efforts, a prominent research line is sparse backpropagation [4, 5, 6], which aims at sparsifying the full gradient vector to achieve significant savings on computational cost.

One effective solution for sparse backpropagation is top-$k$ sparseness, which keeps only the $k$ elements with the largest magnitude in the gradient vector and backpropagates them across different layers. For instance, meProp [4] employs top-$k$ sparseness to compute only a very small but critical portion of the gradient information and update the corresponding model parameters for the linear transformation. Going a step further, [5] implements top-$k$ sparseness for backpropagation on convolutional neural networks. Experimental results demonstrate that these methods can achieve a significant acceleration of the backpropagation process. However, despite its success in saving computational cost, top-$k$ sparseness for backpropagation still suffers from two drawbacks, elaborated on as follows.

On the theoretical side, the characteristics of sparse backpropagation, especially for top-$k$ sparseness [7, 4, 5], have not been fully explored. Most previous work focuses on empirical explanations rather than theoretical guarantees. To fill this gap, we first present a unified sparse backpropagation framework, of which some existing work [4, 5] can be shown to be special cases. Furthermore, we analyze the theoretical characteristics of the proposed framework, which provides theoretical explanations for some related works [7, 4, 5]. The analysis shows that, when applied to a multilayer perceptron, the proposed framework essentially performs gradient descent with an estimated gradient that is similar enough to the true gradient, which leads to convergence under certain conditions.

On the empirical side, we find that top-$k$ sparseness for backpropagation tends to lose the information contained in unpropagated gradients. Although it can propagate the most crucial gradient information by keeping only the $k$ elements with the largest magnitude in the gradient vector, the unpropagated gradient may also contain substantial useful information. Such information loss usually results in adverse effects such as poor stability in model performance. Here, model performance refers to task-specific evaluation scores, such as accuracy on classification tasks. To remedy this, we propose memorized sparse backpropagation (MSBP), which stores unpropagated gradients in memory for the next step of learning while propagating a critical portion of the gradient information. Compared to previous works [4, 5], the proposed MSBP is capable of alleviating the information loss with the memory mechanism, thus improving model performance significantly. To sum up, the main contributions of this work are two-fold:

1. We present a unified sparse backpropagation framework and prove that some existing methods [4, 5] are special cases under this framework. In addition, the theoretical characteristics of the proposed framework are analyzed in detail to provide theoretical accounts for related work.

2. We propose memorized sparse backpropagation, which aims at alleviating the information loss by storing unpropagated gradients in memory for the next step of learning. The experiments demonstrate that our approach is able to effectively alleviate information loss while achieving comparable acceleration.

2 Related Work

When training neural networks, the gradient to be backpropagated is not necessarily the true gradient. The synthetic gradients method [8] allows layers to be trained without waiting for the true error gradient backpropagated from the previous layer. The Direct Feedback Alignment method [9] suggests that the weights used for gradient calculation in backward propagation do not have to be symmetric with the weights used for forward calculation. Furthermore, combining extremely sparse connections with feedback alignment causes only a small accuracy drop while reducing multiply-and-accumulate (MAC) operations and data transmission cost [10]. Gradients can also be calculated locally. Local Propagation [11] and the Alternating Direction Method of Multipliers (ADMM) [12] calculate gradients locally to avoid long dependencies among variable gradients and enable the parallelization of the training computations over the neural units.

Since estimated gradients, rather than true gradients, can be used for backpropagation, a prominent research line to accelerate backpropagation is sparse backpropagation. It accelerates neural network training, which tends to be time-consuming [1, 2, 3], by sparsifying gradients in backpropagation. For instance, a hardware-oriented structural sparsifying method [6] was devised for LSTM, which enforces a fixed level of sparsity in the LSTM gate gradients, yielding block-based sparse gradient matrices. [4] proposes meProp for linear transformation, which employs top-$k$ sparseness to compute only a small but critical portion of gradients and updates the corresponding model parameters. Furthermore, meProp can also be extended to convolutional layers [5] and deep sequence-to-sequence models [7].

Besides sparse backpropagation, sparse gradient methods calculate the true gradients in backpropagation but sparsify gradients for parameter updates or communication in a distributed system. Many efforts are devoted to analyzing the convergence of sparse gradient methods [13, 14, 15]. [13] proposes to equip sparse gradient methods with an error memory in a distributed system.

Sparse coding [16] is a kind of unsupervised method that represents data or features efficiently and accelerates neural network training; its biological plausibility has been tested in the literature [17, 18, 19]. The sparse auto-encoder [2] learns sparse features with an energy-based model to represent them efficiently. In order to train outrageously large neural networks, [20] introduces the Sparsely-Gated Mixture-of-Experts (MoE) layer, in which a gating network is utilized to select a sparse combination of the expert networks. Sparse representation has also been applied to computer vision problems; for example, [21] proposes a sparse temporal encoding method to obtain visual features for robust object recognition.

There are also many approaches that do not utilize sparsity to accelerate network learning. For example, [22] proposes an adaptive acceleration strategy for backpropagation while [23] performs local adaptation of parameter update based on error function. To speed up the computation of the softmax layer, [1] utilizes importance sampling to make the training more efficient. [24] presents dropout, which improves training speed and reduces overfitting by randomly dropping units from the neural network during training. From the perspective of distributed systems, [3] proposes a one-bit-quantizing mechanism to reduce the communication cost between multiple machines.

3 Preliminary

This section presents some preliminaries. Given the dataset $\mathcal{D}=\{(\mathbf{x},\mathbf{y})\}$, the training loss of an input instance $\mathbf{x}$ is defined as $\ell\big(\mathbf{w};(\mathbf{x},\mathbf{y})\big)$, where $\mathbf{w}$ denotes the learnable model parameters and $\ell(\cdot,\cdot)$ is some loss function such as the $\ell_{2}$ or logistic loss. Further, the training loss on the whole dataset $\mathcal{D}$ is defined as:

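$$\ell(\mathbf{w};\mathcal{D})=\frac{1}{|\mathcal{D}|}\sum_{(\mathbf{x},\mathbf{y})\in\mathcal{D}}\ell\big(\mathbf{w};(\mathbf{x},\mathbf{y})\big)$$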

We represent the angle between the vector $\mathbf{a}$ and the vector $\mathbf{b}$ as:

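$$\angle\langle\mathbf{a},\mathbf{b}\rangle=\arccos\frac{\mathbf{a}\cdot\mathbf{b}}{\|\mathbf{a}\|\|\mathbf{b}\|}\in[0,\pi]$$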
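As a rough illustration, the angle $\arccos\frac{\mathbf{a}\cdot\mathbf{b}}{\|\mathbf{a}\|\|\mathbf{b}\|}$ between two vectors can be computed as in the sketch below. The helper name `angle` and the clamping against round-off are assumptions made for this example, not part of the paper.

```python
import math

def angle(a, b):
    """Angle in radians between vectors a and b, in [0, pi]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Clamp to [-1, 1] to guard against floating-point round-off.
    cos = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cos)

print(angle([1.0, 0.0], [0.0, 2.0]))   # orthogonal vectors
print(angle([1.0, 1.0], [2.0, 2.0]))   # parallel vectors
```

The function assumes both vectors are non-zero; a zero vector has no well-defined angle.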

For $\mu$-strongly convex and $L$-smooth functions, $\mu/L$ often plays an important role in the convergence analysis [25, 26, 27]. Here, writing $\ell(\mathbf{w})=\ell(\mathbf{w};\mathcal{D})$, $\ell(\mathbf{w})$ is $\mu$-strongly convex if and only if, for any vectors $\mathbf{a},\mathbf{b}$, $\ell(\mathbf{b})\geq\ell(\mathbf{a})+\nabla\ell(\mathbf{a})\cdot(\mathbf{b}-\mathbf{a})+\frac{\mu}{2}\|\mathbf{b}-\mathbf{a}\|^{2}$, and $L$-smooth if and only if, for any vectors $\mathbf{a},\mathbf{b}$, $\ell(\mathbf{b})\leq\ell(\mathbf{a})+\nabla\ell(\mathbf{a})\cdot(\mathbf{b}-\mathbf{a})+\frac{L}{2}\|\mathbf{b}-\mathbf{a}\|^{2}$. [27] uses the condition number $L/\mu$ to characterize the performance of their methods. We define the convex-smooth angle to characterize the convergence of our method:

Definition 1 (Convex-smooth angle).

If the training loss $\ell=\ell(\mathbf{w};\mathcal{D})$ on the dataset $\mathcal{D}$ is $\mu$ -strongly convex and $L$ -smooth for parameter vector $\mathbf{w}$ , the convex-smooth angle of $\ell$ is defined as:

$$\phi(\ell)=\arccos\frac{\mu}{L}\in\Big(0,\frac{\pi}{2}\Big)$$
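A small worked example may help to make the convex-smooth angle concrete. The sketch below takes $\phi(\ell)=\arccos(\mu/L)$ as the definition and evaluates it for a hypothetical quadratic loss; the function name `convex_smooth_angle` and the chosen curvatures are illustrative assumptions.

```python
import math

def convex_smooth_angle(mu, L):
    """phi = arccos(mu / L) for a mu-strongly convex, L-smooth loss."""
    if not 0 < mu <= L:
        raise ValueError("need 0 < mu <= L")
    return math.acos(mu / L)

# Hypothetical loss: l(w) = 0.5 * (1.0 * w1^2 + 4.0 * w2^2),
# so mu = 1.0 (smallest curvature) and L = 4.0 (largest curvature).
phi = convex_smooth_angle(1.0, 4.0)
print(math.degrees(phi))
```

A badly conditioned loss (small $\mu/L$) pushes $\phi(\ell)$ towards $\pi/2$, which leaves less room for the gradient estimation angle in the convergence condition of Theorem 1.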

Then we define the gradient estimation angle to measure estimation:

Definition 2 (Gradient estimation angle).

For any vector $\mathbf{v}$ and training loss $\ell$ on an instance or whole dataset, we use $\mathbf{g^{v}}$ to represent an estimation of the true gradient $\frac{\partial\ell}{\partial\mathbf{v}}$ . Then, the gradient estimation angle between the estimated gradient $\mathbf{g^{v}}$ and the true gradient $\frac{\partial\ell}{\partial\mathbf{v}}$ is defined as:

$$\delta(\mathbf{v})=\angle\Big\langle\mathbf{g^{v}},\frac{\partial\ell}{\partial\mathbf{v}}\Big\rangle$$

The definition of sparsifying function is represented as:

Definition 3 (Sparsifying function).

Given an integer $k\in[0,n]$ , the function $S_{\mathbb{I}_{k}}(\cdot)$ is defined as $S_{\mathbb{I}_{k}}(\mathbf{v})=\mathbb{I}_{k}(\mathbf{v})\odot\mathbf{v}$ , where $\mathbf{v}\in\mathbb{R}^{n}$ is the input vector, and $\mathbb{I}_{k}(\mathbf{v})$ is a binary vector consisting of $k$ ones and $n-k$ zeros determined by $\mathbf{v}$ . If $S_{\mathbb{I}_{k}}(\mathbf{v})$ satisfies that $\forall\mathbf{v}\in\mathbb{R}^{n}$ :

$$\angle\langle S_{\mathbb{I}_{k}}(\mathbf{v}),\mathbf{v}\rangle\leq\arccos\frac{k}{n}$$

we call $S_{\mathbb{I}_{k}}(\mathbf{v})$ a sparsifying function and define its sparse ratio as $r=k/n$.

Definition 4 ( $top_{k}$ ).

Given an integer $k\in[0,n]$ , for vector $\mathbf{v}=(v_{1},\cdots,v_{n})^{\rm T}\in\mathbb{R}^{n}$ where $|v_{\pi_{1}}|\geq\cdots\geq|v_{\pi_{n}}|$ , the $top_{k}$ function is defined as:

$$top_{k}(\mathbf{v})=\mathbb{I}_{k}(\mathbf{v})\odot\mathbf{v}$$

where the $i$-th element of $\mathbb{I}_{k}(\mathbf{v})$ is $\mathbb{I}(i\in\{\pi_{1},\cdots,\pi_{k}\})$. In other words, the $top_{k}$ function preserves only the $k$ elements with the largest magnitude in the input vector.

It is straightforward that $top_{k}$ is a special sparsifying function (see Appendix B.3).
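The $top_{k}$ function of Definition 4 can be sketched in a few lines. The example below is a minimal illustration on a hypothetical vector; it also checks numerically that the angle between $top_{k}(\mathbf{v})$ and $\mathbf{v}$ stays below $\arccos\frac{k}{n}$ for this input. The helper names and the tie-breaking by original index are assumptions of the sketch.

```python
import math

def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    # Indices sorted by decreasing |v_i|; the first k form the mask I_k(v).
    order = sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)
    keep = set(order[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(v)]

def angle(a, b):
    """Angle in radians between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return math.acos(max(-1.0, min(1.0, dot / (na * nb))))

v = [0.1, -3.0, 0.5, 2.0, -0.2, 0.05]
s = top_k(v, 2)
print(s)                                      # only the entries -3.0 and 2.0 survive
print(angle(s, v) <= math.acos(2 / len(v)))   # sparsifying-function condition for this v
```

Because the kept entries carry most of the norm of `v`, the sparse vector points in almost the same direction as the original one, which is the property the later convergence analysis relies on.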

4 A Unified Sparse Backpropagation Framework

This section presents a unified framework for sparse backpropagation (SBP), which can be used to explain some existing representative approaches [4, 5]. We first define the estimated gradient descent (EGD) algorithm and then formally introduce the proposed framework.

4.1 Estimated Gradient Descent

Here we introduce the definition of the EGD algorithm, which serves as a base for analyzing the convergence of sparse backpropagation.

Definition 5 (EGD).

Suppose $\ell=\ell(\mathbf{w};\mathcal{D})$ is the training loss defined on the dataset $\mathcal{D}$ and $\mathbf{w}\in\mathbb{R}^{n}$ is the parameter vector to learn. The EGD algorithm adopts the following parameter update:

$$\mathbf{w}_{t+1}=\mathbf{w}_{t}-\eta_{t}\mathbf{g}_{t}^{\mathbf{w}}$$

where $\mathbf{w}_{t}$ is the parameter at time-step $t$ , $\eta_{t}>0$ is the learning rate, and $\mathbf{g}_{t}^{\mathbf{w}}$ is an estimation of the true gradient $\frac{\partial\ell}{\partial\mathbf{w}_{t}}$ for parameter updates.

Some existing optimizers can be regarded as special cases of EGD. For instance, when $\mathbf{g}_{t}^{\mathbf{w}}$ is defined as the true gradient $\frac{\partial\ell}{\partial\mathbf{w}_{t}}$, EGD is essentially the gradient descent (GD) algorithm. Several other works (e.g., Adam [28], AdaDelta [29]) can also be summarized as different expressions of EGD when $\mathbf{g}_{t}^{\mathbf{w}}$ is implemented as different estimates. More importantly, sparse backpropagation in essence employs the estimated gradient to approximate the true gradient for model training, so it can also be regarded as a special case of EGD. This connection lays the cornerstone of the subsequent theoretical analysis of sparse backpropagation.
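The EGD update can be sketched as a generic loop in which the gradient estimate is a pluggable function. The example below is an illustration under stated assumptions: the quadratic loss, the constant learning rate (the definition allows a step-dependent $\eta_{t}$), and the names `egd` and `top_k` are all hypothetical choices for this sketch.

```python
import math

def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    order = sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)
    keep = set(order[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(v)]

def egd(w0, true_grad, estimate, lr, steps):
    """Estimated gradient descent: w_{t+1} = w_t - lr * g_t."""
    w = list(w0)
    for _ in range(steps):
        g = estimate(true_grad(w))   # g_t^w, an estimate of the true gradient
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

# Hypothetical loss l(w) = 0.5 * sum(c_i * w_i^2), minimised at w* = 0.
c = [1.0, 2.0, 4.0]
grad = lambda w: [ci * wi for ci, wi in zip(c, w)]
norm = lambda w: math.sqrt(sum(x * x for x in w))

# Identity estimate recovers plain GD; top-1 gives a sparse estimate.
w_gd = egd([1.0, 1.0, 1.0], grad, lambda g: g, lr=0.25, steps=200)
w_sparse = egd([1.0, 1.0, 1.0], grad, lambda g: top_k(g, 1), lr=0.25, steps=200)
print(norm(w_gd), norm(w_sparse))
```

With the identity estimate the loop is ordinary gradient descent; with the top-1 estimate only one coordinate moves per step, yet on this toy problem the iterate still approaches the minimiser.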

In this work, we theoretically show that once the gradient estimation angle $\delta(\mathbf{w}_{t})$ of the parameter $\mathbf{w}_{t}$ satisfies certain conditions for each time-step $t$, the EGD algorithm can converge to the global minimum $\mathbf{w}^{*}$ under some reasonable assumptions. This conclusion is stated in Theorem 1. Readers can refer to Appendix C.1 for the detailed proofs.

Theorem 1 (Convergence of EGD).

Suppose $\mathbf{w}_{t}$ is the parameter vector at time-step $t$, $\mathbf{w}^{*}$ is the global minimum, and the training loss $\ell=\ell(\mathbf{w};\mathcal{D})$ defined on the dataset $\mathcal{D}$ is $\mu$-strongly convex and $L$-smooth. When applying the EGD algorithm to minimize $\ell$, if the gradient estimation angle $\delta(\mathbf{w}_{t})$ of $\mathbf{w}_{t}$ satisfies $\delta(\mathbf{w}_{t})+\phi(\ell)\leq\theta<{\pi}/{2}$, then there exists a learning rate $\eta_{t}>0$ for each time-step $t$ such that

$$\|\mathbf{w}_{t+1}-\mathbf{w}^{*}\|\leq\sin\theta\,\|\mathbf{w}_{t}-\mathbf{w}^{*}\|$$

Furthermore, $\mathbf{w}_{t}$ converges to $\mathbf{w}^{*}$ and the convergence speed is $O(\log\frac{1}{\epsilon})$ :

$\forall\epsilon\in(0,\|\mathbf{w}_{0}-\mathbf{w}^{*}\|)$, $\exists T(\epsilon)=\log\frac{\|\mathbf{w}_{0}-\mathbf{w}^{*}\|}{\epsilon}\big/\log\frac{1}{\sin\theta}$ s.t. $\forall T\geq T(\epsilon)$:

$$\|\mathbf{w}_{T}-\mathbf{w}^{*}\|\leq\epsilon$$

where $T(\epsilon)$ is the maximum iteration number required for convergence (depending on $\epsilon$).

For the given training loss $\ell=\ell(\mathbf{w};\mathcal{D})$, $\phi(\ell)$ is a fixed value. Therefore, Theorem 1 demonstrates that the EGD algorithm can converge to the global minimum $\mathbf{w}^{*}$, with convergence speed $O(\log\frac{1}{\epsilon})$, when the gradient estimation angle $\delta(\mathbf{w}_{t})$ between the estimated gradient $\mathbf{g}_{t}^{\mathbf{w}}$ and the true gradient $\frac{\partial\ell}{\partial\mathbf{w}_{t}}$ is small enough at each time-step.
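To get a feel for the iteration bound $T(\epsilon)=\log\frac{\|\mathbf{w}_{0}-\mathbf{w}^{*}\|}{\epsilon}\big/\log\frac{1}{\sin\theta}$ in Theorem 1, the sketch below evaluates it for a few values of $\theta$. The starting distance, the target $\epsilon$ and the function name `iterations_needed` are hypothetical choices for illustration, and the result is rounded up to an integer.

```python
import math

def iterations_needed(dist0, eps, theta):
    """T(eps) = log(dist0 / eps) / log(1 / sin(theta)), rounded up."""
    if not (0 < eps < dist0 and 0 < theta < math.pi / 2):
        raise ValueError("need 0 < eps < dist0 and 0 < theta < pi/2")
    return math.ceil(math.log(dist0 / eps) / math.log(1.0 / math.sin(theta)))

# Hypothetical numbers: start at distance 1 from w*, target distance 1e-6.
for deg in (30, 60, 85):
    print(deg, iterations_needed(1.0, 1e-6, math.radians(deg)))
```

The bound grows quickly as $\theta$ approaches $\pi/2$, since the per-step contraction factor $\sin\theta$ approaches 1. This is why a small gradient estimation angle matters for fast convergence.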

The insights gained from Theorem 1 can be generalized to non-convex loss functions, as evidenced in [30]. Notice that, even for a non-convex loss function $\ell$, there still exists a neighbourhood $\mathcal{A}\subset\mathbb{R}^{n}$ of every local minimum $\mathbf{w}^{*}$ on which the loss function is restricted $\mu$-strongly convex and restricted $L$-smooth. Here, $\ell(\mathbf{w})=\ell(\mathbf{w};\mathcal{D})$ is restricted $\mu$-strongly convex on $\mathcal{A}$ if and only if, for any vectors $\mathbf{a},\mathbf{b}\in\mathcal{A}$, $\ell(\mathbf{b})\geq\ell(\mathbf{a})+\nabla\ell(\mathbf{a})\cdot(\mathbf{b}-\mathbf{a})+\frac{\mu}{2}\|\mathbf{b}-\mathbf{a}\|^{2}$, and restricted $L$-smooth on $\mathcal{A}$ if and only if, for any vectors $\mathbf{a},\mathbf{b}\in\mathcal{A}$, $\ell(\mathbf{b})\leq\ell(\mathbf{a})+\nabla\ell(\mathbf{a})\cdot(\mathbf{b}-\mathbf{a})+\frac{L}{2}\|\mathbf{b}-\mathbf{a}\|^{2}$. Although the assumption of Theorem 1 that the loss function is strongly convex and smooth holds only on $\mathcal{A}$ rather than on the entire $\mathbb{R}^{n}$, our results can be generalized to non-convex loss functions: when the weights $\mathbf{w}$ fall in $\mathcal{A}\subset\mathbb{R}^{n}$, they can converge to the local minimum $\mathbf{w}^{*}$ in $\mathcal{A}$, and the convergence speed is $O(\log\frac{1}{\epsilon})$.

Since the sparse backpropagation employs the estimated gradient to approximate the true gradient for model training, which can also be seen as a special case of EGD, Theorem 1 implies that gradient estimation angle can indicate the convergence of the sparse backpropagation algorithm.

4.2 Proposed Unified Sparse Backpropagation

In this section, we present a unified sparse backpropagation framework via sparsifying function (Definition 3). The core idea is that when performing backpropagation, the gradients propagated from the next layer are sparsified to achieve acceleration. Algorithm 1 presents the pseudo-code of our unified sparse backpropagation framework, which is described in detail as follows.

Considering that a computation unit composed of one linear transformation and one activation function is the cornerstone of various neural networks, we elaborate on our unified sparse backpropagation framework based on such a computational unit:

$$\mathbf{h}=\mathbf{W}\mathbf{x},\qquad\mathbf{z}=\sigma(\mathbf{h})$$

where $\mathbf{x}\in\mathbb{R}^{n}$ is the input vector, $\mathbf{W}\in\mathbb{R}^{m\times n}$ is the parameter matrix, and $\sigma(\cdot):\mathbb{R}^{m}\to\mathbb{R}^{m}$ denotes a pointwise activation function. In an MLP, if $\mathbf{x}$ represents the input of layer $l$, $\mathbf{z}$ can represent the output of layer $l$, which is also the input of layer $l+1$ if layer $l$ is not the last layer. Besides linear layers in an MLP, the computation unit can also be a fully-connected layer in a CNN, a gate layer in an LSTM, etc.

Then, the original backpropagation computes the gradient of the parameter matrix $\mathbf{W}$ and the input vector $\mathbf{x}$ as follows:

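$$\frac{\partial\ell}{\partial\mathbf{W}}\leftarrow\frac{\partial\ell}{\partial\mathbf{h}}\,\mathbf{x}^{\rm T},\qquad\frac{\partial\ell}{\partial\mathbf{x}}\leftarrow\mathbf{W}^{\rm T}\,\frac{\partial\ell}{\partial\mathbf{h}}$$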

In the proposed unified sparse backpropagation framework, the sparsifying function (Definition 3) is used to sparsify the gradient $\frac{\partial\ell}{\partial\mathbf{h}}$ propagated from the next layer, and the sparsified gradient is then propagated through the gradient computation graph according to the chain rule. Note that $\frac{\partial\ell}{\partial\mathbf{h}}$ is also an estimated gradient passed from the next layer. The gradient estimations are finally performed as follows:

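$$\frac{\partial\ell}{\partial\mathbf{W}}\leftarrow S_{\mathbb{I}_{k}}\Big(\frac{\partial\ell}{\partial\mathbf{h}}\Big)\,\mathbf{x}^{\rm T},\qquad\frac{\partial\ell}{\partial\mathbf{x}}\leftarrow\mathbf{W}^{\rm T}\,S_{\mathbb{I}_{k}}\Big(\frac{\partial\ell}{\partial\mathbf{h}}\Big)$$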

Since $top_{k}$ is a special sparsifying function (see Section 3), some existing approaches (e.g., meProp [4], meProp-CNN [5]) based on top-$k$ sparseness can be regarded as special cases of our framework. Depending on the specific task, the sparsifying function can take different forms to improve model performance.
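The sparse backward pass for one linear unit can be sketched as follows. This is a minimal illustration, assuming $top_{k}$ as the sparsifying function and the meProp-style estimates in which the sparsified gradient of $\mathbf{h}$ replaces the full one in both the weight gradient and the input gradient; the function name `sparse_backward` and the toy numbers are hypothetical.

```python
def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    order = sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)
    keep = set(order[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(v)]

def sparse_backward(W, x, grad_h, k):
    """Backward pass of h = W x using only the top-k entries of grad_h.

    Returns (g_W, g_x): estimates of dl/dW (m x n) and dl/dx (length n).
    """
    m, n = len(W), len(x)
    s = top_k(grad_h, k)
    rows = [i for i in range(m) if s[i] != 0.0]   # at most k active rows
    g_W = [[0.0] * n for _ in range(m)]
    g_x = [0.0] * n
    for i in rows:                                # only the active rows are touched
        for j in range(n):
            g_W[i][j] = s[i] * x[j]               # row i of S(grad_h) x^T
            g_x[j] += W[i][j] * s[i]              # contribution to W^T S(grad_h)
    return g_W, g_x

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # m = 3, n = 2
x = [1.0, -1.0]
grad_h = [0.1, -2.0, 0.3]                  # gradient w.r.t. h from the next layer
g_W, g_x = sparse_backward(W, x, grad_h, k=1)
print(g_W)
print(g_x)
```

Only the rows of $\mathbf{W}$ selected by the mask receive a non-zero update, which is where the saving over dense backpropagation comes from. Setting `k` to the full output size recovers the ordinary backward pass.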

However, a major challenge for sparse backpropagation is the lack of theoretical analysis. To remedy this, here we analyze the theoretical characteristics of the proposed framework. Using the fact that sparse backpropagation is a special case of EGD (Section 4.1), we show theoretically that, when applied to a multi-layer perceptron (MLP), the proposed framework can converge to the global minimum in probability under several reasonable conditions, which is formalized in Theorem 2.

Theorem 2 (Gradient estimation angle of SBP).

Suppose:

(1) The size $|\mathcal{D}|$ of the dataset $\mathcal{D}$ is large enough, and the data instances $(\mathbf{x},\mathbf{y})\in\mathcal{D}$ are independent and identically distributed (i.i.d.).

(2) The neural network is an MLP model. (There are several trivial constraints on the MLP; please refer to Appendix B.5 for more details.)

(3) We apply the unified sparse backpropagation to train the neural network and set the sparse ratio of every sparsifying function in SBP as $r$ .

Then we can get an estimation of the training loss $\ell=\ell(\mathbf{w};\mathcal{D})$ and $\delta(\mathbf{w})$ , the gradient estimation angle of parameter vector $\mathbf{w}$ , satisfies:

$\forall\theta\in(0,{\pi}/{2}),\exists r\in(0,1)$ , s.t.

$$\lim_{|\mathcal{D}|\to\infty}P\big(\delta(\mathbf{w})<\theta\big)=1$$

The key idea of the proof of Theorem 2 is to show that the angle between the sparse gradient and the true full gradient can be small enough for every single data instance, and then to prove that the gradient estimation angle on the full dataset can be small enough with probability one. Readers can refer to Appendix C.2 for the detailed proofs.

Theorem 2 reveals that the gradient estimation angle $\delta(\mathbf{w})$ of the parameter vector can be arbitrarily small with probability one when $\mathcal{D}$ is large enough. It implies that the sparse backpropagation algorithm is likely to converge because the gradient estimation angle can be small enough.

Although Theorem 2 is constrained by several additional conditions such as the base architecture of MLP, it is able to provide a degree of theoretical account for the proposed unified sparse backpropagation framework. Our efforts in these theoretical analyses are valuable because they help explain the effectiveness of not only our framework but also some existing approaches [4, 5] on the theoretical side.

5 Memorized Sparse Backpropagation

Although traditional sparse backpropagation is able to achieve significant acceleration by keeping only part of the elements in the full gradient, the unpropagated gradient may also contain a certain amount of useful information. Experimental results show that such information loss tends to bring negative effects (e.g., performance degradation in extremely sparse scenarios, poor stability in performance). To remedy this, this work proposes memorized sparse backpropagation (MSBP), which aims at alleviating the information loss by storing unpropagated gradients in memory for the next step of learning.

5.1 Proposed Memorized Sparse Backpropagation Method

The core component of the proposed MSBP is the memory mechanism, which enables MSBP to store unpropagated gradients for the next step of learning while propagating a critical portion of the gradient information. Formally, different from the unified sparse backpropagation in Section 4.2, we adopt the following gradient estimations:

[TABLE]

where $S_{\mathbb{I}_{k}}(\cdot)$ is a given sparsifying function and $\mathbf{m}$ is the memory storing unpropagated gradients from the last learning step. Then, the memory $\mathbf{m}$ is updated by the information of unpropagated gradients at the current learning step. Formally,

[TABLE]

where $\gamma\in(0,1)$ is the memory ratio, a hyper-parameter controlling the proportion of the unpropagated gradients that is memorized. In the limiting case $\gamma=0$ , the proposed MSBP degenerates to the unified sparse backpropagation, which completely discards unpropagated gradients. Before model training begins, we initialize the memory $\mathbf{m}$ to the zero vector. Algorithm 2 presents the pseudo code of MSBP. Figure 1 presents the dataflow of the proposed MSBP.
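The memory mechanism described above can be illustrated with a minimal Python sketch. Since the two update equations are not reproduced in this text, the sketch assumes one plausible reading: the memory is added to the incoming gradient of $\mathbf{h}$ before sparsification, and a $\gamma$ -scaled copy of the part that is not propagated is stored as the new memory. The names `top_k` and `msbp_step` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def top_k(v: List[float], k: int) -> List[float]:
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    keep = set(sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(v)]


def msbp_step(grad_h: List[float], memory: List[float], k: int,
              gamma: float) -> Tuple[List[float], List[float]]:
    """One memorized sparsification step (assumed form, see the lead-in).

    Returns the sparse gradient to propagate and the updated memory.
    """
    # Add the memory of previously unpropagated gradients.
    combined = [g + m for g, m in zip(grad_h, memory)]
    # Propagate only the k largest-magnitude components.
    sparse = top_k(combined, k)
    # Store a gamma-scaled copy of what was not propagated.
    new_memory = [gamma * (c - s) for c, s in zip(combined, sparse)]
    return sparse, new_memory


memory = [0.0, 0.0, 0.0, 0.0]      # zero-initialised before training
grad_h = [0.5, -2.0, 0.1, 1.0]
sparse, memory = msbp_step(grad_h, memory, k=2, gamma=0.5)
```

With `gamma=0.0` the returned memory stays at zero, so under this reading the step reduces to plain top-k sparsification, which matches the degenerate case discussed in the text.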

Intuitively, by storing unpropagated gradients with the memory mechanism, the information loss in backpropagation due to sparseness can be alleviated. The experiments also illustrate that the proposed MSBP is more advantageous in various respects than approaches that completely discard unpropagated gradients. In fact, we find that for MSBP, the angle between the sparse gradient and the true full gradient tends to be small. Furthermore, this angle is smaller than that in traditional sparse backpropagation. According to the theoretical analysis in Section 4, a smaller gradient estimation angle is more conducive to model convergence. This observation explains the effectiveness of our MSBP to a certain extent on the theoretical side. Readers can refer to Section 7 for a more detailed analysis.

Comparison to sparsified SGD with memory. A work that looks similar to this paper is sparsified SGD with memory [31], which equips sparse gradients with memory. It calculates full gradients in backpropagation and sparsifies them to be communicated in a distributed system. However, the backpropagation process remains unchanged and cannot be accelerated. Different from sparsified SGD with memory, we sparsify gradients in backpropagation and both the communication and backpropagation process can be accelerated. Besides, sparsified SGD with memory is an optimization approach that can only be used in distributed systems, while our MSBP can be applied to both distributed and centralized systems.

5.2 Implementations

Following meProp [4], we adopt $S_{\mathbb{I}_{k}}(\cdot)=top_{k}(\cdot)$ as the sparsifying function.

For simplicity, we use SBP to represent the traditional sparse backpropagation that completely discards unpropagated gradients with $top_{k}(\cdot)$ sparsifying function and MSBP to represent the proposed memorized sparse backpropagation with $top_{k}(\cdot)$ sparsifying function.

5.3 Discussion of Complexity Information

Table 1 compares the time and memory complexity of traditional SBP and our proposed MSBP. In this section, we discuss each of the two in turn.

Time complexity. The backpropagation process of the linear layer focuses on calculating gradients of $\mathbf{W}$ and $\mathbf{x}$ , the time complexity of which is $O(mn)$ . The application of SBP consists of two steps: finding top- $k$ dimensions of the gradient of $\mathbf{h}$ using a maximum heap with time complexity of $O(n\log k)$ and backpropagating only top- $k$ dimensions of gradients with time complexity of $O(mk)$ . The extra time cost of MSBP comes from adding the memory information into the gradient of $\mathbf{h}$ and updating the memory. The time complexity of these two operations is $O(n)$ , which is negligible compared to $O(mk+n\log k)$ .
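The two steps of SBP described above, selecting the top- $k$ dimensions and backpropagating only those, can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the implementation used in the experiments: the layer is taken to be a plain linear map $\mathbf{h}=\mathbf{W}\mathbf{x}$ without activation, `heapq.nlargest` from the Python standard library stands in for the heap-based selection, and the names `top_k_indices` and `sparse_backward` are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


def top_k_indices(grad: List[float], k: int) -> List[int]:
    """Indices of the k largest-magnitude entries of grad."""
    return heapq.nlargest(k, range(len(grad)), key=lambda i: abs(grad[i]))


def sparse_backward(W: List[List[float]], x: List[float], grad_h: List[float],
                    k: int) -> Tuple[Dict[int, List[float]], List[float]]:
    """Backpropagate only the k selected rows of a linear layer h = W x."""
    idx = top_k_indices(grad_h, k)
    grad_x = [0.0] * len(x)
    grad_W: Dict[int, List[float]] = {}      # only the selected rows are stored
    for i in idx:
        # Row i of dL/dW is grad_h[i] * x; all other rows are treated as zero.
        grad_W[i] = [grad_h[i] * xj for xj in x]
        # dL/dx accumulates W[i, :] * grad_h[i] over the selected rows only.
        for j in range(len(x)):
            grad_x[j] += W[i][j] * grad_h[i]
    return grad_W, grad_x


W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
x = [1.0, -1.0]
grad_h = [0.1, -3.0, 2.0]
grad_W, grad_x = sparse_backward(W, x, grad_h, k=2)
```

Only the $k$ selected rows of $\mathbf{W}$ are touched in the loop, which is where the saving over the dense computation comes from.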

Memory complexity. The analysis of memory complexity is similar. The backpropagation of the linear layer requires storing gradients of $\mathbf{W}$ and $\mathbf{x}$ , whose memory complexity is $O(mn)$ . For traditional SBP, the memory complexity of finding top- $k$ dimensions of the gradient of $\mathbf{h}$ with a maximum heap is $O(k)$ , while the backpropagation of corresponding dimensions of gradients requires no additional memory overhead. The extra memory cost of MSBP is the memory vector, whose memory complexity is $O(n)$ and negligible compared to $O(mn)$ .

6 Experiments

6.1 Experimental Settings

We evaluate the proposed MSBP on several typical benchmark tasks. The baselines used for comparison on each task are also introduced.

MNIST image recognition (MNIST). The task on the MNIST handwritten digit dataset [32] is to recognize the numerical digit (0-9) in each image. The numbers of training, development, and test images are 55,000, 5,000, and 10,000 respectively. The evaluation metric is classification accuracy. The base model is a 3-layer MLP.

Transition-based dependency parsing (Parsing). In this task, we use English Penn TreeBank (PTB) [33] for experiments. The numbers of training, development and test transitions are $1,900,056$ , $80,234$ and $113,368$ respectively. Each transition example contains a parsing context and its optimal transition action. The evaluation metric is the unlabeled attachment score (UAS). Following [34], the base model is an MLP-based parser. In training, gradients are clipped [35] to 5.

Part-of-speech tagging (POS-Tag). In this task, we use the standard benchmark dataset derived from Penn Treebank corpus [36]. The numbers of training and test examples are $38,219$ and $5,462$ respectively. The evaluation metric is per-word accuracy. Following [37], the base model is a $2$ -layer bi-directional LSTM (Bi-LSTM). In addition, we use $100$ -dim pre-trained GloVe [38] embeddings to initialize the word embeddings.

The numbers of training epochs are set to $20$ , $20$ , and $10$ on the three tasks of MNIST, Parsing, and POS-Tag respectively. The dropout [24] probability is set to $0.1$ , $0.2$ , and $0.5$ respectively. The Adam optimizer [28] with a learning rate of $10^{-3}$ is used on all three tasks. The batch size is set to $32$ , $1024$ , and $128$ respectively. The time-cost experiments are conducted on a computer with an Intel(R) Core(TM) i5-8400 CPU @ 2.80 GHz.

Besides MNIST, Parsing, and POS Tagging tasks, we also conduct experiments on CNN and sequence-to-sequence models to illustrate the universality of our proposed method to a wide range of network architectures.

Polarity classification (Polarity) and subjectivity classification (Subjectivity). The dataset is constructed by [39]. Both tasks are designed to perform sentence classification, with accuracy as the evaluation metric. For these two tasks, every experiment is repeated $10$ times and we report the averaged accuracy on the test set. The base model is TextCNN [40]. The filter window sizes of TextCNN are $3$ , $4$ , and $5$ , with $100$ feature maps each. The optimizer is Adam and the learning rate is $10^{-3}$ . The batch size is set to $32$ . We train models for $10$ epochs.

English-Vietnamese Translation (En-Vi). The translated TED talks of IWSLT 2015 Evaluation Campaign [41], containing $133K$ training sentence pairs, are adopted as the training data. TED tst2012 and TED tst2013 are adopted as the development set and test set respectively.

Simplified Chinese-English Translation (Chs-En). Following [42], the LDC simplified Chinese-English translation dataset, containing $1.25M$ sentence pairs with about $28M$ Chinese words and about $35M$ English words, is adopted as the training data. NIST 2002 is adopted as the development set, and the test set is formed by merging NIST 2003-2006.

On the En-Vi and Chs-En translation tasks, the evaluation metric is BLEU score [43]. The base model is an LSTM-based sequence-to-sequence (seq-to-seq) model. The encoder is a $3$ -layer bidirectional LSTM encoder and the decoder is a $3$ -layer LSTM decoder. The dropout rate is $0.4$ and $0.3$ respectively. The embedding size and hidden size are both $512$ . The attention type is Luong-style [44] and the beam search size is $10$ [45]. The optimizer is Adam and the learning rate is $10^{-3}$ . The mini-batch size is set to $64$ . The numbers of training epochs are set to $40$ and $80$ respectively on the two translation tasks. SBP and MSBP are applied to each hidden layer of the base models.

6.2 Experimental Results

The experimental results on three tasks of MNIST, Parsing, and POS-Tag are shown in Table 2. An in-depth analysis of the results is provided from the following aspects.

Improving model performance. As shown in Table 2, the proposed MSBP achieves the best performance on all tasks. Considering that our ultimate goal is to accelerate neural network learning while achieving comparable model performance, such results are promising and gratifying. Compared to traditional SBP [4, 5], MSBP employs the memory mechanism to store unpropagated gradients. This reduces the information loss during backpropagation, leading to improvements in the model performance.

Accelerating backpropagation. In contrast to traditional SBP, our MSBP memorizes unpropagated gradients to alleviate information loss. However, a potential issue is that the introduction of memory containing unpropagated gradients may impair the acceleration of backpropagation. As shown in Table 2, both traditional SBP and our proposed MSBP achieve great acceleration of backpropagation, and the latter shows only a negligible increase in computational cost compared to the former. This illustrates that our MSBP can achieve comparable acceleration while improving model performance.

Applying to CNNs and Deep Seq-to-seq Models. The SBP and MSBP methods are also applicable to convolution layers and deep sequence-to-sequence models. Following meProp-CNN [5], the SBP method is implemented on CNNs, and MSBP is then implemented on CNNs in a similar way. Following alternating Top-k selection [7], the sequence-to-sequence version of SBP, the SBP and MSBP methods are applied to the encoder and the decoder iteratively. As shown in Tables 3 and 4, the proposed MSBP outperforms the baseline and the SBP method on TextCNN and deep sequence-to-sequence models.

6.3 Related Systems of Evaluation Tasks

This section presents evaluation scores of related systems on each task to illustrate the competitive performance of our approach. To verify the effectiveness of the proposed method, a range of deep learning models are implemented, including the MLP, LSTM, CNN, and sequence-to-sequence models.

For MLP, the MLP-based approaches can achieve around $98\%$ [46, 32] accuracy on MNIST, while our method achieves $98.23\%$ . For LSTM, the reported accuracy in existing approaches lies between $97.2\%$ and $97.4\%$ [47, 48, 49] on POS-Tag, whereas our method can achieve $97.50\%$ accuracy. As for CNN models, TextCNN [40] reports around $81.3\%$ and $93.4\%$ on polarity classification and subjectivity classification respectively, while our method achieves around $81.5\%$ and $93.8\%$ respectively.

7 Further In-Depth Analysis

7.1 Influence of different hyper-parameters.

In order to explore the influence of different hyper-parameters on model performance and stability, for experiments on MNIST dataset, we select the sparse ratio $k$ among $\{5,10,20\}$ and the memory ratio $\gamma$ among $\{0.1,0.2,0.3,\cdots,0.9\}$ . All experiments are repeated 20 times for each setup of $k$ and $\gamma$ . The mean and standard deviation of the accuracy of repeated experiments are presented in Figure 3 and Figure 3, respectively. The influence of hyper-parameters on backpropagation time cost is also explored; the results are shown in Figure 5.

Model performance. As depicted in Figure 3, smaller $k$ tends to lead to worse performance for both SBP and MSBP because only a small amount of gradient information is propagated when $k$ is small and the backpropagation suffers from information loss. For the same $k$ , MSBP ( $\gamma>0$ ) performs better than traditional SBP ( $\gamma=0$ ) regardless of the choice of $\gamma$ . The difference in accuracy between MSBP and traditional SBP ranges from 0.4% to 0.8%, while that between MSBP with different $\gamma$ settings lies between 0.1% and 0.3%. This implies that the performance of MSBP is not very sensitive to $\gamma$ compared to the improvement gained by the memory mechanism.

Model stability. Our proposed MSBP also has higher stability than SBP, indicating that it contributes to reducing the variance of the model performance in repeated experiments. Figure 3 shows that the traditional SBP ( $\gamma=0$ ) suffers from poor model stability in repeated experiments: its standard deviation is nearly $1.7$ times that of the base model (MLP). In contrast, all experiments conducted with MSBP ( $\gamma>0$ ) have higher stability than traditional SBP regardless of the choice of $\gamma$ .

Backpropagation time cost. In experiments, the forward propagation time costs of the base model and our proposed SBP and MSBP models are nearly the same. Therefore, we analyze the backpropagation time cost. Figure 5 shows the backpropagation time cost with different settings ( $\gamma=0.1$ ). In experiments, choices of $\gamma$ do not influence the time cost of MSBP significantly. As analyzed in Section 5.3, the extra time cost of MSBP is negligible compared to SBP, which is also verified in Figure 5. In Figure 5, backpropagation time is approximately proportional to $k$ (or the sparse ratio $r$ ), which accords with our analysis in Section 5.3. Higher $k$ tends to lead to better performance but less backpropagation acceleration. This means that there exists a tradeoff between backpropagation acceleration and the performance of SBP and MSBP.

7.2 Further verification.

Analysis of run-time gradient estimation angles. We further verify whether our proposed MSBP can give a more accurate estimation of the true gradient. As analyzed in Section 4, for sparse backpropagation, a smaller gradient estimation angle can better guarantee the convergence of the approach. Therefore, we calculate the gradient estimation angle of the average gradients on the whole dataset in the proposed MSBP and traditional SBP to empirically explain the effectiveness of our method ( $\gamma=0.1$ ). As shown in Figure 5, higher $k$ results in smaller gradient estimation angles and for the same $k$ , the gradient estimation angles in MSBP are smaller than those in SBP. This illustrates that by employing the memory mechanism to store unpropagated gradients, the sparse gradient calculated by our approach gives a more accurate estimation of the true gradient, which is also consistent with results in Figure 3. In addition, the gap between the gradient estimation angles of SBP and MSBP tends to be bigger for lower $k$ because SBP suffers more from the loss of unpropagated gradients for lower $k$ , under which circumstances our proposal improves the performance to a larger extent.

Statistical test. Further, we conduct statistical tests to verify that, for MSBP and SBP with the same sparse ratio, 1) MSBP outperforms traditional SBP under different settings; and 2) MSBP has better stability than traditional SBP under different settings. The results show that for nearly all settings of $k$ and $\gamma$ , the advantages of MSBP over traditional SBP in both performance and stability are statistically significant $(p<0.05)$ . Please refer to Appendix D for details.

7.3 Applicability to multiple scenarios

Applicability to extremely sparse scenarios. In sparse backpropagation, the sparse ratio $r$ controls the trade-off between acceleration and model performance. In pursuit of ultra-large acceleration, $r$ tends to be set to extremely small values in real-life scenarios. However, we empirically find that traditional SBP usually results in a significant degradation in model performance in this case. Table 5 shows that traditional SBP brings a $1.90\%$ or $1.93\%$ reduction in accuracy on MNIST image classification for $r=0.001$ or $r=0.002$ , which is a notable gap and undermines the practicality of applying meProp in extremely sparse cases. The reason is that for small $r$ values, only a very small amount of gradient information is propagated. Therefore, there exists serious information loss during backpropagation, leading to a significant degradation in model performance. In contrast, results show that in these extremely sparse scenarios, the loss of accuracy of MSBP compared to the baseline is tolerable when $r=0.001$ and our MSBP works as effectively as the base model when $r=0.002$ . With the memory mechanism, the current unpropagated gradient information is stored for the next step of learning, reducing the information loss caused by sparseness.

Applicability to different base network architectures. We compare traditional SBP and the proposed MSBP on other base network architectures, namely the CNN-based model and deep sequence-to-sequence models, in Section 6.2 to verify the universality of our approach. Experimental results in Tables 3 and 4 show that the proposed MSBP improves the performance of the base model on both sentence classification and machine translation tasks. This demonstrates that our MSBP is universal and applies to different types of base networks and tasks.

Applicability to large-scale datasets. Experiments are also conducted on the simplified Chinese-English translation task, which uses a large-scale dataset containing $1.25M$ sentence pairs. Experimental results in Table 4 demonstrate that the proposed MSBP can also be applied to large-scale datasets.

8 Conclusion and Future Work

This work presents a unified sparse backpropagation (SBP) framework. Some previous representative approaches can be regarded as special cases under this framework. Besides, the theoretical characteristics of the proposed framework are analyzed in detail to provide theoretical accounts for the relevant methods. Analysis reveals that when applied to a multilayer perceptron, our framework essentially performs gradient descent using an estimated gradient similar enough to the true gradient, resulting in convergence in probability under certain conditions. Going a step further, we propose memorized sparse backpropagation (MSBP), which aims at alleviating the information loss in traditional sparse backpropagation by utilizing the memory mechanism to store unpropagated gradients. The experiments demonstrate that the proposed MSBP exhibits better performance while achieving comparable acceleration. Further analysis also shows that the performance of MSBP is not very sensitive to the memory ratio compared to the improvement gained by the memory mechanism. The proposed MSBP method also has higher stability than the SBP method, and the extra time cost of MSBP is negligible compared to the SBP method.

In this work, the memory ratio is set as an adjustable hyper-parameter. However, the structural characteristics of different samples show natural differences, which may make the optimal memory ratio vary. Therefore, we will study adaptive methods that can automatically control the ratio of memorizing unpropagated gradients in the future.

Appendix A Review of Definitions and Theorems in Paper

In this section, we review some important definitions and theorems introduced in the paper.

A.1 Definitions

Given the dataset $\mathcal{D}=\{(\mathbf{x},\mathbf{y})\}$ , the training loss of an input instance $\mathbf{x}$ is defined as $\ell\big{(}\mathbf{w};(\mathbf{x},\mathbf{y})\big{)}$ , where $\mathbf{w}$ denotes the learnable model parameters and $\ell(\cdot,\cdot)$ is some loss function such as $\ell_{2}$ or logistic loss. Further, the training loss on the whole dataset $\mathcal{D}$ is defined as:

[TABLE]

We represent the angle between the vector $\mathbf{a}$ and the vector $\mathbf{b}$ as:

[TABLE]

Definition 1 (Convex-smooth angle).

If the training loss $\ell=\ell(\mathbf{w};\mathcal{D})$ on the dataset $\mathcal{D}$ is $\mu$ -strongly convex and $L$ -smooth for parameter vector $\mathbf{w}$ , the convex-smooth angle of $\ell$ is defined as:

[TABLE]

Definition 2 (Gradient estimation angle).

For any vector $\mathbf{v}$ and training loss $\ell$ on an instance or whole dataset, we use $\mathbf{g^{v}}$ to represent an estimation of the true gradient $\frac{\partial\ell}{\partial\mathbf{v}}$ . Then, the gradient estimation angle between the estimated gradient $\mathbf{g^{v}}$ and the true gradient $\frac{\partial\ell}{\partial\mathbf{v}}$ is defined as:

[TABLE]
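As an illustration of Definition 2, the following minimal Python sketch computes the angle between an estimated gradient and the true gradient. The displayed formulas are not reproduced in this text, so the sketch assumes the standard definition of the angle between two vectors, namely the arccosine of their normalized inner product. The function name `angle` and the example vectors are hypothetical.

```python
import math
from typing import Sequence


def angle(a: Sequence[float], b: Sequence[float]) -> float:
    """Angle in radians between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Clamp to [-1, 1] to guard against floating-point round-off.
    cos = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cos)


true_grad = [3.0, 4.0, 0.0]
estimate = [0.0, 4.0, 0.0]    # the true gradient with only its largest entry kept
delta = angle(estimate, true_grad)
```

An estimate that coincides with the true gradient gives an angle of zero, and the angle grows as more of the gradient mass is dropped.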

Definition 3 (Sparsifying function).

Given an integer $k\in[0,n]$ , the function $S_{\mathbb{I}_{k}}(\cdot)$ is defined as $S_{\mathbb{I}_{k}}(\mathbf{v})=\mathbb{I}_{k}(\mathbf{v})\odot\mathbf{v}$ , where $\mathbf{v}\in\mathbb{R}^{n}$ is the input vector, and $\mathbb{I}_{k}(\mathbf{v})$ is a binary vector consisting of $k$ ones and $n-k$ zeros determined by $\mathbf{v}$ . If $S_{\mathbb{I}_{k}}(\mathbf{v})$ satisfies that $\forall\mathbf{v}\in\mathbb{R}^{n}$ :

[TABLE]

we call $S_{\mathbb{I}_{k}}(\mathbf{v})$ sparsifying function and define its sparse ratio as $r=k/n$ .

Definition 4 ( $top_{k}$ ).

Given an integer $k\in[0,n]$ , for vector $\mathbf{v}=(v_{1},\cdots,v_{n})^{\rm T}\in\mathbb{R}^{n}$ where $|v_{\pi_{1}}|\geq\cdots\geq|v_{\pi_{n}}|$ , the $top_{k}$ function is defined as:

[TABLE]

where the $i$ -th element of $\mathbb{I}_{k}(\mathbf{v})$ is $\mathbb{I}(i\in\{\pi_{1},\cdots,\pi_{k}\})$ . In other words, the $top_{k}$ function only preserves $k$ elements with the largest magnitude in the input vector.

It is easy to verify that $top_{k}$ is a special sparsifying function (see Appendix B.3).

Definition 5 (EGD).

Suppose $\ell=\ell(\mathbf{w};\mathcal{D})$ is the training loss defined on the dataset $\mathcal{D}$ and $\mathbf{w}\in\mathbb{R}^{n}$ is the parameter vector to learn. The estimated gradient descent (EGD) algorithm adopts the following parameter update:

[TABLE]

where $\mathbf{w}_{t}$ is the parameter at time-step $t$ , $\eta_{t}>0$ is the learning rate, and $\mathbf{g}_{t}^{\mathbf{w}}$ is an estimation of the true gradient $\frac{\partial\ell}{\partial\mathbf{w}_{t}}$ for parameter updates.
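To make Definition 5 concrete, the sketch below runs estimated gradient descent on a toy quadratic loss, using a top- $k$ sparsified gradient as the estimate. Several choices here are assumptions made purely for illustration: the update is taken to be $\mathbf{w}_{t+1}=\mathbf{w}_{t}-\eta_{t}\mathbf{g}_{t}^{\mathbf{w}}$ , the loss is $\frac{1}{2}\sum_{i}c_{i}w_{i}^{2}$ , and the learning rate is fixed rather than chosen as in Theorem 1. The names `top_k` and `egd` are hypothetical.

```python
from typing import List


def top_k(v: List[float], k: int) -> List[float]:
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    keep = set(sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(v)]


def egd(w: List[float], curvature: List[float], k: int,
        lr: float, steps: int) -> List[float]:
    """Minimise 0.5 * sum(c_i * w_i**2) with a top-k sparsified gradient."""
    for _ in range(steps):
        true_grad = [c * x for c, x in zip(curvature, w)]
        est_grad = top_k(true_grad, k)                   # estimated gradient g_t
        w = [x - lr * g for x, g in zip(w, est_grad)]    # w_{t+1} = w_t - lr * g_t
    return w


w_final = egd([1.0, -2.0, 3.0], curvature=[1.0, 2.0, 4.0], k=1, lr=0.2, steps=200)
```

Even with only one coordinate updated per step, the iterate in this toy problem approaches the minimizer at the origin, because each update shrinks the coordinate with the currently largest gradient.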

A.2 Theorems

Theorem 1 (Convergence of EGD).

Suppose $\mathbf{w}_{t}$ is the parameter vector of time-step $t$ , $\mathbf{w}^{*}$ is the global minimum, and training loss $\ell=\ell(\mathbf{w};\mathcal{D})$ defined on the dataset $\mathcal{D}$ is $\mu$ -strongly convex and $L$ -smooth. When applying the EGD algorithm to minimize $\ell$ , if the gradient estimation angle $\delta(\mathbf{w}_{t})$ of $\mathbf{w}_{t}$ satisfies $\delta(\mathbf{w}_{t})+\phi(\ell)\leq\theta<{\pi}/{2}$ , then there exists learning rate $\eta_{t}>0$ for each time-step $t$ such that

[TABLE]

Furthermore, $\mathbf{w}_{t}$ converges to $\mathbf{w}^{*}$ and the convergence speed is $O(\log\frac{1}{\epsilon})$ :

$\forall\epsilon\in(0,\|\mathbf{w}_{0}-\mathbf{w}^{*}\|),\exists T(\epsilon)=\log\frac{\|\mathbf{w}_{0}-\mathbf{w}^{*}\|}{\epsilon}\big{/}\log{\frac{1}{\sin\theta}}$ s.t. $\forall T\geq T(\epsilon)$

[TABLE]

where $T(\epsilon)$ is the maximum iteration number required for convergence (depending on $\epsilon$ ).
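The iteration bound $T(\epsilon)$ in Theorem 1 can be evaluated numerically to see how it depends on $\theta$ . The following Python sketch simply transcribes the stated formula; the function name `iterations_needed` and the example values are hypothetical.

```python
import math


def iterations_needed(initial_distance: float, epsilon: float, theta: float) -> float:
    """T(eps) = log(||w_0 - w*|| / eps) / log(1 / sin(theta))."""
    if not 0.0 < theta < math.pi / 2:
        raise ValueError("theta must lie in (0, pi/2)")
    if not 0.0 < epsilon < initial_distance:
        raise ValueError("epsilon must lie in (0, initial_distance)")
    return math.log(initial_distance / epsilon) / math.log(1.0 / math.sin(theta))


t_small_angle = iterations_needed(1.0, 1e-2, math.pi / 6)   # sin(theta) is about 0.5
t_large_angle = iterations_needed(1.0, 1e-2, math.pi / 3)   # sin(theta) is about 0.87
```

As $\theta$ approaches $\pi/2$ , the denominator $\log\frac{1}{\sin\theta}$ tends to zero and the bound grows without limit, so a smaller $\theta$ yields a smaller iteration bound.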

Theorem 2 (Bounded gradient estimation angle of SBP).

Suppose:

(1) For the dataset $\mathcal{D}$ , the size $|\mathcal{D}|$ is large enough and the data instances $(\mathbf{x},\mathbf{y})\in\mathcal{D}$ are independent and identically distributed (i.i.d.).

(2) The neural network is an MLP model.

(3) We apply the unified sparse backpropagation to train the neural network and set the sparse ratio of every sparsifying function in SBP as $r$ .

Then we can get an estimation of the training loss $\ell=\ell(\mathbf{w};\mathcal{D})$ and $\delta(\mathbf{w})$ , the gradient estimation angle of parameter vector $\mathbf{w}$ , satisfies:

$\forall\theta\in(0,{\pi}/{2}),\exists r\in(0,1)$ , s.t.

[TABLE]

Appendix B Preparation and Lemmas

Here we introduce some key definitions and lemmas throughout the appendix. All vectors and matrices are assumed to belong to the real number field. In Appendix, vectors (e.g. $\mathbf{x},\mathbf{y}$ ) and matrices (e.g. $\mathbf{W},\mathbf{A}$ ) are in bold formatting.

B.1 Vectors

We first introduce two vector-related lemmas.

Lemma 1.

For any vectors $\mathbf{a}$ , $\mathbf{b}$ and $\mathbf{c}$ , we have

[TABLE]

Lemma 2.

For matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$ ( $m\geq n$ ), suppose $\mathbf{A}\mathbf{A}^{\text{T}}$ is a positive definite matrix, the eigenvalue decomposition of $\mathbf{A}\mathbf{A}^{\text{T}}$ is $\mathbf{A}\mathbf{A}^{\text{T}}=\mathbf{P}\mathbf{\Sigma}\mathbf{P}^{\text{T}}\in\mathbb{R}^{m\times m}$ , $\mathbf{\Sigma}=\text{diag}\{\sigma_{1},\sigma_{2},\cdots\,\sigma_{m}\}$ ( $\sigma_{i}>0$ ) and $\mathbf{P}$ is an orthogonal matrix. We define ${\sigma_{\text{min}}}=\min\limits_{i}\sigma_{i}$ and ${\sigma_{\text{max}}}=\max\limits_{i}\sigma_{i}$ . If $\rho\geq\frac{\sigma_{\text{max}}}{\sigma_{\text{min}}}\geq 1$ , for any $n$ -dimension vectors $\mathbf{u}$ and $\mathbf{v}$ , we have

[TABLE]

B.2 Loss Function

We define the loss function $\ell=\ell(\mathbf{w};\mathcal{D})$ as $\mu$ -strongly convex if for $\mu>0,\nabla^{2}\ell(\mathbf{x})\succeq\mu I$ , where $I$ denotes identity matrix. If the loss function $\ell$ is $\mu$ -strongly convex , for any vectors $\mathbf{a},\mathbf{b}$ , we have

[TABLE]

We define the loss function $\ell=\ell(\mathbf{w};\mathcal{D})$ as $L$ -smooth if for $L>0,\nabla^{2}\ell(\mathbf{x})\preceq LI$ , where $I$ denotes identity matrix. If the loss function $\ell$ is $L$ -smooth, for any vectors $\mathbf{a},\mathbf{b}$ , we have

[TABLE]

For the loss function $\ell$ , we define $\mathbf{w}^{*}$ as its global minimum. If $\ell$ is $L$ -smooth, for any $\mathbf{w}$ , we have

[TABLE]

From Eq.(35) and Eq.(36), we can see,

[TABLE]

In other words, $\mu\leq L$ . When $\mu=L$ , we have $\ell(\mathbf{b})=\ell(\mathbf{a})+\nabla\ell(\mathbf{a})\bm{\cdot}(\mathbf{b}-\mathbf{a})+\frac{L}{2}\|\mathbf{b}-\mathbf{a}\|^{2}$ . When we set $\mathbf{a}=\mathbf{0}=(0,0,\cdots,0)^{\text{T}}$ and $\mathbf{b}=\mathbf{x}$ , we obtain $\ell(\mathbf{x})=\ell(\mathbf{0})+\nabla\ell(\mathbf{0})\bm{\cdot}\mathbf{x}+\frac{L}{2}\|\mathbf{x}\|^{2}$ , which has a closed-form solution and is trivial. Therefore, we assume that in most cases $0<\mu<L$ .

Back to the definition of convex-smooth angle, if the loss function $\ell$ is $\mu$ -strongly convex and $L$ -smooth ( $0<\mu<L$ ), we can see the convex-smooth angle of $\ell$ is $\phi(\ell)=\arccos\sqrt{{\mu}/{L}}\in(0,\frac{\pi}{2})$ .
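The closed form $\phi(\ell)=\arccos\sqrt{\mu/L}$ and the condition $\delta(\mathbf{w}_{t})+\phi(\ell)\leq\theta<\pi/2$ from Theorem 1 can be combined into a small worked example. The Python sketch below is illustrative only; the function names `convex_smooth_angle` and `max_estimation_angle` and the values $\mu=1$ , $L=4$ are hypothetical.

```python
import math


def convex_smooth_angle(mu: float, L: float) -> float:
    """phi = arccos(sqrt(mu / L)) for a mu-strongly convex, L-smooth loss."""
    if not 0.0 < mu <= L:
        raise ValueError("expected 0 < mu <= L")
    return math.acos(math.sqrt(mu / L))


def max_estimation_angle(mu: float, L: float) -> float:
    """Exclusive upper bound on delta such that delta + phi stays below pi/2."""
    return math.pi / 2 - convex_smooth_angle(mu, L)


phi = convex_smooth_angle(1.0, 4.0)       # sqrt(1/4) = 0.5
budget = max_estimation_angle(1.0, 4.0)   # room left for the estimation angle
```

The worse conditioned the loss is (the smaller $\mu/L$ ), the larger $\phi(\ell)$ becomes and the less room remains for the gradient estimation angle.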

B.3 $top_{k}$ Function

We will prove the $top_{k}$ function is a special sparsifying function.

Given an integer $k\in[0,n]$ , for vector $\mathbf{v}=(v_{1},\cdots,v_{n})^{\rm T}\in\mathbb{R}^{n}$ where $|v_{\pi_{1}}|\geq\cdots\geq|v_{\pi_{n}}|$ , the $top_{k}$ function is defined as $top_{k}(\mathbf{v})=\mathbb{I}_{k}(\mathbf{v})\odot\mathbf{v}$ where the $i$ -th element of $\mathbb{I}_{k}(\mathbf{v})$ is $s_{i}=\mathbb{I}(i\in\{\pi_{1},\cdots,\pi_{k}\})$ . It is easy to verify that

[TABLE]

Therefore, the $top_{k}$ function is a special sparsifying function.

B.4 Linear Layer Trained with SBP

Consider a linear layer with one linear transformation and one increasing pointwise activation function

[TABLE]

where $\mathbf{x}\in\mathbb{R}^{n}$ is the input sample, $\mathbf{W}\in\mathbb{R}^{m\times n}$ is the parameter matrix ( $m\geq n$ ), $n$ is the dimension of the input vector, $m$ is the dimension of the output vector and $\sigma$ is an increasing pointwise activation function (e.g., $\sigma(x)=x$ , $\sigma(x)=\tanh(x)$ or $\sigma(x)=\text{sigmoid}(x)$ ).

For matrix $\mathbf{W}\in\mathbb{R}^{m\times n}$ , we define flattening function to flatten it into a vector in $\mathbb{R}^{nm}$ as $flatten(\mathbf{W})=[\mathbf{W}_{:,1};\cdots;\mathbf{W}_{:,n}]$ , where $\mathbf{W}_{:,i}$ represents the $i$ -th column of $\mathbf{W}$ and the semicolon denotes the concatenation of many column vectors to a long column vector. In other words, $flatten(\mathbf{W})_{(j-1)m+i}=W_{ij}$ .
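The column-wise flattening defined above can be checked with a short Python sketch. The matrix is represented as a list of rows, which is an implementation choice made here for illustration; with zero-based indices, the identity $flatten(\mathbf{W})_{(j-1)m+i}=W_{ij}$ becomes `w[j * m + i] == W[i][j]`.

```python
from typing import List


def flatten(W: List[List[float]]) -> List[float]:
    """Stack the columns of an m x n matrix into one vector of length m * n."""
    m, n = len(W), len(W[0])
    return [W[i][j] for j in range(n) for i in range(m)]


W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]          # m = 2 rows, n = 3 columns
w = flatten(W)
```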

Assume $\mathbf{x}=(x_{1},x_{2},\cdots,x_{n})^{\text{T}},\mathbf{h}=(h_{1},h_{2},\cdots,h_{m})^{\text{T}}$ and $\mathbf{W}=(W_{ij})_{1\leq i\leq m,1\leq j\leq n}$ , then $h_{i}=\sum\limits_{j=1}^{n}W_{ij}x_{j}$ , when backpropagating

[TABLE]

Assume $\frac{\partial\ell}{\partial\mathbf{x}}=(\frac{\partial\ell}{\partial x_{1}},\frac{\partial\ell}{\partial x_{2}},\cdots,\frac{\partial\ell}{\partial x_{n}})^{\text{T}},\frac{\partial\ell}{\partial\mathbf{h}}=(\frac{\partial\ell}{\partial h_{1}},\frac{\partial\ell}{\partial h_{2}},\cdots,\frac{\partial\ell}{\partial h_{m}})^{\text{T}}$

and $\frac{\partial\ell}{\partial\mathbf{W}}=(\frac{\partial\ell}{\partial W_{ij}})_{1\leq i\leq m,1\leq j\leq n}$ , then

[TABLE]

In the proposed unified sparse backpropagation framework, the sparsifying function (Definition 3) is utilized to sparsify the gradient $\frac{\partial\ell}{\partial\mathbf{h}}$ propagated from the next layer, and the sparsified gradient is then propagated through the gradient computation graph according to the chain rule. Note that $\frac{\partial\ell}{\partial\mathbf{h}}$ is also an estimated gradient passed from the next layer. The gradient estimations are finally performed as follows:

[TABLE]

in other words,

[TABLE]

We introduce a lemma:

Lemma 3.

For a linear layer trained with SBP, the sparse ratio of the sparsifying function in SBP is $r$ . Denote $\mathbf{w}=flatten(\mathbf{W})$ . If $\ell=\ell(\mathbf{w},(\mathbf{x},\mathbf{y}))$ is the loss of MLP trained with SBP on this input instance and the input of this layer is $\mathbf{x}$ which satisfies $\|\mathbf{x}\|\neq 0$ , we use SBP to estimate $\partial\ell/\partial\mathbf{w}$ and $\partial\ell/\partial\mathbf{x}$ . Suppose $\mathbf{W}\mathbf{W}^{\text{T}}$ is a positive definite matrix, the eigenvalue decomposition of $\mathbf{W}\mathbf{W}^{\text{T}}$ is $\mathbf{W}\mathbf{W}^{\text{T}}=\mathbf{P}\mathbf{\Sigma}\mathbf{P}^{\text{T}}\in\mathbb{R}^{m\times m}$ , $\mathbf{\Sigma}=\text{diag}\{s_{1},s_{2},\cdots\,s_{m}\}$ ( $s_{i}>0$ ) and $\mathbf{P}$ is an orthogonal matrix. We define ${s_{\text{min}}}=\min\limits_{i}s_{i}$ , ${s_{\text{max}}}=\max\limits_{i}s_{i}$ and ${\sigma^{\prime}_{\text{min}}}=\min\limits_{i}\sigma^{\prime}(h_{i})$ , ${\sigma^{\prime}_{\text{max}}}=\max\limits_{i}\sigma^{\prime}(h_{i})$ . It is easy to verify that $s_{\text{min}}>0$ and $\sigma^{\prime}_{\text{min}}>0$ because $\mathbf{W}\mathbf{W}^{\text{T}}$ is a positive definite matrix and $\sigma$ is increasing. If $\rho_{1}\geq\frac{s_{\text{max}}}{s_{\text{min}}}\geq 1$ , $\rho_{2}\geq(\frac{\sigma^{\prime}_{\text{max}}}{\sigma^{\prime}_{\text{min}}})^{2}\geq 1$ , $\rho_{1}\cos\delta(\mathbf{h})+1-\rho_{1}>0$ and $\rho_{2}\cos\delta(\mathbf{z})+1-\rho_{2}>0$ , then we have

[TABLE]

and

[TABLE]

B.5 MLP Trained with SBP

Consider an $N$ -layer multi-layer perceptron (MLP) trained with SBP, in which every layer except the last layer is a linear layer with SBP. $\mathbf{x}^{(1)}\in\mathbb{R}^{n_{1}}$ is the input of the MLP, $\mathbf{x}^{(N+1)}\in\mathbb{R}^{n_{N+1}}$ is the output of the MLP. The $i$ -th layer of MLP is defined as

[TABLE]

where $\mathbf{x}^{(i)}\in\mathbb{R}^{n_{i}}$ , $\mathbf{W}^{(i)}\in\mathbb{R}^{n_{i+1}\times n_{i}}$ and $n_{i+1}\geq n_{i}$ , $\sigma_{i}$ is an increasing pointwise activation function of layer $i$ ( $i<N$ ). Note that the last layer is not a linear layer trained with SBP. Therefore, $\sigma_{N}$ need not be an increasing pointwise activation function. It can be the softmax function, which is not a pointwise activation function.

Assume $\mathbf{w}$ is the parameter vector of MLP defined as

[TABLE]

where $\mathbf{w}^{(i)}=flatten(\mathbf{W}^{(i)})\in\mathbb{R}^{n_{i}n_{i+1}}$ .

We use the condition number to measure how sensitive the output is to perturbations in the input data and to roundoff errors made during the solution process. Define condition number of matrix $\mathbf{A}$ as $\text{cond}(\mathbf{A})=\|\mathbf{A}\|\|\mathbf{A}^{-1}\|$ , when we adopts the spectral norm $\|\mathbf{A}\|=\|\mathbf{A}\|_{2}$ , then $\text{cond}(\mathbf{A})=\frac{\lambda_{\text{max}}}{\lambda_{\text{min}}}$ , where $\lambda_{\text{max}}$ and $\lambda_{\text{min}}$ are the maximum and minimum singular value of $\mathbf{A}$ respectively.

If the condition number is small, we say the matrix is well-posed; otherwise, we say it is ill-posed. A singular matrix has an infinite condition number and is therefore very ill-posed.
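As a small illustration of the definition above, the following sketch computes the spectral condition number of a 2x2 matrix from the eigenvalues of $\mathbf{A}^{\text{T}}\mathbf{A}$, whose square roots are the singular values of $\mathbf{A}$. The helper name `cond_2x2` and the restriction to 2x2 matrices are assumptions made only to keep the example free of dependencies; they do not come from the paper.

```python
import math


def cond_2x2(a, b, c, d):
    """Spectral condition number of the 2x2 matrix [[a, b], [c, d]].

    Hypothetical helper for illustration: the singular values of A are the
    square roots of the eigenvalues of the symmetric matrix A^T A.
    """
    # Entries of A^T A = [[p, q], [q, r]].
    p = a * a + c * c
    q = a * b + c * d
    r = b * b + d * d
    trace = p + r
    det = p * r - q * q
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    disc = math.sqrt(max(trace * trace - 4.0 * det, 0.0))
    lam_max = (trace + disc) / 2.0
    lam_min = (trace - disc) / 2.0
    if lam_min <= 0.0:
        # Singular matrix: the condition number is infinite.
        return math.inf
    return math.sqrt(lam_max / lam_min)


identity_cond = cond_2x2(1.0, 0.0, 0.0, 1.0)   # perfectly conditioned
scaled_cond = cond_2x2(2.0, 0.0, 0.0, 0.5)     # singular values 2 and 0.5
singular_cond = cond_2x2(1.0, 2.0, 2.0, 4.0)   # rank-one matrix
```

Note that the quantity $s_{\text{max}}/s_{\text{min}}$ used for a linear layer is a ratio of eigenvalues of $\mathbf{W}\mathbf{W}^{\text{T}}$, so it corresponds to the square of the condition number computed here.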

For an MLP trained with SBP, we say that it is $\rho$-well-posed (with $\rho=\rho_{1}\rho_{2}$) if there exist $\rho_{1}>1$ and $\rho_{2}>1$ such that, for any layer $i$ and any time step $t$,

[TABLE]

Here, for an $n$-dimensional vector $\mathbf{v}=[v_{1},v_{2},\cdots,v_{n}]^{\text{T}}$, we define $\text{diag}[\mathbf{v}]=\text{diag}\{v_{1},v_{2},\cdots,v_{n}\}$.

The following lemma ensures that, in an MLP trained with SBP, the gradient estimation angle of the parameter vector can be made arbitrarily small for any input instance and its label.

Lemma 4.

For an MLP trained with SBP, consider any input instance $\mathbf{x}^{(1)}=\mathbf{x}$ with label $\mathbf{y}$ that satisfies $\|\mathbf{x}\|\neq 0$, and let $\mathbf{w}$ be the parameter vector. If the MLP is $\rho$-well-posed, then for any $\theta\in(0,{\pi}/{2})$, there exists $r\in({1}/{\rho^{2}},1)$ such that if we set the sparse ratio of every sparsifying function in SBP to $r$, we obtain $\mathbf{g}^{\mathbf{w};(\mathbf{x},\mathbf{y})}$, an estimate of $\nabla\ell\big{(}\mathbf{w};(\mathbf{x},\mathbf{y})\big{)}$, whose gradient estimation angle satisfies $\delta(\mathbf{w})<\theta$.

B.6 Review of the Term "In Probability"

A sequence of random variables $X_{n}$ converges to a random variable $X$ in probability if for any $\epsilon>0$

[TABLE]

We introduce a lemma here.

Lemma 5.

For a sequence of random variables $X_{n}$ , when $n\to\infty$ , if $\mathrm{Var}(X_{n})\to 0$ and $\mathrm{E}(X_{n})\to a$ , then $X_{n}$ converges to $a$ in probability.
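A quick simulation can make Lemma 5 concrete. The sketch below takes $X_{n}$ to be the mean of $n$ independent Uniform(0,1) draws, so that $\mathrm{E}(X_{n})=0.5$ and $\mathrm{Var}(X_{n})=1/(12n)\to 0$. The choice of distribution, the function names, and the trial count are illustrative assumptions, not part of the paper.

```python
import random


def tail_frequency(n, eps, trials=2000, seed=0):
    """Empirical frequency of |X_n - 0.5| >= eps over repeated trials.

    X_n is the mean of n independent Uniform(0, 1) draws (an assumed example).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x_n = sum(rng.random() for _ in range(n)) / n
        if abs(x_n - 0.5) >= eps:
            hits += 1
    return hits / trials


def chebyshev_bound(n, eps):
    """Chebyshev upper bound Var(X_n) / eps^2, using Var(X_n) = 1 / (12 n)."""
    return min(1.0, 1.0 / (12.0 * n * eps * eps))


small_n = tail_frequency(4, 0.1)     # deviations of size 0.1 are common
large_n = tail_frequency(400, 0.1)   # deviations of size 0.1 become rare
bound_large_n = chebyshev_bound(400, 0.1)
```

As $n$ grows, both the empirical tail frequency and the Chebyshev bound shrink toward zero, which is the behavior the lemma describes.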

Appendix C Proofs

C.1 Proof of Theorem 1

Proof.

According to the Ineq.(34),

[TABLE]

According to Ineq.(39),

[TABLE]

Combining Ineq.(51) with Ineq.(52), we have

[TABLE]

in other words,

[TABLE]

we have

[TABLE]

According to Lemma 1

[TABLE]

Therefore

[TABLE]

By setting $\eta={\cos\theta\|\mathbf{w}-\mathbf{w}^{*}\|\over\|\mathbf{g}_{t}^{\mathbf{w}}\|}$ , we have

[TABLE]

Define $a(\theta)=\left.1\middle/\log{\frac{1}{\sin\theta}}\right.$ , where $a(\theta)>0$ . Then

[TABLE]

in other words,

[TABLE]

We have

[TABLE]

To ensure that $\|\mathbf{w}_{T}-\mathbf{w}^{*}\|\leq\epsilon$ , we just have to ensure that $\log\|\mathbf{w}_{T}-\mathbf{w}^{*}\|+{T\over a(\theta)}\leq\log\|\mathbf{w}_{0}-\mathbf{w}^{*}\|\leq\log\epsilon+{T\over a(\theta)}$ . In other words, we just have to ensure that $T\geq T(\epsilon)=a(\theta)\log{\|\mathbf{w}_{0}-\mathbf{w}^{*}\|\over\epsilon}$ . Therefore, $\forall\epsilon\in(0,\|\mathbf{w}_{0}-\mathbf{w}^{*}\|),\exists T(\epsilon)=\log{\|\mathbf{w}_{0}-\mathbf{w}^{*}\|\over\epsilon}\big{/}\log{\frac{1}{\sin\theta}}$ s.t.

[TABLE]

∎
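The bound $T(\epsilon)=\log{\|\mathbf{w}_{0}-\mathbf{w}^{*}\|\over\epsilon}\big{/}\log{\frac{1}{\sin\theta}}$ from the proof above is easy to evaluate numerically. The inequality $\log\|\mathbf{w}_{T}-\mathbf{w}^{*}\|+{T\over a(\theta)}\leq\log\|\mathbf{w}_{0}-\mathbf{w}^{*}\|$ is equivalent to $\|\mathbf{w}_{T}-\mathbf{w}^{*}\|\leq(\sin\theta)^{T}\|\mathbf{w}_{0}-\mathbf{w}^{*}\|$, which the sketch below uses as a sanity check. The function name `egd_steps` and the sample values are hypothetical.

```python
import math


def egd_steps(dist0, eps, theta):
    """Evaluate T(eps) = log(dist0 / eps) / log(1 / sin(theta)).

    dist0 stands for ||w_0 - w*||; the name is an assumption for this sketch.
    """
    if not 0.0 < theta < math.pi / 2.0:
        raise ValueError("theta must lie in (0, pi/2)")
    if not 0.0 < eps < dist0:
        raise ValueError("eps must lie in (0, dist0)")
    return math.log(dist0 / eps) / math.log(1.0 / math.sin(theta))


# Sample values chosen only for illustration.
dist0, eps, theta = 10.0, 1e-3, 1.0
steps = math.ceil(egd_steps(dist0, eps, theta))
# Distance bound after `steps` iterations under the geometric contraction.
bound_after_steps = dist0 * math.sin(theta) ** steps
```

A smaller $\theta$ gives a smaller contraction factor $\sin\theta$ and therefore fewer required steps.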

C.2 Proof of Theorem 2

Proof.

Suppose $\mathcal{D}=\{(\mathbf{x}_{i},\mathbf{y}_{i})\}_{i=1}^{n}$, where $n$ is the number of data instances. Define

[TABLE]

then

[TABLE]

We introduce a lemma here and prove it later. $\mathbf{u}_{i}$ and $\mathbf{v}_{i}$ in the lemma are defined above.

Lemma 6.

Dataset $\mathcal{D}$ has $n$ independent and identically distributed (i.i.d.) data instances. Suppose for any $\theta\in(0,\frac{\pi}{2})$ and any $(\mathbf{x}_{i},\mathbf{y}_{i})\in\mathcal{D}$ we can find $r\in({1}/{\rho^{2}},1)$ such that if we set the sparse ratio of every sparsifying function in SBP as $r$ , then $\angle\langle\mathbf{u}_{i},\mathbf{v}_{i}\rangle<\theta$ and $\|\mathbf{u}_{i}\|=\|\mathbf{v}_{i}\|$ . Then for any $\epsilon\in(0,\frac{\pi}{2})$ , there exists $r\in({1}/{\rho^{2}},1)$ such that when we set the sparse ratio of every sparsifying function in SBP as $r$ , then $\lim\limits_{|\mathcal{D}|\to\infty}\mathrm{P}(\delta(\mathbf{w})<\epsilon)=1$ .

According to Lemma 4, for any $\theta\in(0,\frac{\pi}{2})$ and any $(\mathbf{x}_{i},\mathbf{y}_{i})\in\mathcal{D}$ we can find $r\in({1}/{\rho^{2}},1)$ such that if we set the sparse ratio of every sparsifying function in SBP as $r$ , then $\angle\langle\mathbf{u}_{i},\mathbf{v}_{i}\rangle<\theta$ and $\|\mathbf{u}_{i}\|=\|\mathbf{v}_{i}\|$ (Eq.(113)). We also have a large enough dataset $\mathcal{D}$ , which has $n$ independent and identically distributed (i.i.d.) data instances. The condition of Lemma 6 is satisfied.

Therefore, for any $\epsilon\in(0,\frac{\pi}{2})$, there exists $r\in({1}/{\rho^{2}},1)$ such that when we set the sparse ratio of every sparsifying function in SBP as $r$, $\lim\limits_{|\mathcal{D}|\to\infty}\mathrm{P}(\delta(\mathbf{w})<\epsilon)=1$. ∎

C.3 Proofs of Lemmas

Proof of Lemma 1.

Without loss of generality, we assume $\|\mathbf{a}\|=\|\mathbf{b}\|=\|\mathbf{c}\|=1$ .

Define $\mathbf{a}_{1}=\mathbf{a}-(\mathbf{a}\bm{\cdot}\mathbf{c})\mathbf{c}$ and $\mathbf{b}_{1}=\mathbf{b}-(\mathbf{b}\bm{\cdot}\mathbf{c})\mathbf{c}$ . We have

[TABLE]

For $\mathbf{b_{1}}$ , similarly $\|\mathbf{b_{1}}\|^{2}=\sin^{2}\angle\langle\mathbf{b},\mathbf{c}\rangle,\mathbf{c}\bm{\cdot}\mathbf{b}_{1}=0$ . Therefore,

[TABLE]

In other words, $\angle\langle\mathbf{a},\mathbf{b}\rangle\leq\angle\langle\mathbf{a},\mathbf{c}\rangle+\angle\langle\mathbf{b},\mathbf{c}\rangle$ . ∎
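The triangle inequality for angles stated in Lemma 1 can be checked numerically on random vectors. The sketch below is only a sanity check under assumed settings (Gaussian vectors in five dimensions, hypothetical function names); it is not a substitute for the proof.

```python
import math
import random


def angle(u, v):
    """Angle in radians between two nonzero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against rounding before acos.
    return math.acos(max(-1.0, min(1.0, dot / (norm_u * norm_v))))


def max_violation(dim=5, trials=1000, seed=0):
    """Largest observed value of angle(a,b) - angle(a,c) - angle(b,c)."""
    rng = random.Random(seed)
    worst = -math.inf
    for _ in range(trials):
        a, b, c = ([rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(3))
        worst = max(worst, angle(a, b) - angle(a, c) - angle(b, c))
    return worst


worst_case = max_violation()
```

If the lemma holds, `worst_case` should never exceed zero beyond floating-point rounding.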

Proof of Lemma 2.

Without loss of generality, we assume $\|\mathbf{u}\|=\|\mathbf{v}\|=1$ .

Let $\mathbf{A}\mathbf{A}^{\text{T}}=\mathbf{P}\mathbf{\Sigma}\mathbf{P}^{\text{T}}$, where $\mathbf{\Sigma}=\text{diag}\{\sigma_{1},\sigma_{2},\cdots,\sigma_{m}\}$. According to the singular value decomposition (SVD), we have $\mathbf{A}=\mathbf{P}\mathbf{D}\mathbf{Q}^{\text{T}}$, where $\mathbf{P}$ and $\mathbf{Q}$ are orthogonal and $\mathbf{D}=\text{diag}\{\sqrt{\sigma_{1}},\cdots,\sqrt{\sigma_{n}}\}$.

We can see $\mathbf{D}^{\text{T}}=\mathbf{D}$ and for any vector $\mathbf{x}=(x_{1},x_{2},...,x_{m})^{\text{T}}\in\mathbb{R}^{m}$

[TABLE]

We define $\mathbf{a}=\mathbf{P}^{\text{T}}\mathbf{u}$ and $\mathbf{b}=\mathbf{P}^{\text{T}}\mathbf{v}$ , we have

[TABLE]

and similarly

[TABLE]

we have

[TABLE]

similarly

[TABLE]

Then $\mathbf{A}^{\text{T}}\mathbf{u}=\mathbf{Q}^{\text{T}}\mathbf{D}\mathbf{a}$ and $\mathbf{A}^{\text{T}}\mathbf{v}=\mathbf{Q}^{\text{T}}\mathbf{D}\mathbf{b}$ . Consider

[TABLE]

According to Eq.(79), Eq.(80), Eq.(81), Eq.(82), Eq.(84), Eq.(85) and Eq.(86), we have

[TABLE]

and

[TABLE]

In other words,

[TABLE]

∎

Proof of Lemma 3.

First, let’s consider $\delta(\mathbf{x})$ .

According to Lemma 2

[TABLE]

In other words,

[TABLE]

Then, let’s consider $\delta(\mathbf{h})$ .

According to Lemma 1

[TABLE]

Define $\mathbf{A}=\text{diag}\{\sigma^{\prime}(h_{1}),\sigma^{\prime}(h_{2}),\cdots,\sigma^{\prime}(h_{m})\}$ ,

then $\mathbf{A}\mathbf{A}^{\text{T}}=\text{diag}\{\sigma^{\prime}(h_{1})^{2},\sigma^{\prime}(h_{2})^{2},\cdots,\sigma^{\prime}(h_{m})^{2}\}$ , according to Lemma 2

[TABLE]

In other words, $\angle\langle\sigma^{\prime}(\mathbf{h})\odot\mathbf{g^{z}},\sigma^{\prime}(\mathbf{h})\odot\frac{\partial\ell}{\partial\mathbf{z}}\rangle\leq\arccos\big{(}\rho_{2}\cos\angle\langle\mathbf{g^{z}},\frac{\partial\ell}{\partial\mathbf{z}}\rangle+1-\rho_{2}\big{)}$ .

Combined with Ineq.(101), we have

[TABLE]

Finally, let’s consider $\delta(\mathbf{w})=\delta(flatten(\mathbf{W}))$ .

Without loss of generality, we assume

[TABLE]

On one hand,

[TABLE]

On the other hand,

[TABLE]

Consider

[TABLE]

combined with Eq.(107) and Eq.(108)

[TABLE]

In other words, $\delta(\mathbf{w})=\delta(flatten(\mathbf{W}))=\delta(\mathbf{h})$ . ∎
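The identity $\delta(\mathbf{w})=\delta(\mathbf{h})$ can be illustrated with a toy linear layer $\mathbf{h}=\mathbf{W}\mathbf{x}$, whose weight gradient is the outer product of the gradient with respect to $\mathbf{h}$ and the input $\mathbf{x}$. The sketch below assumes that $top_{k}$ keeps the $k$ largest-magnitude entries and that it is applied directly to the gradient with respect to $\mathbf{h}$; the numbers and function names are made up for illustration.

```python
import math


def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero the rest (assumed form)."""
    order = sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)
    keep = set(order[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(v)]


def angle(u, v):
    """Angle in radians between two nonzero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / (norm_u * norm_v))))


def flat_outer(g, x):
    """flatten(g x^T): the weight gradient of h = W x given the gradient g of h."""
    return [gi * xj for gi in g for xj in x]


# Illustrative values only; they do not come from the paper.
grad_h = [0.9, -0.1, 0.05, -1.2, 0.3]   # true gradient with respect to h
x = [0.5, -2.0, 1.5]                    # layer input with nonzero norm
sparse_grad_h = top_k(grad_h, 2)        # sparsified estimate of grad_h

delta_h = angle(sparse_grad_h, grad_h)
delta_w = angle(flat_outer(sparse_grad_h, x), flat_outer(grad_h, x))
```

The two angles agree because the inner product and both norms of the flattened outer products pick up the same factor $\|\mathbf{x}\|^{2}$, which cancels in the cosine.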

Proof of Lemma 4.

For $\mathbf{w}=[{\mathbf{w}^{(1)}}^{\text{T}},{\mathbf{w}^{(2)}}^{\text{T}},\cdots,{\mathbf{w}^{(N)}}^{\text{T}}]^{\text{T}}\in\mathbb{R}^{n_{total}}$, where $n_{total}=n_{1}n_{2}+n_{2}n_{3}+\cdots+n_{N-1}n_{N}+n_{N}n_{N+1}$, we define the estimated gradient $\mathbf{g}$ as $\mathbf{g}=[\lambda^{(1)}(\mathbf{g}^{\mathbf{w}^{(1)}})^{\text{T}},\lambda^{(2)}(\mathbf{g}^{\mathbf{w}^{(2)}})^{\text{T}},\cdots,\lambda^{(N)}(\mathbf{g}^{\mathbf{w}^{(N)}})^{\text{T}}]^{\text{T}}\in\mathbb{R}^{n_{total}}$. We use $\mathbf{g}^{\mathbf{w};(\mathbf{x},\mathbf{y})}=\mathbf{g}$ to estimate $\nabla\ell\big{(}\mathbf{w};(\mathbf{x},\mathbf{y})\big{)}$, and the gradient estimation angle is

[TABLE]

We choose $\lambda^{(i)}={\|\frac{\partial\ell}{\partial\mathbf{w}^{(i)}}\|}\big{/}{\|\mathbf{g}^{\mathbf{w}^{(i)}}\|}$ , then we have

[TABLE]

Suppose $\delta=\max\limits_{i}\delta(\mathbf{w}^{(i)})$ , then we have

[TABLE]

In other words, $\delta(\mathbf{w})\leq\delta$ .

We will prove that there exists $r\in(\frac{1}{\rho^{2}},1)$ to ensure $\delta<\theta$ .

For a $\rho$ -well-posed $N$ -layer MLP trained with SBP, there exist $\rho_{1}>1,\rho_{2}>1$ satisfying

[TABLE]

Therefore, denote $\mathbf{W}=\mathbf{W}^{(i)}$ and $\mathbf{h}^{(i+1)}=[h_{1},h_{2},\cdots,h_{m}]^{\text{T}}$ with $m=n_{i+1}$. Let the eigenvalue decomposition of $\mathbf{W}\mathbf{W}^{\text{T}}$ be $\mathbf{W}\mathbf{W}^{\text{T}}=\mathbf{P}\mathbf{\Sigma}\mathbf{P}^{\text{T}}\in\mathbb{R}^{m\times m}$, where $\mathbf{\Sigma}=\text{diag}\{s_{1},s_{2},\cdots,s_{m}\}$ ($s_{i}>0$) and $\mathbf{P}$ is an orthogonal matrix, and define ${s_{\text{min}}}=\min\limits_{i}s_{i}$, ${s_{\text{max}}}=\max\limits_{i}s_{i}$, ${\sigma^{\prime}_{\text{min}}}=\min\limits_{i}\sigma^{\prime}(h_{i})$ and ${\sigma^{\prime}_{\text{max}}}=\max\limits_{i}\sigma^{\prime}(h_{i})$. (It is easy to verify that $s_{\text{min}}>0$ and $\sigma^{\prime}_{\text{min}}>0$ because $\mathbf{W}\mathbf{W}^{\text{T}}$ is a positive definite matrix and $\sigma$ is increasing.) We have

[TABLE]

Note that $\rho_{1}>1$ and $\rho_{2}>1$ satisfy the conditions in Lemma 3 for every linear layer with SBP, and that $\rho=\rho_{1}\rho_{2}>1$.

Define $\alpha_{i}=\cos\delta(\mathbf{w}^{(i)})$. Note that the last layer is not trained with SBP; therefore $\alpha_{N}=1$. For $i<N$, if $\rho\alpha_{i+1}+1-\rho=\rho_{1}\rho_{2}\alpha_{i+1}+1-\rho_{1}\rho_{2}>0$, we have

[TABLE]

which are exactly the conditions of Lemma 3. According to Lemma 3,

[TABLE]

In other words,

[TABLE]

Define $\beta_{N}=0\geq 1-\alpha_{N}$ and for $1\leq i<N$

[TABLE]

Assume $\beta_{1}<\frac{1}{\rho}<1$ first, then

[TABLE]

therefore $\beta_{2}<\beta_{1}<\frac{1}{\rho}<1$ .

Similarly if $\beta_{i}<\frac{1}{\rho}<1$

[TABLE]

therefore $\beta_{i+1}<\beta_{i}<\cdots<\beta_{2}<\beta_{1}<\frac{1}{\rho}<1$ . In other words, $\beta_{N}<\beta_{N-1}<\cdots<\beta_{2}<\beta_{1}<\frac{1}{\rho}<1$ .

If $\beta_{i+1}\geq 1-\alpha_{i+1}$ , we have $\rho\alpha_{i+1}+1-\rho=1-\rho(1-\alpha_{i+1})>0$ (because $\frac{1}{\rho}>\beta_{1}>\beta_{i+1}$ ), which is the condition for Ineq.(128). According to Ineq.(128) and Ineq.(129)

[TABLE]

In other words, $\beta_{i}\geq 1-\alpha_{i}$ . Note $\beta_{N}\geq 1-\alpha_{N}$ , therefore $\beta_{N}<\beta_{N-1}<\cdots<\beta_{2}<\beta_{1}<\frac{1}{\rho}<1,\quad\beta_{i}\geq 1-\alpha_{i}$ .

In order to ensure $\delta<\theta$ under the assumption $\beta_{1}<\frac{1}{\rho}$

[TABLE]

we just need to ensure $\beta_{1}<\min(\frac{1}{\rho},1-\cos\theta)$ . According to Eq.(129)

[TABLE]

Denote $a=\rho\sqrt{r}>1,b=1-\sqrt{r}+\sqrt{1-r}>0$ , then

[TABLE]

Therefore, we just need to ensure

[TABLE]

Denote $f(r)=\min(\frac{1}{\rho},1-\cos\theta)-\frac{a^{N-1}-1}{a-1}b$ , where $a=\rho\sqrt{r}>1,b=1-\sqrt{r}+\sqrt{1-r}>0$ . To ensure $\delta(\mathbf{w})<\theta$ , we just need to ensure $f(r)>0$ . We have

[TABLE]

Because $f(r)$ is a continuous function of $r$, there exists $r\in(\frac{1}{\rho^{2}},1)$ such that $f(r)>0$.

∎
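The existence argument at the end of the proof of Lemma 4 can be explored numerically using the definition $f(r)=\min(\frac{1}{\rho},1-\cos\theta)-\frac{a^{N-1}-1}{a-1}b$ with $a=\rho\sqrt{r}$ and $b=1-\sqrt{r}+\sqrt{1-r}$. The sketch below searches for a feasible $r$ by repeatedly halving the distance to 1; the search strategy, the function names, and the sample values of $\rho$, $\theta$ and $N$ are assumptions made for illustration.

```python
import math


def f_r(r, rho, theta, n_layers):
    """Evaluate f(r) as defined in the proof, for r in (1 / rho^2, 1)."""
    a = rho * math.sqrt(r)                           # a > 1 because r > 1 / rho^2
    b = 1.0 - math.sqrt(r) + math.sqrt(1.0 - r)      # b > 0 and b -> 0 as r -> 1
    geometric_sum = (a ** (n_layers - 1) - 1.0) / (a - 1.0)
    return min(1.0 / rho, 1.0 - math.cos(theta)) - geometric_sum * b


def find_sparse_ratio(rho, theta, n_layers, max_halvings=60):
    """Return some r in (1 / rho^2, 1) with f(r) > 0, or None if none is found."""
    gap = (1.0 - 1.0 / rho ** 2) / 2.0               # start at the interval midpoint
    for _ in range(max_halvings):
        r = 1.0 - gap
        if r < 1.0 and f_r(r, rho, theta, n_layers) > 0.0:
            return r
        gap /= 2.0
    return None


# Sample setting chosen only for illustration.
rho, theta, n_layers = 2.0, math.pi / 4.0, 3
r_found = find_sparse_ratio(rho, theta, n_layers)
```

Since $b\to 0$ as $r\to 1$ while the geometric sum stays bounded, $f(r)$ approaches $\min(\frac{1}{\rho},1-\cos\theta)>0$, so the search is expected to succeed for $r$ sufficiently close to 1.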

Proof of Lemma 5.

According to Chebyshev's inequality,

[TABLE]

In other words,

[TABLE]

Because $\mathrm{E}(X_{n})\to a$ in probability, we have the following in probability

[TABLE]

For events $A$ and $B$, we have

[TABLE]

combined with

[TABLE]

Therefore

[TABLE]

∎

Proof of Lemma 6.

For any $\theta$, we can choose a sparse ratio, denoted by $r(\theta)$, such that $\angle\langle\mathbf{u}_{i},\mathbf{v}_{i}\rangle<\theta$. To ensure $\delta(\mathbf{w}_{t})<\epsilon$, we just need to ensure $\cos\delta(\mathbf{w}_{t})>\cos\epsilon$.

Define

[TABLE]

then

[TABLE]

According to Lemma 1, $\angle\langle\mathbf{v}_{i},\mathbf{\bar{u}}\rangle\leq\angle\langle\mathbf{v}_{i},\mathbf{u}_{i}\rangle+\angle\langle\mathbf{u}_{i},\mathbf{\bar{u}}\rangle<\theta+\angle\langle\mathbf{u}_{i},\mathbf{\bar{u}}\rangle$ . Because $\theta$ can be arbitrarily small, $\theta+\angle\langle\mathbf{u}_{i},\mathbf{\bar{u}}\rangle<\pi$ can hold. Define

[TABLE]

According to the Minkowski inequality,

[TABLE]

Define $\beta=\sum\limits_{i=1}^{n}\|\mathbf{u}_{i}\|/(n\|\mathbf{\bar{u}}\|)$ , then,

[TABLE]

Combined with Eq.(156), then

[TABLE]

Consider

[TABLE]

In other words,

[TABLE]

Because $\theta$ can be arbitrarily small, $\cos\theta-\sqrt{\beta^{2}-1}\sin\theta>0$ can hold. Combined with Eq.(164), then

[TABLE]

We define $f(\theta)=\frac{1}{1+2\beta\sin\frac{\theta}{2}}(\cos\theta-\sqrt{\beta^{2}-1}\sin\theta)$ , where $\beta=\sum\limits_{i=1}^{n}\|\mathbf{u}_{i}\|/(n\|\mathbf{\bar{u}}\|)$ .

Assume that the $\|\mathbf{u}_{i}\|$ are i.i.d., that $\mathrm{Var}\|\mathbf{u}_{i}\|$ and $\mathrm{E}\|\mathbf{u}_{i}\|$ are finite, and that $\|\mathrm{E}\mathbf{u}_{i}\|>0$. (This is reasonable because the data instances are i.i.d. and we may assume the gradient norms are bounded; moreover, if $\|\mathrm{E}\mathbf{u}_{i}\|=0$, the network has already converged to the global minimum.)

Note that if $A$ and $B$ are independent, $\mathrm{Var}(A+B)=\mathrm{Var}(A)+\mathrm{Var}(B),\mathrm{E}(AB)=\mathrm{E}(A)\mathrm{E}(B)$ . We have

[TABLE]

where $\mathbf{u}_{i}^{(j)}$ and $\mathbf{\bar{u}}^{(j)}$ denote the $j$-th components of the corresponding vectors.

According to Lemma 5 and $\mathrm{E}({\sum\limits_{i=1}^{n}\|\mathbf{u}_{i}\|}/{n})=\mathrm{E}(\|\mathbf{u}_{i}\|)$ , when $n\to\infty$ (here we consider convergence in probability),

[TABLE]

Note that we assume $\mathrm{Var}\|\mathbf{u}_{i}\|$ and $\mathrm{E}\|\mathbf{u}_{i}\|$ are finite and $\|\mathrm{E}\mathbf{u}_{i}\|>0$ . Therefore, there exists $\beta_{1}$ such that $\beta_{1}>{\mathrm{E}\|\mathbf{u}_{i}\|}/{\|\mathrm{E}\mathbf{u}_{i}\|}$ holds in every time step. Therefore when $n$ is large enough,

[TABLE]

When $\beta<\beta_{1}$ , we have

[TABLE]

To ensure $\delta(\mathbf{w}_{t})<\epsilon$ , we just need to ensure $f(\theta)>\cos\epsilon$ , consider

[TABLE]

In other words, for any $\epsilon$, there exist $\theta$ and $r$ such that if we set the sparse ratio $r=r(\theta)$, then $\delta(\mathbf{w}_{t})<\epsilon$ holds when $\beta<\beta_{1}$. Therefore when $n$ is large enough,

[TABLE]

To conclude,

[TABLE]

∎
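The role of $f(\theta)=\frac{1}{1+2\beta\sin\frac{\theta}{2}}(\cos\theta-\sqrt{\beta^{2}-1}\sin\theta)$ in the proof of Lemma 6 can be illustrated numerically: since $f(\theta)\to 1$ as $\theta\to 0$, a small enough $\theta$ gives $f(\theta)>\cos\epsilon$. The sketch below finds such a $\theta$ by halving. The function names, the halving strategy, and the sample values of $\epsilon$ and $\beta$ are assumptions for illustration.

```python
import math


def f_theta(theta, beta):
    """Evaluate f(theta) from the proof for a given beta >= 1."""
    numerator = math.cos(theta) - math.sqrt(beta * beta - 1.0) * math.sin(theta)
    denominator = 1.0 + 2.0 * beta * math.sin(theta / 2.0)
    return numerator / denominator


def find_theta(eps, beta, max_halvings=60):
    """Return some theta in (0, eps] with f(theta) > cos(eps), or None."""
    theta = eps
    for _ in range(max_halvings):
        if f_theta(theta, beta) > math.cos(eps):
            return theta
        theta /= 2.0
    return None


# Sample values chosen only for illustration.
eps, beta = 0.3, 1.5
theta_found = find_theta(eps, beta)
```

A larger $\beta$ forces a smaller $\theta$, which in turn corresponds to a sparse ratio closer to 1.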

Appendix D Statistical test

In this section, statistical tests are conducted on the MNIST dataset for MSBP and SBP under different settings of $k$ and $\gamma$. We assume that the accuracies of SBP and MSBP follow two normal distributions $N(\mu_{1},\sigma_{1})$ and $N(\mu_{2},\sigma_{2})$, respectively. Both SBP and MSBP are run $n=20$ times.

First, to test whether MSBP improves the performance, Student's t-tests are conducted with null hypothesis $H_{0}:\mu_{1}\geq\mu_{2}$ and alternative hypothesis $H_{a}:\mu_{1}<\mu_{2}$. The critical $t$-value is approximately $1.7$ at $p=0.05$ with $df=2(n-1)=38$ degrees of freedom. Results for different settings are shown in Table 6.

For all settings of $k$ and $\gamma$, the $t$-values are at least $1.7$; that is, MSBP improves the performance of SBP statistically significantly ($p<0.05$).

Then, to test whether MSBP improves the stability, F-tests are conducted with null hypothesis $H_{0}:\sigma_{1}\leq\sigma_{2}$ and alternative hypothesis $H_{a}:\sigma_{1}>\sigma_{2}$. The critical $F$-value is approximately $2.1$ at $p=0.05$, where the numerator and denominator both have $df=n-1=19$ degrees of freedom. Results for different settings are shown in Table 7.

For nearly all settings of $k$ and $\gamma$ (except the setting $k=5,\gamma=0.2$, which is shown in bold in Table 7), the $F$-values are at least $2.1$; that is, MSBP improves the stability of SBP statistically significantly ($p<0.05$).

To conclude, the proposed MSBP method improves both the performance and stability of traditional SBP statistically significantly ( $p<0.05$ ) for nearly all settings of $k$ and $\gamma$ .
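The two test statistics described in this appendix can be computed as in the following sketch. It assumes a pooled-variance two-sample t-test, which is consistent with the stated $df=2(n-1)$ but is not spelled out in the text, and a variance-ratio F statistic. The function names are hypothetical, and the short input lists are toy placeholders rather than experimental accuracies.

```python
import math
import statistics

# Approximate critical values quoted in the text for p = 0.05.
T_CRITICAL = 1.7
F_CRITICAL = 2.1


def pooled_t(sbp, msbp):
    """Pooled-variance t statistic for H_a: mean(sbp) < mean(msbp)."""
    n1, n2 = len(sbp), len(msbp)
    var1, var2 = statistics.variance(sbp), statistics.variance(msbp)
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    mean_gap = statistics.mean(msbp) - statistics.mean(sbp)
    return mean_gap / math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))


def f_ratio(sbp, msbp):
    """Variance ratio s_1^2 / s_2^2 for H_a: sigma_1 > sigma_2."""
    return statistics.variance(sbp) / statistics.variance(msbp)


# Toy inputs for illustration only; these are not results from the paper.
toy_sbp = [1.0, 2.0, 3.0]
toy_msbp = [2.0, 3.0, 4.0]
t_value = pooled_t(toy_sbp, toy_msbp)
f_value = f_ratio(toy_sbp, toy_msbp)
```

With real data, a $t$-value above `T_CRITICAL` or an $F$-value above `F_CRITICAL` would lead to rejecting the corresponding null hypothesis at the stated significance level, provided the sample size matches the $n=20$ for which those critical values were quoted.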

