Taxonomy of Saliency Metrics for Channel Pruning

Kaveena Persand; Andrew Anderson; David Gregg

arXiv:1906.04675·cs.LG·May 6, 2022


TL;DR

This paper introduces a taxonomy of saliency metrics for channel pruning in neural networks, enabling better understanding, comparison, and creation of effective pruning metrics through extensive experiments.

Contribution

It proposes a novel taxonomy based on four principal components, facilitating the systematic analysis and development of saliency metrics for channel pruning.

Findings

01 A broad range of metrics can be grouped by the taxonomy

02 Reduction and scaling are crucial in pruning effectiveness

03 Some new metrics outperform state-of-the-art methods


Tables

Table 1: A taxonomy of published channel saliency metrics. One component from each column is chosen to construct a channel saliency metric.

Base Input, $X$	Pointwise Metric, $f(x)$	Reduction, $R$	Scaling, $K$
$X = W$	$x$	$\sum_{x \in {}^{l}X_{i}} x$	$1$
$X = A$	$\frac{d ℒ}{d x}$	$\sum_{x \in {}^{l}X_{i}} \| x \|$	$n ({}^{l}X_{i})$
	$- x \frac{d ℒ}{d x}$	$\| \sum_{x \in {}^{l}X_{i}} x \|$	${∥ {}^{l}{\tilde{S}} ∥}_{1}$
	$- x \frac{d ℒ}{d x} + \sum_{y \notin \tilde{W}} \frac{x y}{2} \frac{d^{2} ℒ}{d x d y}$	$\sum_{x \in {}^{l}X_{i}} {(x)}^{2}$	${∥ {}^{l}{\tilde{S}} ∥}_{2}$
	$x - \mathit{reconstruction}(x)$	${(\sum_{x \in {}^{l}X_{i}} x)}^{2}$	$n (\mathcal{TC} ({}^{l}W_{i}))$
	$\mathit{alternateBackprop}(x)$	$\sqrt{\sum_{x \in {}^{l}X_{i}} {(x)}^{2}}$	

Table 2: Published approaches for fine-grain pruning.

Method	Pointwise Measure, $f (x)$
Magnitude [6, 24, 27, 11, 42]	$\| w \|$
Optimal Brain Damage [9]	$\frac{d^{2} ℒ}{d w^{2}}$
Optimal Brain Surgeon [10]	$\frac{w_{i} w_{j}}{2} \frac{d^{2} ℒ}{d w_{i} d w_{j}}$
Gradient of mask [7]	$- \frac{d ℒ}{d m}$
Gradient of weights [8]	$- \frac{d ℒ}{d w}$
1st order Taylor expansion [16]	$\| \frac{d ℒ}{d w} w \|$

Table 3: Published approaches for channel pruning.

Method	Base Input, $X$	Pointwise Measure, $f (x)$	After Reduction, ${}^{l}{\tilde{S}}_{i}$	Scaling, $K$
Using only weights
L1-norm of weights [4, 43, 26]	$W$	$x$	$\sum_{x \in {}^{l}X_{i}} \| f (x) \|$	$1$
L2-norm of weights [17, 41]	$W$	$x$	$\sum_{x \in {}^{l}X_{i}} f {(x)}^{2}$	$1$
Min-weight [23]	$W$	$x$	$\sum_{x \in {}^{l}X_{i}} f {(x)}^{2}$	${∥ {}^{l}X_{i} ∥}_{0}$
NISP [44]	$W$	$\mathit{alternateBackprop}(x)$ with $\mathit{alternateBackprop}(x) = \mathit{NISP}(x)$ (Equation 9)	$\sum_{x \in {}^{l}X_{i}} f (x)$	$1$
Geometric median of weights [45]	$W$	$x - \mathit{reconstruction}(x)$ with $\mathit{reconstruction}(x) = \mathit{GM}(x)$ (Equation 5)	$\sum_{x \in {}^{l}X_{i}} f {(x)}^{2}$	$1$
Using weights and input images
Sum of feature map [37]	$A$	$x$	$\sum_{x \in {}^{l}X_{i}} f (x)$	$1$
APoZ [22]	$A$	$\begin{cases} 1, & \text{if } x > 0 \\ 0, & \text{else} \end{cases}$	$\sum_{x \in {}^{l}X_{i}} f (x)$	$n ({}^{l}X_{i})$
L2-norm of activations [36]	$A$	$x$	$\sum_{x \in {}^{l}X_{i}} f {(x)}^{2}$	$1$
Using weights, input images and labels
Fisher information using activations [40, 46]	$A$	$x \frac{d ℒ}{d x}$	${(\sum_{x \in {}^{l}X_{i}} f (x))}^{2}$	$\frac{1}{2}$
Fisher information using weights [47]	$W$	$x \frac{d ℒ}{d x}$	${(\sum_{x \in {}^{l}X_{i}} f (x))}^{2}$	$1$
1st Order Taylor [23]	$A$	$x \frac{d ℒ}{d x}$	$\| \sum_{x \in {}^{l}X_{i}} f (x) \|$	$n ({}^{l}X_{i})$
1st Order Taylor, w. norm [23]	$A$	$x \frac{d ℒ}{d x}$	$\| \sum_{x \in {}^{l}X_{i}} f (x) \|$	${∥ {}^{l}{\tilde{S}} ∥}_{2}$
Average of gradient [48]	$A$	$\frac{d ℒ}{d x}$	$\sum_{x \in {}^{l}X_{i}} f (x)$	$n ({}^{l}X_{i})$
Collaborative channel pruning [49]	$W$	$\frac{1}{2} x_{0} x_{1} \frac{d^{2} ℒ}{d x_{0} d x_{1}}$ (Table 4)	$\sum_{\substack{x_{0}, x_{1} : \\ x_{0}, x_{1} \notin {}^{l}\tilde{X}_{i}}} f (x_{0}, x_{1})$	$1$
Connection sensitivity [12]	$W$ (Section 7.1)	$x \frac{d ℒ}{d x}$	$\sum_{x \in {}^{l}X_{i}} \| f (x) \|$	$\sum_{s \in {}^{l}{\tilde{S}}} s$

Table 4: Approximations applied to the terms in Equation 20 to obtain a saliency metric for pruning.

Saliency metric	1st order terms	2nd order terms (Hessian): Shape	2nd order terms (Hessian): Approximate	2nd order terms (Hessian): Approximation Used
Optimal Brain Damage [9]	Omitted	Diagonal	Y	Levenberg-Marquardt
Optimal Brain Surgeon [10]	Omitted	Full	Y	Fisher
First order Taylor [23]	Exact	Omitted	-	-
Fisher Information [40]	Omitted	Diagonal	Y	Fisher
Collaborative Channel Pruning [49]	Omitted	Full	Y	Gauss-Newton with $H_{\sigma} = \mathrm{diag}(L \oslash (P \odot P))$

Table 5: Summary of trained network accuracy on CIFAR-10, CIFAR-100 and ImageNet-12.

	LeNet-5	CIFAR10	ResNet-20	NIN	AlexNet
CIFAR-10	69.4%	72.8%	88.4%	88.3%	84.2%
CIFAR-100	-	-	59.2%	65.7%	54.2%
ImageNet-32	-	-	-	-	39.7%

Table 6: Effectiveness of metrics which use different information. The maximum sparsity achieved (%) with Algorithm 1 is shown.

CIFAR-10 dataset
Network	Weights-based	Activation-based	Gradients-based
LeNet-5	80.4 $\pm$ 2	78.5 $\pm$ 2	84.9 $\pm$ 4
CIFAR-10	63.0 $\pm$ 12	63.5 $\pm$ 4	67.5 $\pm$ 5
ResNet-20	17.6 $\pm$ 7	20.6 $\pm$ 24	25.4 $\pm$ 9
NIN	63.6 $\pm$ 2	59.1 $\pm$ 3	72.8 $\pm$ 3
AlexNet	68.0 $\pm$ 0.3	69.6 $\pm$ 2	70.1 $\pm$ 2
CIFAR-100 dataset
ResNet-20	5.8 $\pm$ 0.4	6.6 $\pm$ 1	12.5 $\pm$ 3
NIN	59.2 $\pm$ 1	54.4 $\pm$ 2	52.6 $\pm$ 1
AlexNet	60.2 $\pm$ 0.1	60.2 $\pm$ 11	64.0 $\pm$ 3
ImageNet-32 dataset
AlexNet	55.4 $\pm$ 7	51.6 $\pm$ 2	51.7 $\pm$ 5

Table 7: The best performing saliency metric in each scenario.

Network	Sparsity %	Saliency Metric
CIFAR-10 dataset
LeNet-5	84.9 $\pm$ 4	$\sum_{x \in {}^{l}A_{i}} {(\frac{d ℒ}{d x})}^{2}$
CIFAR-10	67.5 $\pm$ 5	$\sum_{x \in {}^{l}A_{i}} \| - x \frac{d ℒ}{d x} + \frac{x^{2}}{2} {\frac{d^{2} ℒ}{d x^{2}}}_{G N} \|$
ResNet-20	25.4 $\pm$ 9	$\frac{1}{n ({}^{l}W_{i})} {(\sum_{x \in {}^{l}A_{i}} \frac{d ℒ}{d x})}^{2}$
NIN	72.8 $\pm$ 3	$\frac{1}{n ({}^{l}W_{i})} {(\sum_{x \in {}^{l}A_{i}} - x \frac{d ℒ}{d x} + \frac{x^{2}}{2} {\frac{d^{2} ℒ}{d x^{2}}}_{G N})}^{2}$
AlexNet	70.1 $\pm$ 2	$\frac{1}{n ({}^{l}W_{i})} \sum_{x \in {}^{l}W_{i}} \| - x \frac{d ℒ}{d x} \|$
CIFAR-100 dataset
ResNet-20	12.5 $\pm$ 3	${(\sum_{x \in {}^{l}A_{i}} \frac{d ℒ}{d x})}^{2}$
NIN	59.2 $\pm$ 1	$\frac{1}{{∥ {}^{l}S ∥}_{1}} \sum_{x \in {}^{l}W_{i}} \| x \|$
AlexNet	64.0 $\pm$ 3	$\frac{1}{{∥ {}^{l}A_{i} ∥}_{0}} \sum_{x \in {}^{l}A_{i}} \| \frac{d ℒ}{d x} \|$
ImageNet-32 dataset
AlexNet	55.4 $\pm$ 7	$\frac{1}{n ({}^{l}W_{i})} \sum_{x \in {}^{l}W_{i}} \| x \|$

Table 8: Comparison between different pointwise saliency metrics. The maximum proportion of weights removed (sparsity) by Algorithm 1 (%) using different pointwise saliency metrics with weights as the input ($x = w$) and Equation 25 as reduction method.

Metric	$w$	$\frac{d ℒ}{d w}$	$- w \frac{d ℒ}{d w}$	$- w \frac{d ℒ}{d w} + \frac{w^{2}}{2} {\frac{d^{2} ℒ}{d w^{2}}}_{G N}$	$\frac{w^{2}}{2} {\frac{d^{2} ℒ}{d w^{2}}}_{G N}$	$- w \frac{d ℒ}{d w} + \frac{w^{2}}{2} {\frac{d^{2} ℒ}{d w^{2}}}_{L M}$	$\frac{w^{2}}{2} {\frac{d^{2} ℒ}{d w^{2}}}_{L M}$
Network	Weights only	Weights, input images and labels
		Gradients only	Taylor expansions
			1^st order	2^nd order with diagonal Hessian
			1^st order	Gauss-Newton approximation		Levenberg-Marquardt approximation
CIFAR-10 dataset
LeNet-5	78.7 $\pm$ 9.8	64.3 $\pm$ 13.9	80.6 $\pm$ 3.0	80.3 $\pm$ 3.8	79.0 $\pm$ 8.0	80.8 $\pm$ 2.8	78.5 $\pm$ 5.2
CIFAR10	16.7 $\pm$ 62.9	19.3 $\pm$ 19.7	53.0 $\pm$ 24.5	50.7 $\pm$ 24.2	61.7 $\pm$ 17.6	50.4 $\pm$ 28.7	22.3 $\pm$ 5.6
ResNet-20	1.7 $\pm$ 0.0	6.7 $\pm$ 8.7	4.6 $\pm$ 3.7	4.6 $\pm$ 2.8	4.0 $\pm$ 2.3	4.4 $\pm$ 4.2	1.2 $\pm$ 0.1
NIN	34.5 $\pm$ 0.5	5.6 $\pm$ 1.8	61.1 $\pm$ 2.8	20.1 $\pm$ 59.1	13.7 $\pm$ 71.5	60.3 $\pm$ 2.3	42.8 $\pm$ 1.7
AlexNet	64.0 $\pm$ 9.6	40.1 $\pm$ 4.8	51.7 $\pm$ 9.9	52.0 $\pm$ 9.8	49.7 $\pm$ 8.6	55.1 $\pm$ 6.1	20.0 $\pm$ 24.8
CIFAR-100 dataset
ResNet-20	3.4 $\pm$ 0.2	3.2 $\pm$ 3.8	2.8 $\pm$ 6.1	4.1 $\pm$ 5.0	1.9 $\pm$ 14.0	4.1 $\pm$ 3.4	1.1 $\pm$ 0.1
NIN	42.2 $\pm$ 1.0	35.2 $\pm$ 0.2	36.2 $\pm$ 0.1	36.4 $\pm$ 1.0	37.1 $\pm$ 0.1	36.2 $\pm$ 0.9	52.4 $\pm$ 2.7
AlexNet	58.6 $\pm$ 21.4	26.4 $\pm$ 14.5	56.1 $\pm$ 14.8	52.0 $\pm$ 19.3	50.2 $\pm$ 11.2	52.1 $\pm$ 5.7	27.0 $\pm$ 7.3
ImageNet-32 dataset
AlexNet	28.2 $\pm$ 2.1	31.0 $\pm$ 3.3	31.1 $\pm$ 0.8	31.2 $\pm$ 2.2	35.2 $\pm$ 1.0	31.2 $\pm$ 0.8	1.1 $\pm$ 0.1

Table 9: Comparison between different pointwise saliency metrics. The maximum proportion of weights removed (sparsity) by Algorithm 1 (%) using different pointwise saliency metrics with output points as the input ($x = a$) and Equation 25 as reduction method.

Metric	$a$	$\frac{d ℒ}{d a}$	$- a \frac{d ℒ}{d a}$	$- a \frac{d ℒ}{d a} + \frac{a^{2}}{2} {\frac{d^{2} ℒ}{d a^{2}}}_{G N}$	$\frac{a^{2}}{2} {\frac{d^{2} ℒ}{d a^{2}}}_{G N}$	$- a \frac{d ℒ}{d a} + \frac{a^{2}}{2} {\frac{d^{2} ℒ}{d a^{2}}}_{L M}$	$\frac{a^{2}}{2} {\frac{d^{2} ℒ}{d a^{2}}}_{L M}$
Network	Weights and input images	Weights, input images and labels
		Gradients only	Taylor expansions
			1^st order	2^nd order with diagonal Hessian
			1^st order	Gauss-Newton approximation		Levenberg-Marquardt approximation
CIFAR-10 dataset
LeNet-5	77.1 $\pm$ 2.5	82.0 $\pm$ 2.1	82.5 $\pm$ 2.2	82.6 $\pm$ 2.7	74.6 $\pm$ 2.3	82.1 $\pm$ 1.8	69.7 $\pm$ 5.6
CIFAR10	56.0 $\pm$ 14.5	56.7 $\pm$ 19.6	65.4 $\pm$ 15.0	66.8 $\pm$ 16.0	57.3 $\pm$ 21.6	65.3 $\pm$ 15.9	22.3 $\pm$ 17.2
ResNet-20	5.5 $\pm$ 3.2	4.2 $\pm$ 3.5	11.5 $\pm$ 14.6	12.8 $\pm$ 17.3	9.5 $\pm$ 14.1	13.2 $\pm$ 13.4	2.4 $\pm$ 0.3
NIN	32.9 $\pm$ 25.4	5.4 $\pm$ 61.8	58.3 $\pm$ 4.3	56.9 $\pm$ 4.1	62.7 $\pm$ 14.2	59.7 $\pm$ 12.6	38.7 $\pm$ 24.6
AlexNet	69.2 $\pm$ 4.4	63.9 $\pm$ 5.6	65.0 $\pm$ 3.0	63.3 $\pm$ 2.4	57.0 $\pm$ 5.4	63.8 $\pm$ 4.3	35.3 $\pm$ 5.1
CIFAR-100 dataset
ResNet-20	3.8 $\pm$ 1.7	2.4 $\pm$ 1.0	4.6 $\pm$ 2.4	4.7 $\pm$ 2.8	5.2 $\pm$ 2.3	4.6 $\pm$ 1.7	6.4 $\pm$ 4.3
NIN	42.6 $\pm$ 2.2	35.2 $\pm$ 0.0	36.2 $\pm$ 0.1	36.2 $\pm$ 0.0	37.0 $\pm$ 0.8	36.1 $\pm$ 0.1	45.6 $\pm$ 2.0
AlexNet	57.6 $\pm$ 6.7	62.9 $\pm$ 2.2	62.7 $\pm$ 3.3	62.9 $\pm$ 2.7	52.5 $\pm$ 5.8	62.4 $\pm$ 3.7	33.1 $\pm$ 23.6
ImageNet-32 dataset
AlexNet	19.4 $\pm$ 4.1	40.4 $\pm$ 1.9	30.3 $\pm$ 1.2	30.2 $\pm$ 1.5	25.6 $\pm$ 0.9	30.4 $\pm$ 1.7	2.9 $\pm$ 1.4

Table 10: Spearman correlation (with p-value) between metric quality and computational cost of pruning including retraining.

CIFAR-10					CIFAR-100			ImageNet-32
LeNet-5	CIFAR10	ResNet-20	NIN	AlexNet	ResNet-20	NIN	AlexNet	AlexNet
-0.10 ($3 \times 10^{-1}$)	-0.5 ($8 \times 10^{-4}$)	1 ($0$)	-0.7 ($6 \times 10^{-2}$)	-0.9 ($2 \times 10^{-19}$)	-0.6 ($1 \times 10^{-1}$)	-0.6 ($4 \times 10^{-1}$)	-0.8 ($4 \times 10^{-18}$)	-0.9 ($3 \times 10^{-3}$)

Table 11: Cost of saliency metrics using $\mathcal{I}_{val}$ ($N_{val}$ batches of images).

Pointwise Metric		Cost	Pointwise Metric	Cost
$x$	$x = w$	$0$	$- x \frac{d ℒ}{d x} + \frac{x^{2}}{2} {\frac{d^{2} ℒ}{d x^{2}}}_{G N}$	$3 \times N_{v a l}$
$x$	$x = a$	$1 \times N_{v a l}$	$\frac{x^{2}}{2} {\frac{d^{2} ℒ}{d x^{2}}}_{G N}$	$3 \times N_{v a l}$
$\frac{d ℒ}{d x}$		$3 \times N_{v a l}$	$- x \frac{d ℒ}{d x} + \frac{x^{2}}{2} {\frac{d^{2} ℒ}{d x^{2}}}_{L M}$	$5 \times N_{v a l}$
$- x \frac{d ℒ}{d x}$		$3 \times N_{v a l}$	$\frac{x^{2}}{2} {\frac{d^{2} ℒ}{d x^{2}}}_{L M}$	$5 \times N_{v a l}$

Equations

S = \frac{1}{K} \cdot \widetilde{S}, \text{ with } \widetilde{S} = R \circ F(X)

\prescript{l}{}{S}_{i} = \frac{1}{N} \sum_{n=0}^{N-1} \frac{1}{K} \cdot R \circ F(X_{n})

\prescript{l}{}{S}_{i} = \frac{1}{n(\prescript{l}{}{W}_{i})} \sum_{w \in \prescript{l}{}{W}_{i}} w^{2}

\prescript{l}{}{W}_{GM} \in \operatorname{arg\,min}(g(x)), \text{ with } g(x) = \sum_{j=0}^{\prescript{l}{}{m}-1} \lVert x - \prescript{l}{}{W}_{j} \rVert_{2}

\text{Given } x = \prescript{l}{}{W}_{i}[p,q,r], \text{ then } GM(x) = \prescript{l}{}{W}_{GM}[p,q,r]

\prescript{l}{}{A} = f^{l}(\prescript{l}{}{W}, \prescript{l-1}{}{A})

\frac{d\mathcal{L}}{d\prescript{l}{}{A}} = \frac{d\mathcal{L}}{d\prescript{l+1}{}{A}} \frac{d\prescript{l+1}{}{A}}{d\prescript{l}{}{A}}

\prescript{l}{}{S} = h^{l}(\prescript{l+1}{}{W}, \prescript{l+1}{}{S})

\text{Given } x = \prescript{l}{}{W}_{i}[p,q,r], \text{ then } NISP(x) = \prescript{l}{}{S}[p,q,r]

\prescript{l}{}{S}_{i} = \frac{1}{n(\prescript{l}{}{A}_{i})} \sum_{a \in \prescript{l}{}{A}_{i}} a

\prescript{l}{}{S}_{i} = \frac{1}{n(\prescript{l}{}{A}_{i})} \sum_{a \in \prescript{l}{}{A}_{i}} f(a), \text{ with } f(a) = \begin{cases} 1, & \text{if } a > 0 \\ 0, & \text{else} \end{cases}

\mathcal{L}(W, \mathcal{I}_{set}) = \sum_{n=0}^{N-1} \prescript{n}{}{\mathcal{L}}(W) = \sum_{n=0}^{N-1} -(\prescript{n}{}{L} \odot \log(P(W, \prescript{n}{}{I})))

\prescript{l}{}{S}_{i} = \frac{1}{n(\prescript{l}{}{A}_{i})} \sum_{a \in \prescript{l}{}{A}_{i}} \frac{d\mathcal{L}}{da}

S(w_{i}) = \frac{\lvert g_{i}(w; \mathcal{I}_{set}) \rvert}{\sum_{k=1}^{m} \lvert g_{k}(w; \mathcal{I}_{set}) \rvert}

g_{i}(w;\mathcal{I}_{set}) = \frac{\partial\mathcal{L}(M\odot W;\mathcal{I}_{set})}{\partial m_{i}}\Bigr|_{M=1}

\frac{\partial\mathcal{L}(M\odot W;\mathcal{I}_{set})}{\partial m_{i}}\Bigr|_{M=1} = \frac{\partial\mathcal{L}(W;\mathcal{I}_{set})}{\partial w_{i}} \cdot w_{i}

Sensitivity = \mathcal{L}(\widetilde{W}) - \mathcal{L}(W)

\mathcal{L}(\widetilde{W}) \approx \mathcal{L}(W)

\mathcal{L}(W)

S(w_{i}) = - w_{i} \frac{\partial\mathcal{L}}{\partial w_{i}} + \frac{1}{2} w_{i}^{2} \frac{\partial^{2}\mathcal{L}}{\partial w_{i}^{2}}

\prescript{l}{}{S}_{i} = \frac{1}{n(\prescript{l}{}{A}_{i})} \sum_{a \in \prescript{l}{}{A}_{i}} a \frac{d\mathcal{L}}{da}

\prescript{l}{}{S}_{i} = \frac{1}{2} \left( \sum_{a \in \prescript{l}{}{A}_{i}} a \frac{d\mathcal{L}}{da} \right)^{2}

\prescript{l}{}{S}_{i} = \sum_{x \in \prescript{l}{}{X}_{i}} \lvert f(x) \rvert


Taxonomy of Saliency Metrics for Channel Pruning

Kaveena Persand, Andrew Anderson, and David Gregg

Authors are with the School of Computer Science and Statistics, Trinity College Dublin, Ireland. E-mail: [email protected], [email protected], [email protected]

Abstract

Pruning unimportant parameters can allow deep neural networks (DNNs) to reduce their heavy computation and memory requirements. A saliency metric estimates which parameters can be safely pruned with little impact on the classification performance of the DNN. Many saliency metrics have been proposed, each within the context of a wider pruning algorithm. The result is that it is difficult to separate the effectiveness of the saliency metric from the wider pruning algorithm that surrounds it. Similar-looking saliency metrics can yield very different results because of apparently minor design choices. We propose a taxonomy of saliency metrics based on four mostly-orthogonal principal components. We show that a broad range of metrics from the pruning literature can be grouped according to these components. Our taxonomy not only serves as a guide to prior work, but allows us to construct new saliency metrics by exploring novel combinations of our taxonomic components. We perform an in-depth experimental investigation of more than 300 saliency metrics. Our results provide decisive answers to open research questions, and demonstrate the importance of reduction and scaling when pruning groups of weights. We find that some of our constructed metrics can outperform the best existing state-of-the-art metrics for convolutional neural network channel pruning.

1 Introduction

Deep neural networks (DNNs) now offer human-level or greater accuracy for many decision problems [1, 2, 3]. However, DNNs can require enormous computation and memory resources, which may be unavailable in mobile and embedded devices where audio, image and other data often originate. Transferring this data off-device to the cloud for DNN processing raises problems of response latency and energy consumption, as well as legal and privacy issues.

One way to reduce the resource requirements of trained DNNs is to prune unused parameters, or more concretely, to replace some of the values in weight tensors by zero. An ideal pruning algorithm would remove the maximal number of weights from a network while maintaining or improving accuracy. While there is a huge variety of pruning schemes in the literature, the vast majority have, at their heart, a saliency metric. A saliency metric is used to answer a fundamental question in pruning: which weight, or set of weights, when removed, will likely cause the least damage to the network predictions? Since the saliency metric is typically presented within the context of a larger pruning algorithm, it is often extremely difficult to isolate the effect of the saliency metric from other design choices.

Pruning of DNN weight tensors can be performed at different levels of granularity [4, 5], from individual weights [6] to large sub-blocks. Individual weights typically participate in the computation of many different elements of the output feature map. Thus, at most levels of pruning granularity, saliency metrics focus on the weight itself rather than the output feature maps computed using the weight. However, for convolutional neural networks, there is one level of pruning granularity, channel pruning, where there is a direct relationship between sub-blocks of the weight tensor and sub-blocks of the output feature map. We therefore focus on channel pruning, which allows us to compare metrics based on weights or output feature maps. Pruning full channels also yields a dense weight tensor which can be used with existing highly-optimized DNN libraries.

Contributions

In this paper, we study the impact of the choice of saliency metric with a canonical channel pruning algorithm for CNNs. Although our empirical results are for this specific context, there is a strong argument for the generality of some of our findings, which we highlight in discussion. We make the following specific contributions.

• We propose a taxonomy that classifies saliency metrics based on four mostly-orthogonal components.

• We empirically evaluate 308 saliency metrics, including metrics from prior work and novel metrics derived from new combinations of taxonomic components.

• We experimentally confirm a widely-acknowledged rule of thumb: gradient-based approaches as a class significantly outperform simpler weight-based methods.

• We answer open research questions, such as whether the popular strategy of ignoring first-order terms in Taylor-expansions is safe in practice.

• We find the non-obvious result that the choice of dimensionality reduction and parameter scaling method has a large impact on saliency metrics, and we propose an effective novel scaling method for channel pruning.

• We show that good saliency metrics can be effective even without any subsequent fine-tuning/retraining, or greatly reduce the number of iterations required if retraining is used.

2 Background

A large number of pruning schemes have been proposed. Most pruning schemes either modify the training process to gradually prune weights or have an explicit discrete step that removes weights from the network. As well as the mechanism used to drive weights to zero, pruning schemes also determine at which stage of the training process pruning should apply. While early work focused on pruning fully trained networks [7, 8, 9, 10], recent work has shown that various methods can be used to remove parameters from the network at different stages of the training process [11, 12].

Pruning schemes can be categorized according to three dimensions: pruning method, training strategy, and estimation criterion [13]. Saliency metrics, or estimation criteria [13], are used to estimate which weights can be pruned from the network. Saliency metrics are used whether pruning is incorporated into the training algorithm [14, 15, 16, 17, 18, 19, 20, 21] or occurs in discrete steps which set weights to zero [22, 4, 23, 9, 10, 24, 25, 11, 12]. The same saliency metric can be used in very different pruning schemes. For example, the L1-norm of weights has been used as a saliency metric with simple pruning schemes [6, 4], with probabilistic pruning [26], with reinforcement learning [27], and even for pruning at initialization [11]. Saliency metrics can also be used when pruning in the Winograd domain [28, 29] and the frequency domain [30].

Saliency metrics can be grouped according to the information used to compute them, as shown in Figure 1. Saliency metrics that use only the weights have the advantage of having all required information readily available. Data-driven approaches require training data to make pruning decisions. Approaches that use only input images require only forward passes of the network. Approaches that use gradients additionally require input labels to compute the loss and backward passes of the network to compute the gradients. The main differentiating factor between these classes of approach in practice (Figure 1) is the cost associated with the use of more information.

Saliency metrics can also be grouped according to how they identify least important weights [13]. Simple metrics like the L1-norm of weight [6] or APoZ [22] assume that small weights or feature maps with high frequencies of zeros are less important to the network. Taylor expansion-based metrics often approximate the change in global [9, 10, 23, 16] or layer-wise [31] loss caused by pruning and remove weights that cause the least change in loss. ThiNet [32], He et al. [33], and Lin et al. [34] choose weights that lead to the least feature map reconstruction error. Hur and Kang [35] use the entropy of the weights to determine which weights are least important. While these saliency metrics are derived with different assumptions, we can compare them by expressing them in a standard form.

3 A Taxonomy of Saliency Metrics

We describe existing saliency metrics using a taxonomy of four principal components. A fine-grain saliency metric is constructed from a pointwise metric $F$ over the parameter set $X$. When we prune larger groups of weights, such as entire channels, we need to combine the saliency of individual weights into a metric for the entire channel. We do this with a dimensionality reduction $R$ and normalization $K$. Some examples of choices for $X$, $F$, $R$ and $K$ are given in Table 1.

The general form of the saliency metric for an arbitrary subset of parameters $X$ is given by Equation 1.

$$S = \frac{1}{K}\cdot\widetilde{S}, \quad \text{with } \widetilde{S} = R\circ F(X) \qquad (1)$$

We introduce some notation to facilitate the description and comparison of different saliency metrics. Consider a CNN with loss function $\mathcal{L}$ , and trained parameters vector $W$ . Let $\widetilde{W}$ represent the pruned parameters vector and $n(\cdot)$ the cardinality of a tensor or vector, then $n(W)>n(\widetilde{W})$ .

In our mathematical treatment, we consider this general case unless stated otherwise. However, in our experimental evaluation, we are concerned with parameter subsets $X$ corresponding to the parameters $\prescript{l}{}{W}_{i}$ of a channel of a convolutional layer. Thus, $\prescript{l}{}{S}_{i}$ denotes the saliency of the $i^{th}$ channel of the $l^{th}$ convolution layer of the CNN.
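The general form in Equation 1 can be read as a three-stage pipeline: apply a pointwise function to every element of $X$, reduce the result to one number, then divide by the scaling coefficient. The following is a minimal sketch of that composition in plain Python. The function name `channel_saliency`, its parameter names, and the weight values are assumptions made for illustration and do not come from the paper.

```python
from typing import Callable, Sequence


def channel_saliency(
    x: Sequence[float],
    f: Callable[[float], float],
    reduce: Callable[[Sequence[float]], float],
    k: float,
) -> float:
    """Return (1/K) * R(F(X)) for one group of parameters X."""
    pointwise = [f(v) for v in x]   # F(X): one saliency value per element of X
    unscaled = reduce(pointwise)    # R: collapse the pointwise values to a scalar
    return unscaled / k             # scale by 1/K


# Hypothetical channel with four weights (illustrative values only).
weights = [0.5, -1.0, 0.25, -0.25]

# Mean absolute weight: f(x) = x, R = sum of magnitudes, K = n(X).
mean_abs = channel_saliency(
    weights,
    f=lambda v: v,
    reduce=lambda p: sum(abs(v) for v in p),
    k=len(weights),
)
print(mean_abs)  # 0.5
```

Keeping the three stages as separate arguments mirrors the taxonomy: swapping one argument while holding the others fixed yields a different metric.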

3.1 Domain (choice of $X$ )

In our taxonomy, saliency metrics can be based upon the weights themselves ($X=W$), or the values of output feature maps that are computed using the weights ($X=A$). There is often a close relationship between the magnitude of a weight and the sensitivity of the DNN to pruning the weight, so many saliency metrics use the weight as a key input. However, in the specific case of pruning entire output channels, there is a direct relationship with the output points corresponding to the $i^{th}$ channel of the $l^{th}$ convolution layer $\prescript{l}{}{C}_{i}$ (i.e. the feature map $\prescript{l}{}{A}_{i}$). Removing all parameters contributing to an output feature map in a convolutional layer results in the feature map becoming zero. When this happens, all of the operations which are transitively used to compute the operation can also be pruned, resulting in large savings.

Hence, the saliency of a channel can be regarded as a function of outputs [22, 36, 37, 38], rather than parameters, i.e. in Equation 1 $X$ can be either the weights, $\prescript{l}{}{W}_{i}$ , or the output feature map, $\prescript{l}{}{A}_{i}$ . The relationship between the weights and feature maps of a channel is illustrated in detail in Figure 2.

It should be noted that since channel pruning can be viewed as feature map removal, feature selection metrics can also be used as saliency metrics for pruning [39].

When pruning at granularities finer than entire channels, the removal of output points cannot be directly mapped to the removal of sets of weights. Hence, it is clear that $X=W$ is the best choice. However, for channel pruning it is difficult to choose definitively between output feature map or weights. In fact, most published saliency metrics are presented either as functions of weights or of output points, despite being applicable to both. Some metrics [9, 10] were originally defined using weights (as they were used to prune individual weights), but derived metrics [23, 40] use output feature maps instead.

To better illustrate the effect of the four orthogonal choices, we use an example for channel pruning. In Figure 3, we show how to compute the saliency of a convolution layer’s channels using its weights. This corresponds to the case where $X=W$ .

3.2 Pointwise metric (choice of $F$ )

We denote by $F(X)$ the tensor of pointwise saliency of all individual weights or output points. $F(X)$ is of the same shape as $X$, that is, either of the shape of $W$ or $A$. When pruning, it is common to look at either the saliency of an individual element of the saliency tensor or at a group of them. To facilitate this grouping, we introduce $\prescript{l}{}{F(X)}_{i}$ and $f(x)$. $\prescript{l}{}{F(X)}_{i}$ is the tensor of saliency corresponding to the $i^{th}$ channel of the $l^{th}$ layer. $\prescript{l}{}{F(X)}_{i}$ is of the same shape as $\prescript{l}{}{X}_{i}$. Hence if $X=W$ or $X=A$, then $\prescript{l}{}{F(X)}_{i}$ is of the shape $\prescript{l-1}{}{m}\times\prescript{l}{}{k}\times\prescript{l}{}{k}$ or $\prescript{l}{}{height}\times\prescript{l}{}{width}$ respectively. $f(x)$ is used to denote the saliency of a single weight or output point. If $x=\prescript{l}{}{W}_{i}[p,q,r]$ is an individual weight from $\prescript{l}{}{W}_{i}$, then $f(x)=\prescript{l}{}{F(X)}_{i}[p,q,r]$. Similarly, when using the output feature map instead of the weights, if $x=\prescript{l}{}{A}_{i}[p,q]$ is an individual output point then $f(x)=\prescript{l}{}{F(X)}_{i}[p,q]$.

A common pointwise saliency function is the absolute magnitude function, i.e. the saliency of an individual weight or output point is given directly by its absolute value, hence $f(x)=\left\lvert x\right\rvert$ .

In Figure 3, the gradient of an element is used as a saliency metric, $F(X)=J$ or $f(x)=\frac{d\mathcal{L}}{dx}$ . Applying $F$ to $W$ yields a tensor containing the saliency of the individual weights.
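To make the role of $f$ concrete, the sketch below applies the first three pointwise metrics listed in Table 1 ($x$, $\frac{d\mathcal{L}}{dx}$ and $-x\frac{d\mathcal{L}}{dx}$) elementwise to one channel. It assumes the gradients have already been computed elsewhere; the dictionary name `POINTWISE` and all numeric values are hypothetical.

```python
from typing import Callable, Dict, List

# Each pointwise metric maps a value x and its gradient g = dL/dx to f(x).
POINTWISE: Dict[str, Callable[[float, float], float]] = {
    "value": lambda x, g: x,            # f(x) = x
    "gradient": lambda x, g: g,         # f(x) = dL/dx
    "taylor_1st": lambda x, g: -x * g,  # f(x) = -x * dL/dx
}

# Hypothetical weights of one channel and their (precomputed) gradients.
xs: List[float] = [0.5, -2.0, 1.0]
grads: List[float] = [0.25, 0.5, -1.0]

# F(X) has the same shape as X: one saliency value per element.
pointwise = {
    name: [f(x, g) for x, g in zip(xs, grads)]
    for name, f in POINTWISE.items()
}
print(pointwise["taylor_1st"])  # [-0.125, 1.0, 1.0]
```

Note that the pointwise values can be negative, so the choice of reduction that follows determines whether opposite signs cancel.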

3.3 Dimensionality Reduction (choice of $R$ )

Once the pointwise saliency vector is obtained, a reduction is used to condense the tensor of pointwise saliency to a single value for the pruning element. In the case of channel pruning, $R$ reduces either a $\prescript{l-1}{}{m}\times\prescript{l}{}{k}\times\prescript{l}{}{k}$ tensor or a $\prescript{l}{}{height}\times\prescript{l}{}{width}$ tensor into a single value.

Any suitable vector norm could be used as a reduction, with the L2-norm being a popular reduction method in the literature [41, 36].

In Figure 3 an L1-norm is used to reduce the $\prescript{l}{}{m}\times\prescript{l}{}{k}\times\prescript{l}{}{k}$ tensor of weight saliency into a vector containing the layer’s channel saliency, $\prescript{l}{}{\widetilde{S}}$, of dimension $\prescript{l}{}{m}$.
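The six reductions listed in Table 1 look alike but can rank channels differently. The sketch below, under the assumption of two made-up channels, shows that summing magnitudes and taking the magnitude of the sum disagree when pointwise values cancel. The dictionary name `REDUCTIONS` and the channel values are illustrative only.

```python
import math
from typing import Callable, Dict, Sequence

REDUCTIONS: Dict[str, Callable[[Sequence[float]], float]] = {
    "sum": lambda p: sum(p),
    "sum_abs": lambda p: sum(abs(v) for v in p),       # L1-norm
    "abs_sum": lambda p: abs(sum(p)),
    "sum_sq": lambda p: sum(v * v for v in p),
    "sq_sum": lambda p: sum(p) ** 2,
    "l2": lambda p: math.sqrt(sum(v * v for v in p)),  # L2-norm
}

# Hypothetical pointwise saliencies of two channels.
channel_a = [1.0, -1.0, 1.0, -1.0]  # large values that cancel when summed
channel_b = [0.5, 0.5, 0.5, 0.5]    # small values that add up

for name, reduce in REDUCTIONS.items():
    print(name, reduce(channel_a), reduce(channel_b))
```

With these values, `sum_abs` scores `channel_a` higher (4.0 against 2.0), while `abs_sum` scores it lower (0.0 against 2.0), so the same pointwise metric leads to opposite pruning decisions.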

3.4 Scaling (choice of $K$ )

All other things being equal, the more parameters which can be pruned, the lower the computational and memory costs of inference. Therefore, if two channels have similar saliency, one should favour pruning the larger channel, which is the Pareto-optimal choice considering the twin objectives of minimizing accuracy loss and maximizing the number of parameters pruned.

To account for this cost, a scaling coefficient $K$ is typically used. One solution is to scale the channel saliency $\prescript{l}{}{S}_{i}$ using the cardinality of the pointwise saliency vector. In other words, one can look at the average saliency of a group instead of the sum of the saliency in the group. For example, instead of using the sum of the magnitudes or L1-norm of the weights, $\left\lVert\prescript{l}{}{W}_{i}\right\rVert_{1}$, one can use the average of the magnitudes, $\frac{1}{n(\prescript{l}{}{W}_{i})}\left\lVert\prescript{l}{}{W}_{i}\right\rVert_{1}$. This normalizes the result of the reduction, $R$, so that channels with many weights are more likely to be pruned, leading to greater overall sparsity. Another solution is to perform a layer-wise normalization to scale the magnitudes of saliency across layers. Using a layer-wise L2-norm in the case of global pruning helps when values for saliency in different layers have drastically different magnitudes [23].

In Figure 3, we use a layer-wise L2-norm as scaling factor. We use the L2-norm of the unscaled saliency of all the channels in the given convolution layer, $\prescript{l}{}{\widetilde{S}}$ , to obtain the scaling coefficient, $K$ , of that layer.
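The two scaling choices described above can be sketched as follows, assuming the unscaled channel saliency of one layer is already available as a vector and that scaling means dividing by $K$. The function names are hypothetical.

```python
import numpy as np

def scale_by_cardinality(unscaled, n_points):
    # K = n(X_i): turn a sum over the channel into an average.
    return unscaled / n_points

def scale_by_layer_l2(unscaled):
    # K = L2-norm of the unscaled saliency of all channels in the layer.
    norm = np.linalg.norm(unscaled)
    return unscaled / norm if norm > 0 else unscaled

unscaled = np.array([3.0, 4.0, 12.0])          # unscaled saliency of three channels
averaged = scale_by_cardinality(unscaled, n_points=27)
layer_scaled = scale_by_layer_l2(unscaled)     # the layer-wise L2-norm here is 13
```

After layer-wise scaling, the saliency vector of every layer has unit L2-norm, which is what makes values comparable across layers in global pruning.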

3.4.1 Proposed scaling method, $\mathcal{TC}$

We propose to investigate a new scaling method, $\mathcal{TC}$ . $\mathcal{TC}(\prescript{l}{}{W}_{i})$ is used to denote the entire set of weights transitively removed when $\prescript{l}{}{W}_{i}$ is removed from the network. A simple example of $\mathcal{TC}(\prescript{l}{}{W}_{i})$ is shown in Figure 2, where weights from the next convolution layer are also removed when we remove an output channel from the network. Since $\mathcal{TC}(\prescript{l}{}{W}_{i})$ includes $\prescript{l}{}{W}_{i}$ , $n(\mathcal{TC}(\prescript{l}{}{W}_{i}))\geq n(\prescript{l}{}{W}_{i})$ . When optimising for the maximum number of weights removed for the least loss in accuracy, $\mathcal{TC}(\prescript{l}{}{W}_{i})$ is interesting because it takes into account all the weights removed for that channel.
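For the simplest case, the size of the transitive set can be counted by hand. The sketch below assumes a plain chain of two convolution layers with no biases, normalization parameters, or skip connections; these simplifications and the function names are assumptions made for illustration only.

```python
def n_channel_weights(m_in, k):
    # Weights of one output channel: one k x k kernel per input channel.
    return m_in * k * k

def n_transitive_weights(m_in, k, m_out_next, k_next):
    # Removing an output channel also removes the matching input-channel
    # slice from every output channel of the next convolution layer.
    own = n_channel_weights(m_in, k)
    downstream = m_out_next * k_next * k_next
    return own + downstream

own = n_channel_weights(m_in=16, k=3)
total = n_transitive_weights(m_in=16, k=3, m_out_next=32, k_next=3)
```

With these sizes the channel itself holds 144 weights and the transitive set holds 432, so the downstream layer contributes most of the weights removed.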

3.5 Minibatches

Data driven approaches which consider gradients or output points in the domain $X$ often rely on a set of inputs to produce these values. Since each input in the set will result in a potentially different set of saliency values, the computation of the saliency over a minibatch is typically done by combining the element-wise saliency values using a simple average across the minibatch [40, 23, 36, 22].

Hence, the full saliency equation used in practice is given by Equation 2 with $N$ the total number of images used in the minibatch.

[TABLE]

In some cases, a square root is applied to the resulting saliency metric [36]. This additional computation does not modify the ranking of the channels and can thus be omitted for algorithms only concerned with channel ranking.
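The claim that the square root leaves the ranking unchanged follows from the square root being strictly increasing on non-negative values. A small check, assuming for simplicity that the minibatch average is taken over per-input channel saliency values (the exact order of operations is the one defined by Equation 2) and using random non-negative numbers in place of real saliency values:

```python
import numpy as np

rng = np.random.default_rng(1)
per_input = rng.random(size=(8, 5))        # N = 8 inputs, 5 channels, non-negative values
saliency = per_input.mean(axis=0)          # simple average across the minibatch

rank_plain = np.argsort(saliency)          # channel order, least salient first
rank_sqrt = np.argsort(np.sqrt(saliency))  # same order after the square root
```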

4 Classification of existing saliency metrics

In this section, we classify popular saliency metrics using our taxonomy. Table 3 summarizes some channel saliency metrics that have been used for pruning convolution channels. We can obtain new saliency metrics by selecting different combinations of the four components in Table 1. Although some of these saliency metrics resemble each other, their efficacy can vary.

This paper focuses on using saliency metrics specifically for channel pruning. However, the taxonomy can also be used to classify saliency metrics used for different granularities. For fine-grain pruning, i.e. pruning individual weights, only the pointwise saliency is relevant. For other granularities of pruning, pointwise metrics are computed across the relevant substructure in parameter tensors and then dimensionally reduced (e.g. with a vector or tensor norm) to yield structural metrics. Table 2 shows the classification of a selection of popular pointwise saliency metrics that have been used for fine-grain pruning.

5 Weight based Saliency Metrics

When computing weight-based metrics, all the information required to compute the saliency is readily available. The most common saliency metric used for pruning is the magnitude of the weights. Specifically, the L2 and L1 norms of weights have been used in many pruning schemes [4, 17, 43, 41, 26, 6, 24, 27, 11, 42] for different granularities of pruning. Magnitude-based saliency metrics assume that weights of lower magnitude have a lesser contribution to the network.

Another weight-based heuristic is min-weight [23]. The sum of the squared individual weights is scaled by the number of weights in the channel to give the channel saliency (Equation 3).

[TABLE]
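As described in the text, min-weight amounts to the mean of the squared weights of a channel. A minimal sketch, assuming a weight array of shape (output channels, input channels, k, k); the function name is hypothetical.

```python
import numpy as np

def min_weight_saliency(weights):
    # Sum of squared weights per output channel, divided by the
    # number of weights in that channel.
    n_per_channel = weights[0].size
    return (weights ** 2).sum(axis=(1, 2, 3)) / n_per_channel

weights = np.full((2, 3, 3, 3), 2.0)       # every weight equals 2
saliency = min_weight_saliency(weights)    # mean of squares is 4 for both channels
```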

Most weight-based metrics consider smaller weights to be less important to the network. For coarse granularities of pruning, one can also consider removing redundant sets of weights. In the case of channel pruning, one can remove channels that are similar to other channels. Redundant channels can be removed if they are not contributing to the final result. He et al. [45] remove channels that lie within a small Euclidean distance of the layer’s geometric median channel (see Equation 4).

[TABLE]

In this case the pointwise saliency $f(x)$ is given by $x-GM(x)$ where $GM(x)$ is given according to Equation 5.

[TABLE]

By removing channels that are close to the geometric median, one can assume that they are already represented by the geometric median. We denote by $reconstruction(x)$ any reconstruction of $x$ after pruning; in this case $reconstruction(x)=GM(x)$ .
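A rough sketch of a distance-to-median saliency. It assumes, purely for illustration, that the geometric median is approximated by the channel of the layer that minimizes the sum of Euclidean distances to all channels of that layer; the exact definition is the one given by Equation 5. Function names are hypothetical.

```python
import numpy as np

def geometric_median_channel(weights):
    # Flatten each output channel to a vector, then pick the channel whose
    # summed Euclidean distance to all channels of the layer is smallest.
    flat = weights.reshape(weights.shape[0], -1)
    dists = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=2)
    return flat[dists.sum(axis=1).argmin()]

def distance_to_median(weights):
    # Channel saliency: Euclidean distance between the channel and the median.
    flat = weights.reshape(weights.shape[0], -1)
    return np.linalg.norm(flat - geometric_median_channel(weights), axis=1)

weights = np.array([0.0, 1.0, 10.0]).reshape(3, 1, 1, 1)   # three tiny channels
saliency = distance_to_median(weights)
```

In this toy layer the middle channel is the median, so it receives the lowest saliency and would be the first candidate for removal.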

5.0.1 Recursive Weight Based Metrics

Common weight-based methods only use weights from a single layer. Saliency metrics such as the L1-norm of weights treat channels as independent components.

To illustrate this independence, let us consider the example of using the L1-norm of weights as a saliency metric. First, we compute the saliency metric using the L1-norm of weights, then remove the least salient channel. We then recompute the saliency of the remaining weights. The saliency of the channels that were not pruned remains unchanged.

Now, let us consider the case where we use the L1-norm of output points. We compute the saliency of all the channels, prune the least salient channel and finally recompute the saliency of the unpruned channels. Let us assume that the channel selected for pruning is from the third layer of the network. The saliency of the channels from the first, second and third layers (excluding the pruned channel) is unchanged. However, the saliency of channels from following layers may have changed. While the saliency of each channel is computed in an independent way, the underlying information (the output points) can be expressed recursively using the previous layers. The recursive component is implicit. The recursive equation for output points is given in Equation 6 with $f^{l}$ being the forward pass function of the $l^{th}$ layer.

[TABLE]

Saliency metrics that use the gradients of the loss with respect to the weights or the output points also have a component that is computed recursively using backpropagation (see Equation 7). Hence, the pointwise metrics in Table 1 can use information from different layers.

[TABLE]

Neuron Importance Score Propagation (NISP) [44] uses an explicitly recursive way of propagating saliency information between layers.

Equation 8 shows the general propagation equation used in NISP with $h^{l}$ being a function given by the authors. $f^{l}$ depends only on the type of the layer.

[TABLE]

In this case the pointwise saliency $f(x)$ used by $NISP(x)$ is given in Equation 9.

[TABLE]

The method used by NISP can be considered as an alternative way of backpropagating information through the network.

Gradient backpropagation and its alternatives can be used for pruning. An alternative to backpropagation of gradients is Layer-wise Relevance Propagation (LRP) [50]. LRP has successfully been used as a saliency metric for pruning [51]. Similar to gradient backpropagation and NISP, LRP propagates information recursively from the last (output) layer of the network to its first (input) layer. LRP requires weights and input images as it is a data driven approach but can, nonetheless, be expressed in a similar form to Equation 8.

We denote by $alternateBackprop$ any alternative to backpropagation of information in neural networks. Hence, in the case of NISP we have $alternateBackprop(x)=NISP(x)$ .

6 Weight and Input Images Based Saliency Metrics

Channel pruning is a notable granularity of pruning, as pruning an entire channel of weights leads to the removal of a feature map from the network. A given channel of weights that operates on a given input produces one output feature map. It may be possible to identify good candidates for pruning by selecting feature maps with low or zero outputs.

Output values across inferences on multiple inputs can be gathered, and their results summarized using statistical measures. Saliency metrics use the sum [37], mean and variance [38], or L2-norm [36] of feature maps to identify low saliency channels.

The feature map produced by the convolution layer is not the only feature map that can be used. For example, the average percentage of zeros (APoZ) [22] measures the percentage of zero values in the output activations for a given channel, and computes the average across multiple inputs.

[TABLE]

Quite often, convolution layers are followed by ReLU layers which only retain positive outputs. A negative mean would indicate that on average the outputs produced by that channel were negative and likely to be driven to zeros by ReLU. Hence, APoZ considers the average of the output points after ReLU. APoZ considers channels with a higher percentage of zeroes to have a lower saliency.

[TABLE]
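A minimal sketch of an APoZ-style score, assuming post-ReLU activations are stored in an array of shape (inputs, channels, height, width). Because every input contributes the same number of points, averaging over inputs and spatial positions at once equals averaging the per-input percentages. The function name is hypothetical.

```python
import numpy as np

def apoz(post_relu):
    """Fraction of zero activations per channel, averaged over inputs and positions."""
    # post_relu has shape (N, m, height, width); a higher value means lower saliency.
    return (post_relu == 0).mean(axis=(0, 2, 3))

pre_activation = np.ones((2, 2, 2, 2))        # 2 inputs, 2 channels, 2x2 feature maps
pre_activation[:, 1, 0, :] = -1.0             # half of channel 1 is negative
post_relu = np.maximum(pre_activation, 0.0)   # ReLU drives negative outputs to zero
scores = apoz(post_relu)
```

Channel 0 never outputs zero, while half of the outputs of channel 1 are zero, so channel 1 would be considered the less salient of the two.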

A sub-category of saliency metrics that use weights and input images are metrics inspired by reconstruction error. Saliency metrics that are based on reconstruction error have a pointwise metric in the form of $x-reconstruction(x)$ . In the case of ThiNet [32], $x$ is a point sampled from the input feature map and $reconstruction(x)$ is its reconstruction after pruning. Metrics used by ThiNet [32], He et al. [33], and Liu et al. [34] choose channels based on the least error incurred to output feature maps. Hence, the layer-wise feature map error is used as a saliency metric. The main difference between the approach explored by ThiNet [32] and He et al. [33] is how they estimate the damaged feature map. Liu et al. [34] also introduce a layer-wise loss alongside the reconstruction error.

7 Weight, Input Images and Labels Based Saliency Metrics

Using only the weights and input images to compute saliency metrics does not allow one to know directly how the classification performance is being affected. Corresponding labels are also needed to obtain this information. The loss is a measure of how well the predictions of the network match the ground truth. Consequently, the gradients with respect to the loss also carry this information. A network that has reached its minimum loss has a gradient of zero.

The equation of the cross-entropy loss, $\mathcal{L}$ , is given by Equation 12 for a network with weights $W$ and input dataset $\mathcal{I}_{set}$ . The dataset $\mathcal{I}_{set}$ contains $N$ pairs of input images, $\prescript{}{n}{I}$ , and labels (or true vector of probabilities classifying $\prescript{}{n}{I}$ ), $\prescript{}{n}{L}$ . $P$ gives the output of the network, i.e., the vector of probabilities classifying the input image.

[TABLE]

$P$ is evaluated during a forward pass of the network using only the weights and input images. On the other hand, evaluating $\mathcal{L}$ requires the ground truth, and so do the gradients of the loss. Hence, the loss and its gradients carry more information than the weights or output feature maps alone. To compute the gradient of the loss, a backward pass as well as a forward pass is required.

The use of gradients in saliency metrics for pruning was introduced by Mozer and Smolensky’s Skeletonization [7], Lecun et al.’s Optimal Brain Damage [9], and Hassibi and Stork’s Optimal Brain Surgeon [10].

A simpler gradient based saliency measure was proposed by Liu and Wu [48] where the average of the gradients of the output feature maps (Equation 13) is used as a saliency measure of a channel. They put forward that channels that are no longer updated by the SGD algorithm can be pruned safely.

[TABLE]

While Optimal Brain Damage introduced the use of Taylor expansions for deriving saliency methods, other more recent approaches have also used Taylor expansions to obtain different saliency metrics. These methods are further explained in Section 7.2.

Saliency metrics that use a Taylor expansion estimate the error caused to the final loss of the network. Dong et al. [31] introduce a layer-wise error, hence a layer-wise sensitivity and propose a method to propagate this layer-wise sensitivity to deduce the final impact on the network. They use the saliency measure introduced by Optimal Brain Surgeon [10] to estimate the layer-wise sensitivity.

7.1 Connection sensitivity

Lee et al. [12] define a saliency measure, Connection Sensitivity, based on gradients of a mask term. Each individual weight, $w_{i}$ , has a mask term, $m_{i}$ , that can be either one or zero. Their saliency measure is derived using the gradients of the mask terms instead of the gradients of the weights.

[TABLE]

To facilitate the comparison of Connection Sensitivity to other saliency metrics, we recall some notation. We use $W$ to denote the vector of the weights and $M$ , the vector of the mask terms. $M$ and $W$ have the same dimensions. The vector of pruned weights, $\widetilde{W}$ with $i^{th}$ element $\widetilde{w_{i}}$ , is given by applying the mask on the weights with $\widetilde{W}=M\odot W$ or $\widetilde{w_{i}}=m_{i}\cdot w_{i}$ .

Using this substitution in Equation 15, we obtain Equation 16. The gradient of the loss with respect to $\widetilde{W}$ , $\frac{d\mathcal{L}}{d\widetilde{W}}$ , is given by regular backpropagation rules.

In the case of Connection Sensitivity [12], since $M=1$ , i.e. all the components of $M$ are set to 1, we can express the gradients of the mask terms in terms of the known gradients of the weights. From Equation 18, we see that in this case using the absolute value of the gradient of the mask, $\left\lvert g_{i}\right\rvert$ , is similar to using the absolute value of the first term of a Taylor expansion, $\left\lvert w_{i}\frac{d\mathcal{L}}{dw_{i}}\right\rvert$ , from Equation 22.

[TABLE]
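The relation between the mask gradient and the weight gradient can be checked numerically on a toy problem. The sketch below assumes a simple quadratic loss on the masked weights (an illustrative stand-in, not the cross-entropy loss of a network) and estimates the mask gradient by central finite differences at $M=1$.

```python
import numpy as np

target = np.array([0.5, -1.0, 2.0])

def loss(w, m):
    # Toy loss evaluated on the masked weights m * w.
    return float(np.sum((m * w - target) ** 2))

def grad_wrt_weights(w):
    # Analytic gradient of the toy loss with respect to w when every mask term is 1.
    return 2.0 * (w - target)

def grad_wrt_mask(w, eps=1e-6):
    # Central finite-difference estimate of dL/dm_i at m = 1.
    m = np.ones_like(w)
    g = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        g[i] = (loss(w, m + step) - loss(w, m - step)) / (2 * eps)
    return g

w = np.array([1.0, 0.3, -0.7])
g_mask = grad_wrt_mask(w)              # gradient of the loss with respect to the mask
g_taylor = w * grad_wrt_weights(w)     # weight times gradient of the loss w.r.t. the weight
```

On this toy loss the two vectors agree up to numerical error, which is the chain-rule identity behind the comparison made in the text.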

7.2 Taylor Expansion

Most of the gradient based saliency measures discussed in this paper can be summarized as estimating the sensitivity of the network to the parameters removed when a convolutional filter channel is pruned. The sensitivity of a parameter was first introduced for fully-connected layers [7] as the change in the error of the network on the training set caused by removing that parameter. This definition can be extended to convolution channels by using the change in error induced by removing the set of parameters associated with that channel. The sensitivity of pruning a network with loss, $\mathcal{L}$ , and unpruned weights, $W$ , to pruned weights, $\widetilde{W}$ , is given in Equation 19.

[TABLE]

One of the first approaches to estimate the sensitivity of a weight was proposed by Lecun et al.’s Optimal Brain Damage [9]. They use a simplified second order Taylor expansion on the trained neural network. A Taylor expansion is used to estimate the loss function, $\mathcal{L}$ , at the pruned weights, $\widetilde{W}$ , using the trained weights, $W$ .

[TABLE]

A second order Taylor expansion around the trained weights, $W$ , is given in Equation 20, where $J$ and $H$ are respectively the Jacobian and Hessian of the loss function at trained parameters $W$ . The Jacobian matrix, $J\in\mathcal{R}^{N_{weights}}$ , is defined as $J_{i}=\frac{d\mathcal{L}}{dw_{i}}$ and the Hessian matrix, $H\in\mathcal{R}^{N_{weights}\times N_{weights}}$ , is defined as $H_{ij}=\frac{d^{2}\mathcal{L}}{dw_{i}dw_{j}}$ .

To prune an individual weight, ${w_{i}}$ , from the network, ${w_{i}}$ is set to zero, i.e. the $i^{th}$ parameter of $\widetilde{W}$ is set to zero. To easily understand how the saliency of a single parameter is derived, Equation 20 can be rewritten for pruning a single parameter, $w_{i}$ , in Equation 21.

[TABLE]

The approximation of the sensitivity given by the second order Taylor expansion can be used as a saliency metric [9]. Hence, the saliency of a single weight is given by Equation 22. Similarly, pruning a set of parameters means setting these parameters to zero in $\widetilde{W}$ .

[TABLE]
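A small numerical illustration of the single-weight Taylor estimate, assuming a toy quadratic loss so that the second order expansion is exact. The loss, the matrix `A`, and the vector `b` are illustrative assumptions; the estimate uses the form $-w_{i}\frac{d\mathcal{L}}{dw_{i}}+\frac{w_{i}^{2}}{2}\frac{d^{2}\mathcal{L}}{dw_{i}^{2}}$ .

```python
import numpy as np

A = np.array([[2.0, 0.5], [0.5, 1.0]])   # Hessian of the toy loss (symmetric)
b = np.array([-1.0, 0.3])

def loss(w):
    return 0.5 * w @ A @ w + b @ w

w = np.array([0.8, -0.4])                # stand-in for trained weights
grad = A @ w + b                         # gradient of the toy loss at w

i = 0                                    # index of the weight to prune
pruned = w.copy()
pruned[i] = 0.0
true_change = loss(pruned) - loss(w)     # exact change in loss

first_order = -w[i] * grad[i]                             # gradient term only
second_order = first_order + 0.5 * w[i] ** 2 * A[i, i]    # plus diagonal Hessian term
```

Here the gradient is not zero, so the first order term alone and the full second order estimate give different answers; only the latter matches the true change in loss.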

Computing Equation 20 exactly is very expensive. While the first order term of the equation (the term involving the gradients) is computed in linear time, the higher order terms are more difficult to obtain. The size of the Hessian matrix grows quadratically with the number of weights in the network. To better understand the difference in computation and memory cost, let us consider the number of operations to compute each element of a feature map through backpropagation. During backpropagation, the gradients of the output feature map of a layer are given by the next layer’s backpropagation. It is the gradients of the input feature map that are computed during backpropagation. Given a layer $l$ , the gradients $\frac{d\mathcal{L}}{d\prescript{l}{}{A}}$ are known, and $\frac{d\mathcal{L}}{d\prescript{l-1}{}{A}{}}$ is computed during the $l^{th}$ backward pass. Computing each point of $\frac{d\mathcal{L}}{d\prescript{l-1}{}{A}}$ has a complexity $O(\prescript{l}{}{m}\times(\prescript{l}{}{k})^{2})$ . There are $\prescript{l-1}{}{m}\times\prescript{l-1}{}{height}\times\prescript{l-1}{}{width}$ such points to be computed for the gradient of the input feature map. To understand the higher complexity of computing the full Hessian matrix, let us consider the computational cost of the layer-wise Hessian using the chain rule. If we did a full backpropagation of the layer-wise Hessian, we would instead need to compute $\frac{(\prescript{l-1}{}{m}\times\prescript{l-1}{}{height}\times\prescript{l-1}{}{width})^{2}}{2}$ points, each having a complexity $O(\prescript{l}{}{m}\times(\prescript{l}{}{k})^{4})$ .
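To give a sense of scale for the storage cost alone, the following back-of-the-envelope sketch compares the number of gradient entries with the number of entries in a full Hessian. The network size of one million weights and the 4-byte float are arbitrary assumptions chosen for illustration.

```python
def gradient_entries(n_weights):
    # One first derivative per weight.
    return n_weights

def hessian_entries(n_weights):
    # One second derivative per pair of weights.
    return n_weights * n_weights

n_weights = 1_000_000
bytes_per_float = 4
grad_bytes = gradient_entries(n_weights) * bytes_per_float   # about 4 megabytes
hess_bytes = hessian_entries(n_weights) * bytes_per_float    # about 4 terabytes
```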

To reduce computation and storage cost, different approximations can be applied to the different terms of Equation 20 [9, 10, 23, 40] to obtain different saliency metrics. Popular approximations are presented in Figure 4. Table 4 summarizes various approximations of Equation 20 that have been used as saliency metrics.

Even though different approximations are used, the resulting saliency metrics can be very similar. Equation 24, used by Theis et al. [40], is very similar to Equation 23, used by Molchanov et al. [23].

[TABLE]

7.2.1 Consider only first order terms

A first order expansion can also be used to approximate the change in loss. The second order terms in Equation 20 can be set to zero to get the saliency metrics given by considering a first order Taylor expansion. The resulting equation has all its quantities readily available during backpropagation. The Jacobian matrix (i.e. the matrix of gradients) is computed during backpropagation. Equation 23 is used by Molchanov et al. [23] as a saliency metric.

A first order Taylor approximation is theoretically a coarser approximation of the real function than a second order Taylor expansion. However, it has the advantage of omitting computation of higher order derivatives. In practice, we can see that good saliency metrics can be derived from a first order Taylor expansion [23, 16].

7.2.2 Consider only second order terms

On the other hand, other works choose to neglect the first order term. A common assumption is that if a network has been trained to a local minimum, its first order derivatives will be very close to zero. This assumption about the network’s convergence then allows the first order term (containing the gradients) to be approximated as zero. This is a common approximation when using second order Taylor expansions [9, 10, 40, 49, 31].

7.2.3 Approximation of second order terms

Computing the estimated terms using a second order Taylor expansion can be expensive due to the cost of computing the Hessian matrix of the loss function for every parameter. To reduce the computation cost of the Hessian matrix, the terms that are not on its diagonal can be ignored [40, 9]. Considering only the diagonal terms of the Hessian reduces the number of points to be computed. The number of terms on the diagonal of the Hessian is equal to the number of terms in the Jacobian (gradients). To further reduce the computation cost of the remaining terms, one can use more approximations. The remaining Hessian terms can be estimated using a Levenberg-Marquardt approximation for each layer of the network [9], the Fisher information [40] or a Gauss-Newton approximation. The Levenberg-Marquardt approximation used by Optimal Brain Damage [9] propagates only the diagonal Hessian. Hence, its computational cost is similar to backpropagating the gradients. Optimal Brain Surgeon [10, 52] also uses the Fisher matrix as an approximation for the Hessian; however, it does not neglect the non-diagonal terms. By including the non-diagonal terms of the Hessian matrix, Optimal Brain Surgeon [10] considers the pairwise dependency between parameters.

8 Experimental setup

8.1 Saliency metrics

We selected a subset of metrics that can be derived from Table 1, excluding $alternateBackprop$ (metrics based on an alternative backpropagation) and $reconstruction$ (metrics based on reconstruction error).

The second order Taylor expansion expressed in Table 1 cannot be realistically computed exactly. The approximations seen in Figure 4 are used for the second order expansion. We choose to fix the shape of the Hessian to a diagonal matrix and use either an expensive (Levenberg-Marquardt) or a cheap (Gauss-Newton with $H_{\sigma}=1$ ) algorithm to compute the remaining terms. Neglecting the first order terms in a second order Taylor expansion being a popular approximation [9, 10, 40, 49], we test this approximation, leading to a total of 4 different pointwise saliency metrics that use the Hessian.

We test all the saliency metrics that can be derived from the 2 base inputs, 7 different pointwise saliency metrics, 5 different reduction methods and 5 different scaling factors. In practice, we obtain 308 saliency metrics with different rankings for convolution channels.

8.2 Pruning scheme

Saliency metrics are embedded into a pruning scheme to remove weights from the network. Pruning schemes can range from simply removing a fixed number of channels every iteration [4], to the use of reinforcement learning [27], evolutionary particle filters [37] and even genetic algorithms [53]. While state-of-the-art pruning schemes push the boundaries of pruning further, simple pruning schemes are still very efficient [4, 6, 11]. Simple pruning schemes heavily rely on how well the saliency metric can predict the least salient entities. Since our goal is to compare saliency metrics to each other with the least number of confounding factors, we opt for simple pruning schemes. We evaluate the performance of the saliency metrics using the pruning scheme given in Algorithm 1.

We implement channel pruning so that at every step of the pruning process, we have a dense network with fewer weights than before pruning. Using Algorithm 1, we select a channel to prune from some convolutional layer of the network at each step. Where we would remove a channel which participates in a join-type operation in directed acyclic graph (DAG) structured networks (for example, pruning a channel from one side of a skip connection in ResNet), we also remove the corresponding DAG sibling channels, so that data dependencies are satisfied and the network remains dense. The weights of the dependent channels are included in $\mathcal{TC}(\prescript{l}{}{W}_{i})$ .

8.3 Datasets

We run our experiments using three different datasets: CIFAR-10 [54], CIFAR-100 [54] and a downsampled ImageNet [55]. For ImageNet-32, the images from ILSVRC 2012 challenge are downsampled to $32\times 32$ pixels [55]. These three datasets all use $32\times 32$ RGB input images with 10, 100 and 1000 different classes respectively.

These three datasets each have their own disjoint training set, $\mathcal{I}_{train}$ , and testing set $\mathcal{I}_{test}$ . We train the CNNs on the whole training set, $\mathcal{I}_{train}$ , and measure their test accuracy using $\mathcal{I}_{test}$ .

We split $\mathcal{I}_{train}$ into two disjoint sets $\mathcal{I}_{val}$ and $\mathcal{I}_{retrain}$ . $\mathcal{I}_{val}$ is used for computing the saliency metrics for channel pruning. $\mathcal{I}_{retrain}$ is used only during the retraining phase.
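The disjoint split can be sketched on image indices. The sizes below are assumptions for illustration (50,000 is the size of the CIFAR training sets; the 5,000 images set aside for saliency computation is a hypothetical choice, not a value taken from the experiments), as are the function name and the fixed seed.

```python
import random

def split_indices(n_train, n_val, seed=0):
    # Shuffle the training indices once, then cut them into two disjoint parts.
    indices = list(range(n_train))
    random.Random(seed).shuffle(indices)
    return indices[:n_val], indices[n_val:]

# val_idx is used only for computing saliency, retrain_idx only for retraining.
val_idx, retrain_idx = split_indices(n_train=50_000, n_val=5_000)
```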

8.4 CNN models

We conduct a wide range of experiments using different networks and different datasets.

We use the CIFAR-10 [54] dataset on LeNet, CIFAR10 network, ResNet-20, NIN, and AlexNet. We use ResNet-20 [56], and NIN [57] as originally described for the CIFAR-10 dataset. We modify the first layer of LeNet-5 [58] and AlexNet [59] to process $32\times 32$ RGB images instead of their original input.

We use the CIFAR-100 [54] dataset on ResNet-20, NIN, and AlexNet. We use a downsampled ImageNet [55], ImageNet-32, on AlexNet. The downsampled ImageNet contains the same number of images and classes as the original ImageNet [60] but has each image resized to 32 by 32 pixels.

We keep the same input size of 32 by 32 pixels for all our networks. If a network is used for different datasets, the only structural change we apply to that network is modifying its last layer to classify either 10, 100 or 1000 classes.

These nine networks are trained from scratch using Caffe [61] and their test accuracies are given in Table 5.

9 Results

We begin by considering the three broad categories of saliency metrics proposed in Figure 1. A naturally occurring question is to identify the absolute best-performing method in each of our experimental scenarios, and to determine the best pruning we can obtain of each network on each dataset in experiments.

In Table 6, we show the results for the best performing saliency metric in each information category for each network. In this evaluation, we compare the number of weights pruned allowing a drop in TOP-1 accuracy of at most 5%. When this threshold is exceeded, we stop the experiment and take the last snapshot of the model which was above the threshold as the candidate pruned model. We repeat this experiment for 8 runs, and obtain the mean percentage of weights removed and a 95% confidence interval for the mean. This approach is used for all of the experimentation presented.
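One common way to obtain a mean and a 95% confidence interval from 8 runs is a Student-t interval with 7 degrees of freedom. Whether this exact procedure was used is an assumption; the sketch below only illustrates the computation, and the run values are hypothetical numbers, not results from the experiments.

```python
import math
import statistics

# Two-sided 95% critical value of the Student-t distribution with 7 degrees of freedom.
T_CRITICAL_DF7 = 2.365

def mean_and_ci95(values, t_critical=T_CRITICAL_DF7):
    # Returns the sample mean and the half-width of the confidence interval.
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return mean, t_critical * standard_error

# Hypothetical percentages of weights removed in 8 runs (illustrative only).
runs = [41.2, 39.8, 40.5, 42.1, 40.0, 41.7, 39.5, 40.9]
mean, half_width = mean_and_ci95(runs)
```

The interval is then reported as the mean plus or minus `half_width`. For a different number of runs the critical value must be changed accordingly.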

We can see that gradient-based methods are typically the best-performing in experiments. However, there is no one standout method in our experiments, but rather we see that different saliency metrics obtain the best results on different networks or with different datasets. Table 7 shows the saliency metrics corresponding to highlighted results in Table 6.

Taking AlexNet as an example, we see that on each of the three classification tasks, a different saliency metric produced the best results. On the CIFAR-10 and CIFAR-100 tasks, a gradient-based metric was ultimately most effective, while on the ImageNet task, a pure weights-based metric won out. The choice of which gradients to consider also had an effect. For CIFAR-10, a weight-oriented metric ( $X=W$ ) was most effective, while on CIFAR-100 an activation-oriented metric ( $X=A$ ) was most effective.

While these selected results show the best-performing metrics in our experiments, we performed the same evaluation for all 308 candidate saliency metrics. We cannot present data for all 308 experiments, but we summarize the trends from the data in the remainder of this section as Findings, which highlight key trends with examples as appropriate. Only a small fraction of these 308 saliency metrics have been explored in previous literature. The data in Tables 6 and 7 lead us to:

Finding 1

Gradient-based metrics typically perform better than metrics which consider only weights or activations.

Metrics which use gradients require a full forward and backward pass of the network to compute those gradients, as opposed to activation-based methods, which require only a forward pass to compute activation values, or weight-based metrics which require no computation beyond the application of the saliency metric function to the weight values stored in the model. In our experiments, gradient-based metrics yielded the best pruning or tied for the best pruning in 7 of 9 scenarios with different networks and datasets (Table 6).

9.1 Weight-Based or Activation-Based Methods (Choice of $X$ )

In Tables 8 and 9, we present the saliency achieved by each pointwise metric when $X=W$ and $X=A$ respectively. From Tables 8 and 9 we see that when using pure weight-based metrics or activation-based metrics there is not a clear-cut winner. However, when the gradients of either the weights or output points are used, it is almost always preferable to use the gradients of output points.

Finding 2

Pruning using the gradient with respect to the output points often outperforms pruning using the gradient with respect to the weights.

With channel pruning, there is a one-to-one correspondence between groups of weights and groups of output points. However, at finer granularities there is no one-to-one correspondence, but rather a one-to-many correspondence. With this mixing of information at finer granularities, we expect the gradient with respect to output points to be a less reliable signal for non-channel-oriented pruning.

9.2 Pointwise metric (choice of $f$ )

In order to compare pointwise saliency metrics, we need to fix the reduction $R$ and scaling $K$ to reasonable choices which can be expected to give good results on average. Equation 25 shows a common choice in the literature [4, 43, 26, 6, 27, 42, 16]. Here the reduction $R$ is the L1-norm, and the scaling factor $K=1$ .

[TABLE]

For constructing pointwise metrics, the use of a Taylor expansion around the loss function is very common in the literature. We tested five Taylor expansions which use different approximations. From the results in Tables 8 and 9, we observe that some approximations are often poor. In particular, neglecting the first order terms (the gradient) when using a second order Taylor expansion around the loss function is a poor approximation in most cases. The degree to which the training process has converged before pruning affects the magnitudes of gradients, meaning they may still be quite large in many cases, so assuming that they are universally close to zero can be a very coarse approximation.

Finding 3

First order terms are often not negligible in second order Taylor expansions.

The second order term in the Taylor expansion $-x\frac{d\mathcal{L}}{dx}+\frac{x^{2}}{2}\frac{{d}^{2}\mathcal{L}}{d{x}^{2}}$ corresponds to the Hessian of the loss function, which is very expensive to compute. Several approximations of the Hessian have been used in the literature. Using a Levenberg-Marquardt approximation requires a very expensive backward propagation of the second order derivatives. This is not commonly implemented in popular deep learning frameworks, because only the first derivative (i.e. the gradient) is required for training. We found that, in the majority of cases in our experiments, a Gauss-Newton approximation of the Hessian is sufficient for pruning. In Tables 8 and 9, we see that the Levenberg-Marquardt approximation is only clearly advantageous in one case, when pruning NIN on CIFAR-100.

Finding 4

The Gauss-Newton approximation of the Hessian is sufficiently accurate for pruning.

9.3 Reduction (choice of $R$ )

When pruning blocks of weights, the pointwise metrics are combined into a single value using some reduction function. Existing research on pruning often places little focus on reduction and scaling methods, and it can sometimes be difficult to identify the approach used in any given method. Nonetheless, the method by which the pointwise metric is reduced and scaled can greatly influence the quality of a saliency metric.

Finding 5

Strictly positive saliency metrics offer better pruning results.

Figure 5 presents the results of our experimentation grouped so that only the reduction and scaling methods vary. Figure 5a presents a summary for all choices of input and all pointwise metrics. We then examine the three families of metrics outlined in Figure 1 in more detail: metrics which consider static information (i.e. weights) only ( $X=W$ , Figure 5b), metrics which consider information available with only a forward pass, i.e. weights and output feature maps, ( $X=A$ , Figure 5c) and finally metrics which consider all information available from a forward and backward pass ( $X=A$ and $f(x)$ involves a gradient, Figure 5d).

As a general trend, we see that using the same pointwise saliency metric but varying the reduction or scaling methods can produce very different results. For example, we see that using the raw sum of the gradients (first bar set in each graph) typically results in poorly performing metrics versus other reduction methods. In each scenario, reducing by the sum of the absolute values of the gradients produces significantly better results (second bar set in each graph).

We observe that the gap between the simple summation and the other guaranteed-positive reduction methods is smaller in Figure 5c where we use $X=A$ , i.e. the gradient with respect to output features. In our experiments, networks containing ReLU layers are optimized such that ReLU is fused with the convolution layer, meaning the resulting output features are non-negative. This means the simple summation is also guaranteed non-negative, which improves the quality of pruning decisions significantly.
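
The effect of sign cancellation can be illustrated with a small sketch. It compares the raw sum $\sum x$ with the sum of absolute values $\sum |x|$, the two reductions listed first in Table 1. The helper names and the toy values are hypothetical, and the sketch assumes that the channel with the smallest score is pruned first.

```python
def reduce_sum(pointwise):
    """Raw sum: positive and negative pointwise values can cancel."""
    return sum(pointwise)


def reduce_abs_sum(pointwise):
    """Sum of absolute values: guaranteed non-negative."""
    return sum(abs(v) for v in pointwise)


def least_salient(channels, reduce):
    """Index of the channel with the smallest reduced score.

    Assumption: the channel with the lowest saliency is pruned first.
    """
    scores = [reduce(c) for c in channels]
    return min(range(len(scores)), key=scores.__getitem__)


# Toy pointwise values (illustrative only, not data from the paper).
channel_a = [0.9, -0.9, 0.8, -0.8]    # large magnitudes that cancel
channel_b = [0.01, 0.01, 0.01, 0.01]  # uniformly small magnitudes

pruned_by_sum = least_salient([channel_a, channel_b], reduce_sum)
pruned_by_abs = least_salient([channel_a, channel_b], reduce_abs_sum)
```

With the raw sum, the large-magnitude channel is scored as least salient because its terms cancel; with the absolute-value reduction, the small-magnitude channel is selected instead.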

9.4 Scaling (choice of $K$ )

In Figures 5b and 5c we see that two scaling methods stand out when we ignore the typically poorly-performing first bar group, which corresponds to the simple-summation reduction method. Across all four remaining guaranteed-positive reduction methods, two scaling factors trade blows for first and second place in terms of the quality of pruning decisions: $\left\lVert\prescript{l}{}{\widetilde{S}}\right\rVert_{1}$ and $n(\mathcal{TC}(\prescript{l}{}{W}_{i}))$ .

Recall from Section 3.4 that $\left\lVert\prescript{l}{}{\widetilde{S}}\right\rVert_{1}$ is the layer-wise L1 norm of saliency values, and $\mathcal{TC}(\prescript{l}{}{W}_{i})$ denotes the entire set of weights transitively removed when $\prescript{l}{}{W}_{i}$ is removed from the network. Both of these scaling factors incorporate structural information, as opposed to strictly local information about the parameters being pruned.

Finding 6

Incorporating structural information of the network in the scaling factor offers better pruning results.

In Figure 5b, we can directly compare scaling by the local ( $n(\prescript{l}{}{W}_{i})$ ) and transitive ( $n(\mathcal{TC}(\prescript{l}{}{W}_{i}))$ ) number of weights removed. These are the second-to-last and last bars in each bar set, respectively. While the improvement of using $n(\mathcal{TC}(\prescript{l}{}{W}_{i}))$ is sometimes small, it is strictly better than using only the local information in each case when considering guaranteed-positive reduction methods.
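
The difference between the two counts can be sketched for the simplest possible topology. The example assumes a plain chain of ungrouped convolution layers with no skip connections and ignores biases; networks with branches or residual connections would need a traversal of the dependency graph. The `ConvLayer` type and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ConvLayer:
    in_channels: int
    out_channels: int
    kernel_size: int  # square kernels assumed


def n_local(layers, l):
    """Weights in one output channel (one filter) of layer l."""
    layer = layers[l]
    return layer.in_channels * layer.kernel_size ** 2


def n_transitive(layers, l):
    """Weights removed when one output channel of layer l is pruned.

    Simplifying assumption: in a plain chain, the only dependent weights
    are the matching input-channel slice of every filter in layer l + 1.
    """
    count = n_local(layers, l)
    if l + 1 < len(layers):
        nxt = layers[l + 1]
        count += nxt.out_channels * nxt.kernel_size ** 2
    return count


# Toy two-layer chain (illustrative shapes only).
chain = [ConvLayer(3, 16, 3), ConvLayer(16, 32, 3)]
local_count = n_local(chain, 0)            # 3 * 3 * 3 = 27
transitive_count = n_transitive(chain, 0)  # 27 + 32 * 3 * 3 = 315
```

In this toy chain the transitive count is dominated by the consumer layer, which is information the local count cannot see.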

In Figure 5d, we look at how a good gradient-based pointwise metric can be affected by reduction and scaling. As with the non-gradient-based metrics in Figures 5b and 5c, we see that using a reduction method that is guaranteed positive leads to significantly better results. The gradient-based metric results further highlight the benefit of using non-local information. The best overall pruning result is achieved using $n(\mathcal{TC}(\prescript{l}{}{W}_{i}))$ as the scaling factor (fourth bar set in the figure), but using the local scaling factor $n(\prescript{l}{}{W}_{i})$ while keeping everything else fixed makes the quality of the metric plummet: in this case, by nearly 20 percentage points of average sparsity achieved across all networks and datasets in our experiments.

Finding 7

The number of weights transitively removed is a better scaling factor than the number of weights locally removed.

9.5 Saliency Metrics and Retraining Iterations

Retraining or fine-tuning is a crucial step in many pruning algorithms because it allows the network to adjust remaining parameters to compensate for the damage done by the removal of pruned parameters. However, when evaluating saliency metrics, retraining is a confounding factor because it can arbitrarily change weight values. In fact, the effect of retraining is so great that we can often compensate for the suboptimal choices made by poor saliency metrics with enough retraining.

Confounding factors notwithstanding, it seems intuitive that a better saliency metric should greatly reduce the effort spent on retraining to achieve a given target accuracy and sparsity. Better saliency metrics do less damage to the network to attain a given threshold minimum sparsity, and conversely also result in higher achievable sparsity ratios for a given threshold minimum accuracy. Thus, we expect the total computational cost of pruning (retraining included) to be reduced by choosing a higher-quality saliency metric.

We examine the relationship between the quality of a saliency metric and the total computational cost of pruning. As a proxy for the quality of a saliency metric we use the maximum sparsity achieved using that metric without any retraining, i.e. using Algorithm 2. The pruning scheme outlined in Algorithm 2 is similar to our previous pruning scheme, except for the omission of retraining steps.

We refer to the total cost of pruning as the total computational cost to reach a certain sparsity while maintaining a fixed accuracy target. Hence, in addition to the accuracy threshold in Algorithm 1, we add a stopping condition related to the sparsity. When this target is met we stop the experiment and record the total number of steps that were required to prune and retrain the network. The target sparsity ratio chosen for each network on each dataset was the best achievable sparsity from Table 6 minus 5%. This additional target allows us to filter out very poor saliency metrics and to compare pruned networks of similar size and accuracy.

The total cost of pruning is given by the sum of the cost of computing the saliency metric and the cost of retraining. The cost of a single retraining step is the cost of one backward and one forward pass of the network. To easily compare cost, we assume that the cost of a backward pass is twice that of a forward pass. The cost of computing a saliency metric is given in Table 11.
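
The cost model above can be written down as a short sketch. Costs are expressed in units of one forward pass, and the backward pass is taken to cost twice that, as stated. The function names are hypothetical, and the saliency cost is passed in as a single accumulated number because the per-metric values of Table 11 are not reproduced here.

```python
def retraining_step_cost(forward_cost=1.0):
    """One retraining step = one forward pass + one backward pass,
    with the backward pass assumed to cost twice the forward pass."""
    backward_cost = 2.0 * forward_cost
    return forward_cost + backward_cost


def total_pruning_cost(saliency_cost, retraining_steps, forward_cost=1.0):
    """Total cost = cost of computing the saliency metric + retraining cost.

    `saliency_cost` is the accumulated cost of all saliency evaluations,
    in the same units as `forward_cost` (an assumption of this sketch).
    """
    return saliency_cost + retraining_steps * retraining_step_cost(forward_cost)


# Illustrative numbers only; they are not measurements from the paper.
example_cost = total_pruning_cost(saliency_cost=10.0, retraining_steps=5)
```

Under these assumptions each retraining step costs three forward-pass equivalents, so the retraining term quickly dominates for metrics that need many steps.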

In Figure 6, we see the result of this experiment for AlexNet on the CIFAR-10 dataset. Each point on the graph is one saliency metric. We see that many saliency metrics can be used in Algorithm 2 to meet both the sparsity and accuracy targets. However, as predicted, poorer saliency metrics require much more retraining than good saliency metrics to reach a network of the given quality. To quantify our results in terms of the correlation of metric quality and pruning cost, we use the Spearman rank correlation. The correlation for AlexNet on CIFAR-10 (as presented in Figure 6) is $-0.9$ . A similar trend is observed (negative correlation in Table 10) for the other networks except for ResNet-20 on CIFAR-10, where we had only 3 metrics meeting the targets. In the case of AlexNet trained on CIFAR-10 (Figure 6), with a fixed sparsity and accuracy target and starting from the same initial trained network, the difference in total cost of pruning was approximately 75 $\times$ between the best and worst saliency metrics that met both the sparsity and accuracy targets.

Finding 8

Better saliency metrics greatly reduce retraining requirements in pruning.
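
For readers who want to reproduce this kind of analysis, the Spearman rank correlation can be computed as the Pearson correlation of the ranks. The sketch below uses only the standard library and assigns average ranks to ties. The function names are hypothetical and the sample values are made up for illustration; they are not the data behind Figure 6 or Table 10.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Made-up example: metric quality (sparsity without retraining) versus
# total pruning cost. A strictly decreasing relationship gives -1.
quality = [0.10, 0.25, 0.40, 0.55]
cost = [900.0, 400.0, 150.0, 60.0]
rho = spearman(quality, cost)
```

A rank-based measure is a natural fit here because cost can vary over orders of magnitude, while only the ordering of metrics matters for the claim being tested.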

10 Conclusion

Although many pruning strategies have been proposed, a common element is the saliency metric that seeks to identify unimportant parameters. We propose a taxonomy that characterizes existing saliency metrics by combining elements from four mostly-orthogonal components: (1) base input, (2) pointwise metric, (3) reduction, and (4) scaling. This allows us to identify common components among saliency metrics in the many existing pruning strategies, and to derive novel metrics by recombining those components. We experimentally evaluate 308 such metrics.

We confirm some well-known results, for example that gradient-based methods are significantly better than simpler methods based purely on the weights or output activations. But we also find new insights. For example, metrics that use the gradient with respect to outputs tend to outperform those using the gradient with respect to weights. The most successful gradient-based methods use a second-order Taylor expansion. Within this expansion, the first order term contains important information that should not be omitted, but the Gauss-Newton approximation is sufficient for the second order term. Otherwise good metrics can be easily undermined by a poor reduction, such as simply adding pointwise terms. On the other hand, there is scope for significant improvements in scaling factors containing structural information, such as our novel scaling based on transitively-pruned parameters.

The saliency metric is just one component of pruning algorithms, but it has a critical impact on the success of pruning. We anticipate that our taxonomy and evaluation will guide practitioners to the best existing saliency metrics, and direct researchers to open new frontiers in the design space.

Acknowledgement

This work was supported with the financial support of the Science Foundation Ireland grant. This work was also supported, in part, by Arm Research.

