Pre-Defined Sparse Neural Networks with Hardware Acceleration

Sourya Dey; Kuan-Wen Huang; Peter A. Beerel; Keith M. Chugg

arXiv:1812.01164·cs.LG·October 30, 2024


TL;DR

This paper introduces a method for pre-defined sparse neural networks that significantly reduces computational and storage complexity, along with a flexible FPGA-compatible hardware architecture supporting training and inference.

Contribution

It proposes a novel pre-defined sparsity approach and a flexible hardware architecture for neural network acceleration compatible with various network sizes.

Findings

1. Storage and computational complexity reduced by over 5X

2. Supports both training and inference modes

3. Compatible with various FPGA sizes


Tables

Table 1. TABLE I: Hardware Architecture Total Storage Cost Comparison for $\bm{N}_{\mathrm{net}}=(800,100,10)$: fully-connected (FC) vs sparse with $\bm{d}^{\mathrm{out}}_{\mathrm{net}}=(20,10)$, $\rho_{\mathrm{net}}=21\%$

Parameter	Expression	Count (FC)	Count (sparse)
$\bm{a}$	$\sum_{i=0}^{L-1} (2(L-i)+1) N_{i}$	4300	4300
$\dot{\bm{a}}$	$\sum_{i=1}^{L-1} (2(L-i)+1) N_{i}$	300	300
$\bm{\delta}$	$2\sum_{i=1}^{L} N_{i}$	220	220
$\bm{b}$	$\sum_{i=1}^{L} N_{i}$	110	110
$\bm{W}$	$\sum_{i=1}^{L} N_{i} d_{i}^{\mathrm{in}}$	81000	17000
TOTAL	Sum of all above	85930	21930

Table 2. TABLE II: Comparison of Pre-Defined Sparse Methods

$\bm{d}^{\mathrm{out}}_{\mathrm{net}}$	$\rho_{\mathrm{net}}$ (%)	$\bm{z}_{\mathrm{net}}$	Test accuracy: clash-free	Test accuracy: structured	Test accuracy: random
MNIST: $\bm{N}_{\mathrm{net}} = (800, 100, 100, 100, 10)$, FC test accuracy = $98 \pm 0.1$
$(80, 80, 80, 10)$	$80.2$	$(200, 25, 25, 4)$	$97.9 \pm 0.2$	$97.9 \pm 0.2$	$97.8 \pm 0.2$
$(60, 60, 60, 10)$	$60.4$	$(200, 25, 25, 4)$	$97.6 \pm 0.1$	$97.8 \pm 0.1$	$97.6 \pm 0.2$
$(40, 40, 40, 10)$	$40.6$	$(200, 25, 25, 5)$	$97.5 \pm 0.1$	$97.7$	$97.6 \pm 0.1$
$(20, 20, 20, 10)$	$20.8$	$(200, 25, 25, 10)$	$97.2 \pm 0.2$	$97.2 \pm 0.1$	$97.1 \pm 0.1$
$(10, 10, 10, 10)$	$10.9$	$(200, 25, 25, 25)$	$96.7 \pm 0.1$	$96.8 \pm 0.2$	$96.7 \pm 0.2$
$(5, 10, 10, 10)$	$6.9$	$(100, 25, 25, 25)$	$96.3 \pm 0.1$	$96.3 \pm 0.1$	$96.2 \pm 0.1$
$(2, 5, 5, 10)$	$3.6$	$(80, 25, 25, 50)$	$95 \pm 0.2$	$95.1 \pm 0.1$	$95 \pm 0.3$
$(1, 2, 2, 10)$	$2.2$	$(80, 20, 20, 100)$	$93.3 \pm 0.3$	$93.1 \pm 0.5$	$92 \pm 0.3$
Reuters: $\bm{N}_{\mathrm{net}} = (2000, 50, 50)$, FC test accuracy = $89.6 \pm 0.1$
$(25, 25)$	$50$	$(1000, 25)$	$89.4 \pm 0.1$	$89.3$	$89.4$
$(10, 10)$	$20$	$(400, 10)$	$87 \pm 0.1$	$86.7 \pm 0.1$	$86.5 \pm 0.1$
$(5, 5)$	$10$	$(200, 5)$	$78.5 \pm 0.5$	$78.2 \pm 0.7$	$77.5 \pm 0.6$
$(2, 2)$	$4$	$(80, 2)$	$53.3 \pm 1.8$	$51.2 \pm 1.7$	$46.8 \pm 2.9$
$(1, 1)$	$2$	$(40, 1)$	$28.4 \pm 2.4$	$28.7 \pm 2.3$	$28 \pm 1.9$
TIMIT: $\bm{N}_{\mathrm{net}} = (39, 390, 39)$, FC test accuracy = $43.2 \pm 0.2$
$(270, 27)$	$69.2$	$(13, 13)$	$43 \pm 0.1$	$43$	$43 \pm 0.1$
$(180, 18)$	$46.2$		$42.7 \pm 0.1$	$42.8 \pm 0.1$	$42.9 \pm 0.1$
$(90, 9)$	$23.1$		$42.1 \pm 0.1$	$42.5 \pm 0.1$	$42.4 \pm 0.1$
$(60, 6)$	$15.4$		$41.5 \pm 0.1$	$41.8 \pm 0.2$	$41.9 \pm 0.1$
$(30, 3)$	$7.7$		$40.5 \pm 0.2$	$40.1 \pm 0.2$	$39.4 \pm 0.8$
CIFAR-100: $\bm{N}_{\mathrm{net}} = (4000, 500, 100)$, FC top-5 test accuracy = $87.1 \pm 0.6$. (Footnote 10: For CIFAR-100, given values of $\bm{N}_{\mathrm{net}}$, $\bm{d}^{\mathrm{out}}_{\mathrm{net}}$, $\bm{z}_{\mathrm{net}}$ and $\rho_{\mathrm{net}}$ are just for the MLP portion, which follows a CNN as described in Sec. IV-A to form the complete net. Reported values are top-5 test accuracies obtained from training on the complete net.)
$(100, 100)$	$22$	$(2000, 250)$	$87.5 \pm 0.2$	$87.7 \pm 0.2$	$87.4 \pm 0.3$
$(29, 29)$	$6.4$	$(2000, 250)$	$86.8 \pm 0.3$	$87.2 \pm 0.5$	$87.1 \pm 0.2$
$(12, 12)$	$2.6$	$(400, 50)$	$86.3 \pm 0.2$	$86.5 \pm 0.4$	$86.6 \pm 0.4$
$(5, 5)$	$1.1$	$(400, 50)$	$85.3 \pm 0.5$	$85.5 \pm 0.5$	$85.7 \pm 0.3$
$(2, 2)$	$0.4$	$(80, 10)$	$84.1 \pm 0.5$	$84.3 \pm 0.3$	$83.8 \pm 0.3$
$(1, 1)$	$0.2$	$(80, 10)$	$83 \pm 0.5$	$83.3 \pm 0.4$	$81.7 \pm 0.7$

Table 3. TABLE III: Comparison of Clash-Free Methods for a Single Junction $i$ with $(N_{i-1},N_{i},d^{\mathrm{out}}_{i},d^{\mathrm{in}}_{i},z_{i})=(12,12,2,2,4)$

Type	Memory Dithering	$S_{M_{i}}$	Storage Cost to Compute Memory Addresses
1	No	81	$z_{i} = 4$
1	Yes	486	$2 z_{i} = 8$
2	No	6561	$z_{i} d_{i}^{out} = 8$
2	Yes	236k	$2 z_{i} d_{i}^{out} = 16$
3	No	1.68M	$N_{i - 1} d_{i}^{out} = 24$
3	Yes	60M	$(N_{i - 1} + z_{i}) d_{i}^{out} = 32$

Equations

\rho_{\mathrm{net}} = \frac{\sum_{i=1}^{L} |\bm{W}_{i}|}{\sum_{i=1}^{L} N_{i-1} N_{i}}

h_{i}^{(j)}

a_{i}^{(j)}

\dot{a}_{i}^{(j)}

\delta_{L}^{(j)}

\delta_{i}^{(j)}

b_{i}^{(j)}

W_{i}^{(j,k)}

\min_{\{\bm{W}_{i}, \bm{b}_{i}\}_{i=1}^{L}} \; l\left(\{\bm{W}_{i}, \bm{b}_{i}\}_{i=1}^{L}\right) + \lambda\, r\left(\{\bm{W}_{i}\}_{i=1}^{L}\right) + \sum_{i=1}^{L} \gamma_{i}\, p(\bm{W}_{i})

d_{i}^{\mathrm{out}} = \frac{N_{i} d_{i}^{\mathrm{in}}}{N_{i-1}}, \quad d_{i}^{\mathrm{in}} \leq N_{i-1}, \quad d_{i}^{\mathrm{out}}, d_{i}^{\mathrm{in}} \in \mathbb{N}

\left\{\rho_{i}\in(0,1] \;\bigg|\; \rho_{i}=\frac{k}{\mathrm{gcd}(N_{i-1},N_{i})},\; k\in\mathbb{N}\right\}

\rho_{1} \in \left\{\frac{1}{39}, \frac{2}{39}, \dots, \frac{39}{39}\right\}, \quad \rho_{2} \in \left\{\frac{1}{13}, \frac{2}{13}, \dots, \frac{13}{13}\right\}

d_{i+1}^{\mathrm{out}} \geq \frac{d_{i}^{\mathrm{in}}}{z_{i}} \left\lceil \frac{z_{i}}{d_{i}^{\mathrm{in}}} \right\rceil

S_{M_{i}} = D_{i}^{z_{i}}

S_{M_{i}} = D_{i}^{z_{i} d_{i}^{\mathrm{out}}}

S_{M_{i}} = (D_{i}!)^{z_{i} d_{i}^{\mathrm{out}}}

K_{i} = \left(\frac{z_{i}!}{\left(d_{i}^{\mathrm{in}}!\right)^{z_{i}/d_{i}^{\mathrm{in}}}}\right)^{d_{i}^{\mathrm{out}}}


Full text



Sourya Dey, Kuan-Wen Huang, Peter A. Beerel, and Keith M. Chugg

The authors are with the Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089 USA (e-mail: {souryade, kuanwenh, pabeerel, chugg}@usc.edu). Manuscript submitted December 3, 2018. This work is partly supported by NSF, Software and Hardware Foundations, Grant 1763747.

Abstract

Neural networks have proven to be extremely powerful tools for modern artificial intelligence applications, but computational and storage complexity remain limiting factors. This paper presents two compatible contributions towards reducing the time, energy, computational, and storage complexities associated with multilayer perceptrons. Pre-defined sparsity is proposed to reduce the complexity during both training and inference, regardless of the implementation platform. Our results show that storage and computational complexity can be reduced by factors greater than 5X without significant performance loss. The second contribution is an architecture for hardware acceleration that is compatible with pre-defined sparsity. This architecture supports both training and inference modes and is flexible in the sense that it is not tied to a specific number of neurons. For example, this flexibility implies that various sized neural networks can be supported on various sized field programmable gate arrays (FPGAs).

Index Terms:

Machine learning, Neural network, Multilayer perceptron, Sparsity, Hardware Acceleration

I Introduction

Neural networks are critical drivers of new technologies such as computer vision, speech recognition, and autonomous systems. As more data have become available, the size and complexity of neural networks (NNs) have risen sharply, with modern NNs containing millions or even billions of trainable parameters [1, 2]. These massive NNs come with the cost of large computational and storage demands. The current state of the art is to train large NNs on graphical processing units (GPUs) in the cloud – a process that can take days to weeks even on powerful GPUs [1, 2, 3] or similar programmable processors with multiply-accumulate accelerators [4]. Once trained, the model can be used for inference, which is less computationally intensive and is typically performed on more general purpose processors (*i.e.*, central processing units, or CPUs). It is increasingly desirable to run inference, and even some re-training, on embedded processors which have limited resources for computation and storage. In this regard, model reduction has been identified as a key to NN acceleration by several prominent researchers [5]. This is generally performed post-training to reduce the memory requirements to store the model for inference – *e.g.*, methods for quantization, compression, and grouping parameters [6, 7, 8, 9].

Decreasing the time, computation, storage, and energy costs for training and inference is therefore a highly relevant goal. In this paper we present two compatible methods towards this end goal: (i) a method for introducing sparsity in the connection patterns of NNs, and (ii) a flexible hardware architecture that is compatible with training and inference-only operation and supports the proposed sparse NNs. Our approach to sparsifying a NN is extremely simple and results in a large reduction in storage and computational complexity both in training and inference modes. Moreover, this method is not tied to the hardware acceleration and provides the same benefits for training and inference in software under the current paradigm. The hardware architecture is massively parallel, but not tightly coupled to a specific NN architecture (*i.e.*, not tied to the number of nodes in a layer). Instead, the architecture allows for maximum throughput for a given amount of circuit resources.

Our approach to making a NN sparse is to specify a sparse set of neuron connections prior to training and to hold this pattern fixed throughout training and inference. We refer to this method of simply excluding some fixed set of connections in the NN as pre-defined sparsity. There are several methods in the literature related to sparse NNs, but most do not reduce the computation and storage complexity associated with training, which is a primary goal of this work. One related concept is drop-out [10], where selected edges in the NN are not processed during some steps of the training process, but the final result is a fully-connected (FC) NN for inference. Another set of approaches targets producing a sparse NN for inference, but uses FC NNs during training. Among these are pruning and trimming methods that post-process the trained NN to produce a sparse NN for inference mode [11, 12, 13]. As mentioned above, other methods have been proposed for reducing the complexity of performing inference on a trained FC NN such as quantization, compression, and grouping parameters [6, 7, 8, 9]. Other research has suggested a method of learning sparsity during training that begins training a FC NN and uses a cost regularizer that promotes sparsity in the trained model [14]. Note that none of these methods substantially reduces the complexity of training; they instead target inference models that have lower complexity. One method aimed at reducing both training and inference complexity is using NNs with structured, but not sparse, weight matrices [15, 16]. Finally, we note that several authors have very recently proposed pre-defined sparse NNs [17, 18, 19] independently of our published work [20, 21, 22].

Motivated by the fact that specialized hardware is typically faster and more energy efficient than GPUs and CPUs, there exists a large body of literature in NN hardware acceleration. The vast majority of this addresses only inference given a trained model [23, 9, 24, 25, 26], with few addressing hardware accelerated training [27]. The work of [27], for example, targets a specific size NN – *i.e.*, the logic and memory architecture is tied to the number of neurons in a layer.

We propose an architecture that supports training, but can be simplified for inference-only mode, and is flexible to the NN size. This is particularly attractive for FPGA implementations. Specifically, the proposed architecture produces the maximum throughput on a given FPGA for a given NN and can therefore support various sized NNs on various sized FPGAs. This is accomplished by an edge-based processing architecture that can process $z$ edges in a given layer in parallel (*i.e.*, we refer to $z$ as the degree of parallelism). A given FPGA can support some largest value of $z$, and NNs with more edges will simply take more clock cycles to process. (Footnote 1: We use the terms ‘connection’ and ‘edge’ interchangeably, as we do with ‘node’ and ‘neuron’. Also, the term ‘cycle’ will mean ‘clock cycle’, unless otherwise stated.)

Our edge-based architecture is inspired by architectures proposed for iterative decoding of modern sparse-graph-based error correction codes (*i.e.*, Turbo and low density parity check (LDPC) codes) (cf., [28, 29]). In particular, for a given processing task, there are $z$ logic units to perform the task and $z$ memories to store the quantities associated with the task. A challenge with this architecture, shared between the decoding and NN applications, is that, in order to achieve high throughput without memory duplication, the parallel memories must be accessed in two manners: natural order and interleaved order. In natural order, each computation unit is associated with one memory and accesses the elements of that memory sequentially. For interleaved order access, the $z$ computational units must access the memories such that no memory is accessed more than once in a cycle. Such an addressing pattern is called clash-free, and this property ensures that no memory contention occurs so that no stalls or wait states are required. For modern codes, the clash-free property of the memories is ensured by defining clash-free interleavers (*i.e.*, permutations) [30], or clash-free parity check matrices [29]. In the context of NNs, this clash-free property is tied to the connection patterns between layers of neurons.

In addition to $z$ degrees of parallelism in edge processing in a given layer, our architecture is pipelined across layers. Thus, there is a degree of parallelism associated with each layer (*i.e.*, $z_{i}$ for layer $i$) selected to set the number of cycles required to process a layer to a constant – *i.e.*, larger layers have larger $z$ so that the computation time of all layers is the same. For an $(L+1)$-layer NN there are $L$ pipeline stages so that a given NN input is processed in the time it takes to complete the processing of the edges in a single layer. Furthermore, the three operations associated with training – feedforward (FF), backpropagation (BP), and update of trainable parameters (UP) – are performed in parallel. The architecture may be simplified to perform only inference by eliminating the logic and memory associated with BP and UP. Furthermore, while the architecture supports the reduced sparse complexity NNs, it is also compatible with traditional FC networks. Interestingly, very recent work proposed pipelining across layers for an inference-only accelerator [31], as well as a scalable edge-based architecture for training [32] independently of our published work [20, 21]. Neither of these other recent works, however, takes advantage of pre-defined sparsity in the network.

In Section II we provide motivation for and simple examples of the effectiveness of pre-defined sparsity. In Section III the hardware architecture is described in detail, including defining a class of simple clash-free connection patterns with low address generation complexity. Section IV contains a detailed simulation study of pre-defined sparsity in NNs based on four different classification datasets – MNIST handwritten digits [33], Reuters news articles [34], TIMIT speech corpus [35], and CIFAR-100 images [36]. We identify a set of trends or design guidelines in this section as well. This section also demonstrates that the simple, hardware-compatible clash-free connection patterns provide performance on par with or better than that of randomly connected sparse patterns. Finally, in Section V we consider the issue of whether pre-defining the structured sparse patterns causes a significant performance loss relative to other sparse methods having a similar number of parameters. We find that there is no significant performance degradation and therefore our hardware architecture can provide training and inference performance commensurate with state-of-the-art sparsity methods.

II Structured Pre-Defined Sparsity

II-A Definitions, Notation, and Background

An $(L+1)$-layer multilayer perceptron (MLP) has $N_{i}$ nodes in the $i^{\mathrm{th}}$ layer, described collectively by the neuronal configuration $\bm{N}_{\mathrm{net}}=\left(N_{0},N_{1},\cdots,N_{L}\right)$, where layer 0 is the input layer. We use the convention that layer $i$ is to the ‘right’ of layer $i-1$. There are $L$ junctions between layers, with junction $i$ connecting the $N_{i-1}$ nodes of its left layer $i-1$ with the $N_{i}$ nodes of its right layer $i$.

We define pre-defined sparsity as simply not having all $N_{i-1}N_{i}$ edges present in junction $i$. Furthermore, we define structured pre-defined sparsity so that for a given junction $i$, each node in its left layer has fixed out-degree – *i.e.*, $d^{\mathrm{out}}_{i}$ connections to its right layer, and each node in its right layer has fixed in-degree – *i.e.*, $d^{\mathrm{in}}_{i}$ connections from its left layer. FC NNs have $d^{\mathrm{out}}_{i}=N_{i}$ and $d^{\mathrm{in}}_{i}=N_{i-1}$ with $N_{i-1}N_{i}$ edges in the $i^{\mathrm{th}}$ junction, while a sparse NN has at least one junction with fewer than this number of edges. The number of edges (or weights) in junction $i$ is given by $|\bm{W}_{i}|=N_{i-1}d^{\mathrm{out}}_{i}=N_{i}d^{\mathrm{in}}_{i}$. The density of junction $i$ is measured relative to FC and denoted as $\rho_{i}=|\bm{W}_{i}|/(N_{i-1}N_{i})$. The structured constraint implies that the number of possible $\rho_{i}$ values is equal to the greatest common divisor (gcd) of $N_{i-1}$ and $N_{i}$, as shown in Appendix A. The overall density is

$$\rho_{\mathrm{net}}=\frac{\sum_{i=1}^{L}|\bm{W}_{i}|}{\sum_{i=1}^{L}N_{i-1}N_{i}}.$$

Thus, specifying $\bm{N}_{\mathrm{net}}$ and the out-degree configuration $\bm{d}^{\mathrm{out}}_{\mathrm{net}}=(d^{\mathrm{out}}_{1},\cdots,d^{\mathrm{out}}_{L})$ determines the density of each junction and the overall density.
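As an illustrative check of these definitions, the following minimal Python sketch computes the per-junction edge counts and the overall density, and enumerates the integer degree pairs allowed by the structured constraint. The function names are hypothetical and chosen only for this example; the configuration is the one used in Table I.

```python
from math import gcd

def junction_edges(n_net, d_out):
    """Edges per junction: |W_i| = N_{i-1} * d_i^out (index 0 is junction 1)."""
    return [n_net[i] * d_out[i] for i in range(len(d_out))]

def overall_density(n_net, d_out):
    """rho_net = sum_i |W_i| / sum_i N_{i-1} N_i."""
    edges = sum(junction_edges(n_net, d_out))
    fc_edges = sum(n_net[i] * n_net[i + 1] for i in range(len(d_out)))
    return edges / fc_edges

def valid_degrees(n_left, n_right):
    """All (d_out, d_in) pairs with natural-number degrees for one junction."""
    pairs = []
    for d_in in range(1, n_left + 1):
        if (n_right * d_in) % n_left == 0:   # d_out = N_i d_in / N_{i-1} must be an integer
            pairs.append((n_right * d_in // n_left, d_in))
    return pairs

n_net = (800, 100, 10)   # configuration from Table I
d_out = (20, 10)
print(junction_edges(n_net, d_out))
print(overall_density(n_net, d_out))
print(len(valid_degrees(800, 100)), gcd(800, 100))
```

For this configuration the edge counts are 16000 and 1000, giving an overall density of 17000/81000, or about 21%, and the number of admissible degree pairs in each junction matches the gcd of its two layer sizes.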

We will also consider random pre-defined sparsity, where connections are distributed randomly given preset $\rho_{i}$ values without constraints on in- and out-degrees. In Sec. IV-B we show that random pre-defined sparsity is undesirable at low densities because it may result in unconnected neurons.
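To see why random patterns can leave neurons unconnected at low density, the sketch below compares a random pattern with a simple fixed-degree pattern. Both generators are hypothetical illustrations written for this example; in particular, `structured_pattern` is just one easy way to obtain fixed in- and out-degrees and is not the clash-free construction described later in the paper.

```python
import random

def random_pattern(n_left, n_right, n_edges, seed=0):
    """Pick n_edges distinct (left, right) pairs uniformly at random."""
    rng = random.Random(seed)
    all_pairs = [(l, r) for l in range(n_left) for r in range(n_right)]
    return rng.sample(all_pairs, n_edges)

def structured_pattern(n_left, n_right, d_out):
    """Illustrative fixed-degree pattern (an assumption, not the paper's):
    edge e joins left neuron e % n_left to right neuron e // d_in."""
    n_edges = n_left * d_out
    d_in = n_edges // n_right
    if d_in * n_right != n_edges or d_in > n_left:
        raise ValueError("degrees are not integers for these layer sizes")
    return [(e % n_left, e // d_in) for e in range(n_edges)]

def unconnected(pattern, n_left, n_right):
    """Count left and right neurons that touch no edge."""
    lefts = {l for l, _ in pattern}
    rights = {r for _, r in pattern}
    return n_left - len(lefts), n_right - len(rights)

# A 100-by-100 junction at 1% density (100 edges out of 10000).
print(unconnected(random_pattern(100, 100, 100), 100, 100))
print(unconnected(structured_pattern(100, 100, 1), 100, 100))
```

The structured pattern connects every neuron by construction, whereas the random pattern at this density is overwhelmingly likely to leave some neurons on both sides without any edge.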

The standard equations for FC NNs are well-known [37]. For a NN using structured pre-defined sparsity, only the weights corresponding to connected edges are stored in memory and used in computation. This leads to the modified equations (2a)–(7a), where subscripts denote layer/junction numbers, single superscripts denote neurons in a layer, and double superscripts denote (right neuron, left neuron) in a junction. The FF processing proceeds left-to-right and computes the activations $\bm{a}_{i}$ and associated derivatives $\dot{\bm{a}}_{i}$ for each layer by applying an activation function $\text{act}(\cdot)$ to a linear combination of biases $\bm{b}_{i}$, junction weights $\bm{W}_{i}$ and preceding layer activations $\bm{a}_{i-1}$

[TABLE]

Note that (4a) is used in training, but is not required in inference mode. The BP computation is done only in training and computes a sequence of error values from right-to-left

[TABLE]

where $l^{(j)}\left(a_{L}^{(j)},y^{(j)}\right)$ is the $j^{\mathrm{th}}$ component of the loss function. Finally, stochastic gradient UP is given by

[TABLE]

where $\eta$ is the learning rate. The parameters on the left-hand side of (2a)–(7a) will be referred to as the network parameters, with the weights and biases being the trainable parameters.
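The three operations can be sketched in plain Python for a sparse MLP that stores only connected edges. This is a minimal illustration under stated assumptions rather than the paper's exact equations: the sigmoid activation, the squared-error loss, and the names `act` and `train_step` are all choices made for this example. Each junction is a dictionary keyed by (right neuron, left neuron), following the superscript convention above.

```python
import math

def act(h):
    """Sigmoid activation (an assumption; the text leaves act() generic)."""
    return 1.0 / (1.0 + math.exp(-h))

def train_step(junctions, biases, x, y, eta=0.1):
    """One FF/BP/UP pass; junctions[i-1] maps (right j, left k) -> weight."""
    L = len(junctions)
    a, a_dot = [list(x)], [None]
    # FF: left to right, using only the stored (connected) weights.
    for W, b in zip(junctions, biases):
        h = list(b)
        for (j, k), w in W.items():
            h[j] += w * a[-1][k]
        a.append([act(v) for v in h])
        a_dot.append([v * (1.0 - v) for v in a[-1]])   # sigmoid derivative
    # BP: right to left; squared-error loss assumed at the output.
    delta = [None] * (L + 1)
    delta[L] = [(a[L][j] - y[j]) * a_dot[L][j] for j in range(len(y))]
    for i in range(L - 1, 0, -1):
        s = [0.0] * len(a[i])
        for (j, k), w in junctions[i].items():         # junction i+1
            s[k] += w * delta[i + 1][j]
        delta[i] = [a_dot[i][k] * s[k] for k in range(len(s))]
    # UP: stochastic gradient step on connected weights and on biases.
    for i in range(1, L + 1):
        W, b = junctions[i - 1], biases[i - 1]
        for (j, k) in W:
            W[(j, k)] -= eta * delta[i][j] * a[i - 1][k]
        for j in range(len(b)):
            b[j] -= eta * delta[i][j]
    return 0.5 * sum((a[L][j] - y[j]) ** 2 for j in range(len(y)))

# Toy net with N_net = (4, 2, 2) and in-degrees (2, 1).
junctions = [{(0, 0): 0.5, (0, 1): 0.5, (1, 2): 0.5, (1, 3): 0.5},
             {(0, 0): 0.5, (1, 1): 0.5}]
biases = [[0.0, 0.0], [0.0, 0.0]]
print(train_step(junctions, biases, [1, 0, 1, 0], [1, 0]))
```

Because absent edges are never stored, both the memory footprint and the number of multiply-accumulate operations in each pass scale with the number of connected edges rather than with $N_{i-1}N_{i}$.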

II-B Motivation and Preliminary Examples

Pre-defined sparsity can be motivated by inspecting the histogram for trained weights in a FC NN. There have been previous efforts to study such statistics [3, 38], however, not for individual junctions. Fig. 1 shows weight histograms for each junction in both a 2-junction and 4-junction FC NN trained on the MNIST dataset. Note that many of the weights are zero or near-zero after training, especially in the earlier junctions. This motivates the idea that some weights in these layers could be set to zero (*i.e.*, the edges excluded). Even with this intuition, it is unclear that one can pre-define a set of weights to be zero and let the NN learn around this pre-defined sparsity constraint. Fig. 1(c) and (h) show that, in fact, this is the case – *i.e.*, this shows classification accuracy as a function of the overall density $\rho_{\mathrm{net}}$ for structured pre-defined sparsity. Since the computational and storage complexity is directly proportional to the number of edges in the NN, operating at an overall density of, for example, 50% results in a 2X reduction in complexity both during training and inference. Detailed numerical experiments in Section IV build on these simple examples. However, before we proceed to those results, it is important to consider a hardware architecture that can support structured pre-defined sparsity and consider the additional clash-free constraints placed on the connection patterns so that these can be considered in the studies in Section IV.

III Hardware Architecture

In this section we describe the proposed flexible hardware architecture outlined in the Introduction. The overall architectural view is captured by Fig. 2: sub-figure (a) shows parallel edge processing within a junction with degree of parallelism 3, (b) shows clash-free memory access, and (c) shows junction pipelining and parallel processing of the three operations – FF, BP, UP. The toy example in Fig. 2(a)-(b) is for $N_{i-1}=6$, $N_{i}=3$, $\rho_{i}=6/18=1/3$, and $z_{i}=3$. Fig. 2(a) shows that the $z_{i}=3$ blue edges are processed in parallel in one cycle, while the pink edges are processed in parallel during the next cycle. Fig. 2(b) shows how the $z_{i}=3$ FF processing logic units access the memories in natural and interleaved order. As described in detail in Sec. III-B, the interleaved order access may represent reading of the activations $\{a_{i-1}^{(j)}\}$ for $j\in\{0,1,5\}$ and the natural order access may correspond to writing the computed activations $\{a_{i}^{(j)}\}$ for $j\in\{0,1,2\}$. On the next cycle, the remaining memory locations (*i.e.*, the white cells) will be accessed. Note that this illustrates a clash-free connection pattern since each of the $z_{i}=3$ memories is accessed no more than once in each cycle – *i.e.*, one hit per column on each access.
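The clash-free condition for one cycle can be expressed as a short check. The sketch below assumes that element $j$ of a layer is stored in memory $j \bmod z_{i}$, which is consistent with the access sets quoted for Fig. 2(b) but is an assumption of this example; the function names are hypothetical.

```python
def memory_of(index, z):
    """Assumed layout: element `index` lives in memory index % z."""
    return index % z

def is_clash_free(indices, z):
    """True if the accesses of one cycle hit every memory at most once."""
    hit = [memory_of(j, z) for j in indices]
    return len(set(hit)) == len(hit)

z = 3
print(is_clash_free([0, 1, 5], z))   # first cycle of the toy example
print(is_clash_free([2, 3, 4], z))   # remaining locations on the next cycle
print(is_clash_free([0, 3, 1], z))   # 0 and 3 share a memory under this layout
```

Under this layout the two access sets of the toy example each touch all three memories exactly once, while a set such as {0, 3, 1} would need two reads from the same memory in one cycle.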

The junction-based operation in Fig. 2(b) is repeated for each junction in a pipeline. In particular, there are $L$ pipeline stages. For example, for the FF pipeline, while the first stage is processing input vector $n+L$ on junction 1, the second stage is processing input vector $n+L-1$ on junction 2. The degree of parallelism for each junction is selected so that the processing time for any operation (FF/BP/UP) is the same for each junction. Thus the throughput, *i.e.*, the frequency of processing input samples, is determined by the time taken to perform a single operation in a single junction.

In summary, the architecture is (i) edge-based and not tied to a specific number of nodes in a layer, (ii) flexible in that the amount of logic is determined by the degree of parallelism which trades size for speed, and (iii) fully pipelined for the parallel operations associated with NN training. Also note that the architecture can be specialized to perform only inference by removing the logic and memory associated with the BP and UP operations, and the $\dot{\bm{a}}_{i}$ computation in (4a).

A key concern when implementing NNs on hardware is the large amount of storage required. Several characteristics regarding memory requirements guided us in developing the proposed architecture. Firstly, since weight memories are the largest, their number should be minimized. Secondly, having a few deep memories is more efficient in terms of power and area than having many shallow memories [39]. Thirdly, throughput should be maximized without duplicating memories, hence the need for clash-free connection patterns.

In Sec. III-A, we describe the junction pipelining design, which attempts to minimize weight storage resources. The memory organization within a junction is described in Sec. III-B, and is designed to minimize the number of memories for a given degree of parallelism. Finally, clash-free access conditions are developed in Sec. III-B and III-C, and a simple method for implementing such patterns is given in Sec. III-C.

III-A Junction pipelining and Operational parallelism

Our edge-based architecture is motivated by the fact that all three operations – FF, BP, UP – use the same weight values for computation. Since $z_{i}$ edges are processed in parallel in a single cycle, the time taken to complete an operation in junction $i$ is $C_{i}=\left|\bm{W}_{i}\right|/z_{i}$ cycles. The degree of parallelism configuration $\bm{z}_{\mathrm{net}}=\left(z_{1},\cdots,z_{L}\right)$ is chosen to achieve $C_{i}=C\quad\forall\,i\in\{1,\cdots,L\}$. This allows efficient junction pipelining since each operation takes exactly $C$ cycles to be completed for each input in each junction, which we refer to as a junction cycle. (Footnote 2: During hardware implementation, a few extra cycles may be needed to flush the pipeline so that $C_{i}=\left|\bm{W}_{i}\right|/z_{i}+c_{i}$. These are also balanced, *i.e.*, $c_{i}=c\quad\forall\,i\in\{1,\cdots,L\}$, to achieve efficient pipelining. In our initial implementation [40], for example, $c=2$ and the junction cycle is $C=34$.) This determines throughput.
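The balancing condition $C_{i}=C$ can be illustrated with a small helper that derives $\bm{z}_{\mathrm{net}}$ from a target junction cycle. This is a sketch only: the function names are hypothetical, and the target value $C=40$ below is picked for the example and is not taken from the paper.

```python
def balanced_z(n_net, d_out, cycles):
    """z_i = |W_i| / C for every junction, provided C divides every |W_i|."""
    z_net = []
    for i, d in enumerate(d_out):
        edges = n_net[i] * d                      # |W_i| = N_{i-1} d_i^out
        if edges % cycles:
            raise ValueError(f"C={cycles} does not divide |W_{i + 1}|={edges}")
        z_net.append(edges // cycles)
    return z_net

def junction_cycle(n_net, d_out, z_net, flush=0):
    """Common cycle count |W_i|/z_i + c, or None if the junctions are unbalanced."""
    counts = set()
    for i, z in enumerate(z_net):
        edges = n_net[i] * d_out[i]
        if edges % z:
            return None
        counts.add(edges // z + flush)
    return counts.pop() if len(counts) == 1 else None

n_net, d_out = (800, 100, 10), (20, 10)   # Table I configuration
z_net = balanced_z(n_net, d_out, cycles=40)
print(z_net)
print(junction_cycle(n_net, d_out, z_net, flush=2))
```

With 16000 and 1000 edges in the two junctions, a junction cycle of 40 corresponds to $\bm{z}_{\mathrm{net}}=(400,25)$, so the larger junction simply receives proportionally more parallel edge processors.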

The following is an analysis of Fig. 2(c) in more detail for an example NN with $L=2$. While a new training input numbered $n+3$ is getting loaded as $\bm{a}_{0}$, junction 1 is processing the FF stage for the previous input $n+2$ and computing $\bm{a}_{1}$. Simultaneously, junction 2 is processing FF and computing cost $\bm{\delta}_{L}$ via cost derivatives for input $n+1$. It is also doing BP on input $n$ to compute $\bm{\delta}_{1}$, as well as updating (UP) its parameters from the finished $\bm{\delta}_{L}$ computation of input $n$. Simultaneously, junction 1 is performing UP using $\bm{\delta}_{1}$ from the finished BP results of input $n-1$. (Footnote 3: Note that BP does not occur in the first junction because there are no $\bm{\delta}_{0}$ values to be computed.) This results in operational parallelism in each junction, as shown in Fig. 3. The combined speedup is approximately a factor of $3L$ as compared to doing one operation at a time for a single input.
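The schedule described for $L=2$ can be written as a small lookup. The closed-form offsets below are a generalization inferred from that single example and should be read as an assumption of this sketch, not as a statement from the paper; the function names are hypothetical.

```python
def schedule(t, L):
    """Input index handled by each (operation, junction) while input t is loaded.

    Assumed generalization of the L = 2 example: FF in junction i works on
    input t - i, while BP and UP in junction i work on input t - 2L - 1 + i.
    """
    ops = {}
    for i in range(1, L + 1):
        ops[("FF", i)] = t - i
        ops[("UP", i)] = t - 2 * L - 1 + i
        if i >= 2:                       # no BP in junction 1
            ops[("BP", i)] = t - 2 * L - 1 + i
    return ops

def activation_banks(i, L):
    """Queued banks for layer-i activations: 2(L - i) + 1, as in Table I."""
    return 2 * (L - i) + 1

n = 10
for key, value in sorted(schedule(n + 3, L=2).items()):
    print(key, value)
print(activation_banks(0, 2), activation_banks(1, 2))
```

Under this reading, $3L-1$ operations run concurrently, which is in line with the approximate $3L$ speedup, and the span between the newest and oldest input touching a layer's activations gives the bank count used in the storage analysis.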

Notice from Fig. 3 that there is only one weight memory bank which is accessed for all three operations. However, UP in junction $1$ needs access to $\bm{a}_{0}$ for input $n-1$, as per the weight update equation (8a). This means that there need to be $2L+1=5$ left activation memory banks for storing $\bm{a}_{0}$ for inputs $n-1$ to $n+3$, *i.e.*, a queue-like structure. Similarly, UP in junction 2 will need $2(L-1)+1=3$ queued banks for each of its left activation $\bm{a}_{1}$ and its derivative $\dot{\bm{a}}_{1}$ memories – for inputs from $n$ (for which values will be read) to $n+2$ (for which values are being computed and written). There also need to be 2 banks for all $\bm{\delta}$ memories – 1 for reading and the other for writing. Thus junction pipelining requires multiple memory banks, but only for layer parameters $\bm{a}$, $\dot{\bm{a}}$ and $\bm{\delta}$, not for weights. (Footnote 4: This is achieved by making the weight memory dual-port, while $\bm{a}$ and $\dot{\bm{a}}$ are single-ported memories. The $\bm{\delta}$ memories are also dual-ported due to the exact manner in which we implemented this architecture on FPGA; refer to [40] for full details.) The number of layer parameters is insignificant compared to the number of weights for practical networks. This is why pre-defined sparsity leads to significant storage savings, as quantified in Table I for the circled FC point vs the $\rho_{\mathrm{net}}=21\%$ point from Fig. 1(c). Specifically, memory requirements are reduced by 3.9X in this case. Furthermore, the computational complexity, which is proportional to the number of weights for a MLP, is reduced by 4.8X. For this example, these complexity reductions come at a cost of degrading the classification accuracy from $98.0\%$ to $97.2\%$.
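The storage figures quoted above can be reproduced directly from the expressions in Table I. The following sketch evaluates those expressions for the FC and sparse configurations; the function name `storage_counts` is hypothetical.

```python
def storage_counts(n_net, d_out):
    """Evaluate the Table I storage expressions for a given configuration."""
    L = len(n_net) - 1
    # In-degrees follow from |W_i| = N_{i-1} d_i^out = N_i d_i^in.
    d_in = [n_net[i - 1] * d_out[i - 1] // n_net[i] for i in range(1, L + 1)]
    counts = {
        "a": sum((2 * (L - i) + 1) * n_net[i] for i in range(0, L)),
        "a_dot": sum((2 * (L - i) + 1) * n_net[i] for i in range(1, L)),
        "delta": 2 * sum(n_net[1:]),
        "b": sum(n_net[1:]),
        "W": sum(n_net[i] * d_in[i - 1] for i in range(1, L + 1)),
    }
    counts["total"] = sum(counts.values())
    return counts

n_net = (800, 100, 10)
fc = storage_counts(n_net, d_out=(100, 10))      # fully connected
sparse = storage_counts(n_net, d_out=(20, 10))   # rho_net of about 21%
print(fc["total"], sparse["total"])
print(fc["total"] / sparse["total"])             # storage reduction
print(fc["W"] / sparse["W"])                     # computational reduction
```

The totals are 85930 and 21930, a ratio of about 3.9, while the weight counts 81000 and 17000 give a ratio of about 4.8; the gap between the two ratios is the fixed cost of the queued layer parameters.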

III-B Memory organization

For the purposes of memory organization, edges are numbered sequentially from top to bottom on the right side of the junction. Other network parameters such as $\bm{a}$ , $\dot{\bm{a}}$ and $\bm{\delta}$ are numbered according to the neuron numbers in their respective layer. Consider Fig. 4 as an example, where junction $i$ is flanked by $N_{i-1}=12$ left neurons with $d^{\mathrm{out}}_{i}=2$ and $N_{i}=8$ right neurons, leading to $\left|\bm{W}_{i}\right|=24$ and $d^{\mathrm{in}}_{i}=3$ . The three weights connecting to right neuron 0 are numbered 0, 1, 2; the next three connecting to right neuron 1 are numbered 3, 4, 5, and so on. A particular right neuron connects to some subset of left neurons of cardinality $d^{\mathrm{in}}_{i}$ .

Each type of network parameter is stored in a bank of memories. The example in Fig. 4 uses $z_{i}=4$, *i.e.*, 4 weights are accessed per cycle. We designed the weight memory bank to have the minimum number of memories to prevent clashes, *i.e.*, $z_{i}$, and their depth equals $C_{i}$. Weight memories are read in natural order – 1 row per cycle (shown in same color).

Right neurons are processed sequentially due to the weight numbering. The number of right neuron parameters of a particular type needing to be accessed in a cycle is upper bounded by $\left\lceil z_{i}/d^{\mathrm{in}}_{i}\right\rceil$, which leads to $z_{i+1}\geq\left\lceil z_{i}/d^{\mathrm{in}}_{i}\right\rceil$ in order to prevent clashes in the right memory bank. (Footnote 5: This does not limit most practical designs; see Appendix B.) For FF in Fig. 4 for example, cycles 0 and 1 finish computation of $a_{i}^{(0)}$ and $a_{i}^{(1)}$ respectively, while cycle 2 finishes computing both $a_{i}^{(2)}$ and $a_{i}^{(3)}$. For BP or UP, everything remains the same except for the right memory accesses. Now $\delta_{i}^{(0)}$ and $\delta_{i}^{(1)}$ are used in cycle 0, $\delta_{i}^{(1)}$ and $\delta_{i}^{(2)}$ in cycle 1, and $\delta_{i}^{(2)}$ and $\delta_{i}^{(3)}$ in cycle 2. Thus the maximum number of right neuron parameters ever accessed in a cycle is $\left\lceil z_{i}/d^{\mathrm{in}}_{i}\right\rceil=2$.

Since edges are interleaved on the left, in general, the $z_{i}$ edge processing logic units will need access to $z_{i}$ parameters of a particular type from layer $i-1$. So all the left memory banks have $z_{i}$ memories, each of depth $D_{i}=N_{i-1}/z_{i}$, which are accessed in interleaved order. For example, after $D_{i}$ cycles, $N_{i-1}$ edges have been processed – *i.e.*, $\left(D_{i}\times z_{i}\right)=N_{i-1}$. We require that each of these edges be connected to a different left neuron to eliminate the possibility of duplicate edges. This completes a sweep, *i.e.*, one complete access of the left memory bank. Since each left neuron connects to $d^{\mathrm{out}}_{i}$ edges, $d^{\mathrm{out}}_{i}$ sweeps are required to process all the edges, *i.e.*, each left activation is read $d^{\mathrm{out}}_{i}$ times in the whole junction cycle. The reader can verify that $D_{i}$ cycles multiplied by $d^{\mathrm{out}}_{i}$ sweeps results in $C_{i}$ total cycles, *i.e.*, one junction cycle.
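The quantities used in this subsection follow from $N_{i-1}$, $N_{i}$, $d^{\mathrm{out}}_{i}$ and $z_{i}$ alone. The sketch below recomputes them for the Fig. 4 junction and lists which right neurons are touched in each cycle. It assumes that $z_{i}$ divides both $\left|\bm{W}_{i}\right|$ and $N_{i-1}$, and the function names are hypothetical.

```python
from math import ceil


def junction_schedule(n_left, n_right, d_out, z):
    """Derive the memory-organization quantities for one junction."""
    num_weights = n_left * d_out              # |W_i|
    assert num_weights % n_right == 0, "in-degree must be an integer"
    assert num_weights % z == 0 and n_left % z == 0, "z must divide |W_i| and N_(i-1)"
    d_in = num_weights // n_right             # in-degree of each right neuron
    return {
        "num_weights": num_weights,
        "d_in": d_in,
        "junction_cycle": num_weights // z,   # C_i, also the weight memory depth
        "left_depth": n_left // z,            # D_i, depth of each left memory
        "sweeps": d_out,                      # passes over the left memory bank
        "max_right_per_cycle": ceil(z / d_in),
    }


def right_neurons_per_cycle(d_in, z, num_cycles):
    """Right neurons touched in each cycle when edges are numbered on the right."""
    return [sorted({edge // d_in for edge in range(c * z, (c + 1) * z)})
            for c in range(num_cycles)]


fig4 = junction_schedule(n_left=12, n_right=8, d_out=2, z=4)
print(fig4)
print(right_neurons_per_cycle(fig4["d_in"], z=4, num_cycles=3))
```

For the Fig. 4 values, the per-cycle lists come out as neurons (0, 1), (1, 2) and (2, 3), which matches the $\bm{\delta}$ accesses described for BP and UP.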

III-C Clash-free connection patterns

We define a clash as attempting to perform a particular operation more than once on the same memory at the same time, which would stall processing. (Footnote 6: For single-ported memories, attempting two reads or two writes or a read and a write in the same cycle is a clash. For simple dual-ported memories with one port exclusively for reading and the other exclusively for writing, a read and a write can be performed in the same cycle. Attempting to perform two reads or two writes in the same cycle is a clash.) The idea of clash-freedom is to pre-define a pattern of connections and $z$ values such that no operation in any junction of the NN results in a clash. Sec. III-B described how $z$ values should be designed to prevent clashes in the weight and right memory banks.

This subsection analyzes the left memory banks, which are accessed in interleaved order. Their memory access pattern should be designed so as to prevent clashes. Additionally, the following properties are desired for practical clash-free patterns. Firstly, it should be easy to find a pattern that gives good performance. Secondly, the logic and storage required to generate the left memory addresses should be low complexity.

We generate clash-free patterns by initially specifying the left memory addresses to be accessed in cycle 0 using a seed vector $\bm{\phi}_{i}\in\{0,1,\cdots,D_{i}-1\}^{z_{i}}$. Subsequent addresses are cyclically generated. Considering Fig. 4 as an example, $\bm{\phi}_{i}=(1,0,2,2)$. Thus in cycle 0, we access addresses $(1,0,2,2)$ from memories $(M0,M1,M2,M3)$, *i.e.*, left neurons $(4,1,10,11)$. In cycle 1, the accessed addresses are $(2,1,0,0)$, and so on. Since $D_{i}=3$, cycles 3–5 access the same left neurons as cycles 0–2.

We found that this technique results in a large number of possible connection patterns, as discussed in Appendix C. Randomly sampling from this set results in performance comparable with non-clash-free NNs, as shown in Sec. IV-B. Finally, our approach only requires storing $\bm{\phi}_{i}$ and using $z_{i}$ incrementers to generate subsequent addresses. This approach is similar to methods used in modern coding to allow parallel processing and memory accesses, cf. [28, 30, 29]. Other techniques to generate clash-free connection patterns are discussed in Appendix C.
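The seed-vector scheme can be sketched in a few lines. The code below assumes that memory $j$ stores left neuron $a\,z_{i}+j$ at address $a$, a layout inferred from the Fig. 4 example rather than stated as a general rule. The function name is hypothetical.

```python
def left_neurons_accessed(phi, depth, num_cycles):
    """Left neurons read in each cycle, given a seed vector phi.

    Assumed layout: address a in memory j holds left neuron a * z + j,
    where z = len(phi) is the number of left memories.
    """
    z = len(phi)
    schedule = []
    for cycle in range(num_cycles):
        # One incrementer per memory: each address advances by 1 modulo the depth.
        addresses = [(start + cycle) % depth for start in phi]
        schedule.append([addr * z + j for j, addr in enumerate(addresses)])
    return schedule


phi = (1, 0, 2, 2)
schedule = left_neurons_accessed(phi, depth=3, num_cycles=6)
for cycle, neurons in enumerate(schedule):
    print(cycle, neurons)

# Each sweep of D_i = 3 cycles touches every left neuron exactly once.
first_sweep = sorted(n for row in schedule[:3] for n in row)
print(first_sweep == list(range(12)))
```

Because exactly one address is generated per memory per cycle, no left memory is read twice in the same cycle, and since each address visits all $D_{i}$ rows during a sweep, every left neuron is read once per sweep regardless of the seed chosen.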

III-D Batch size

It is common in training of NNs to use minibatches. For a batch size of $M$, the UP operation in (7a) is performed only once for $M$ inputs by using the average over the $M$ gradients. Our architecture performs an UP for every input and therefore may be viewed as having batch size one. However, the processing in our architecture differs from a typical software implementation with $M=1$ due to the pipelined and parallel operations. Specifically, in our architecture, FF and BP for the same input use different weights, as implied by Fig. 2(c). In results not presented here, we found no performance degradation due to this variation from the standard backpropagation algorithm. There is considerable ambiguity in the literature regarding ideal batch sizes [41, 42], and we found that our current network architecture performed well in our initial hardware implementation [40]. However, if a more conventional batch size is desired, the UP logic can be removed from the junction pipeline and the UP operation performed once every $M$ inputs. This would eliminate some arithmetic units at the cost of increased storage for accumulating intermediate values from (7a).

III-E Special Case: Processing an FC junction

Fig. 5 shows the FC version of the junction from Fig. 4, which has 96 edges to be accessed and operated on. This can be done keeping the same junction cycle $C_{i}=6$ by increasing $z_{i}$ to 16, *i.e.*, using more hardware. On the other hand, if hardware resources are limited, one can use the same $z_{i}=4$ and pay the price of a longer junction cycle $C_{i}=24$, as shown in Fig. 5. This demonstrates the flexibility of our architecture.

Note that FC junctions are clash-free in all practical cases due to the following reasons. Firstly, the left memory accesses are in natural order just like the weights, which ensures that no more than one element is accessed from each memory per cycle. Secondly, $\left\lceil z_{i}/d^{\mathrm{in}}_{i}\right\rceil=1$ for all practical cases since $z_{i}\leq N_{i-1}$, as discussed in Appendix B, and $d^{\mathrm{in}}_{i}=N_{i-1}$ for FC junctions. This means that at most one right neuron is processed in a cycle, so clashes will never occur when accessing the right memory bank. (Footnote 7: In Fig. 5 for example, one right neuron finishes processing every $3^{\mathrm{rd}}$ cycle.)

Note that compared to Fig. 4, the weight memories in Fig. 5 are deeper since $C_{i}$ has increased from 6 to 24. However, the left layer memories remain the same size since $N_{i-1}=12$ and $z_{i}=4$ are unchanged, but the left memory bank is accessed more times since the number of sweeps has increased from 2 to 8. Also note that even if cycle 0 (blue) accesses some other clash-free subset of left neurons, such as $\{4,5,6,7\}$ instead of $\{0,1,2,3\}$, the connection pattern would remain unchanged. This implies that different memory access patterns do not necessarily lead to different connection patterns, as discussed further in Appendix C.

IV Observed Trends of Pre-Defined Sparsity

This section analyzes trends observed when experimenting with several different datasets via software simulations. We intend the following four trends to provide guidelines on designing pre-defined sparse NNs.

1. Hardware-compatible, clash-free, pre-defined sparse patterns perform at least as well as other pre-defined sparse patterns (*i.e.*, random and structured) (Sec. IV-B).

2. The performance of pre-defined sparsity is better on datasets that have more inherent redundancy (Sec. IV-C).

3. Junction density should increase to the right: junctions closer to the output should generally have more connections than junctions closer to the input (Sec. IV-D).

4. Larger and more sparse NNs are better than smaller and denser NNs, given the same number of layers and trainable parameters. Specifically, ‘larger’ refers to more hidden neurons (Sec. IV-E).

The remainder of this section first describes the datasets we experimented on, and then examines these trends in detail.

IV-A Datasets and Experimental Configuration

Unless otherwise noted, the parameters and configurations listed below were used for all presented results.

MNIST handwritten digits

We rasterized each input image into a single layer of 784 features, *i.e.*, the permutation-invariant format. (Footnote 8: On certain occasions we added 16 input features which are always trivially 0 so as to get 800 features for each input. This leads to easier selection of different sparse network configurations.) No data augmentation was applied.

Reuters RCV1 corpus of newswire articles

The classification categories are grouped in a tree structure. We used preprocessing techniques similar to [43] to isolate articles which fell under a single category at the second level of the tree. We finally obtained 328,669 articles in 50 categories, split into 50,000 for validation, 100,000 for test, and the remainder for training. The original data has a list of token strings for each story; for example, a story on finance would frequently contain the token ‘financ’. We chose the 2000 most common tokens and computed counts for each of these in each article. Each count $x$ was transformed into $\log(1+x)$ to form the final 2000-dimensional feature vector for each input.
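The count-to-feature mapping described above can be illustrated as follows. This is a minimal sketch: the three-token vocabulary and the toy article are hypothetical stand-ins for the 2000 most common tokens and a real RCV1 story, and the function name is chosen only for illustration.

```python
import math
from collections import Counter


def log_count_features(tokens, vocabulary):
    """Map an article's token list to log(1 + count) features over a fixed vocabulary."""
    counts = Counter(tokens)  # tokens absent from the article count as 0
    return [math.log1p(counts[token]) for token in vocabulary]


# Hypothetical vocabulary and article, for illustration only.
vocabulary = ["financ", "market", "bank"]
article = ["financ", "bank", "financ", "rate", "financ"]
features = log_count_features(article, vocabulary)
print(features)
```

Tokens outside the vocabulary (here `rate`) are ignored, and a token that never appears maps to $\log(1+0)=0$, so the feature vector stays sparse for short articles.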

TIMIT speech corpus

TIMIT is a speech dataset comprising approximately $5.4$ hours of 16 kHz audio commonly used in automatic speech recognition (ASR). A modern ASR system has three major components: (i) preprocessing and feature extraction, (ii) acoustic model, and (iii) dictionary and language model. A complete study of an ASR system is beyond the scope of this work. Instead we focus on the acoustic model which is typically implemented using an NN. The input to the acoustic model is feature vectors and the output is a probability distribution on phonemes (*i.e.*, speech sounds). For our experiments, we used 25ms speech frames with 10ms shift, as in [43], and computed a feature vector of 39 Mel-frequency cepstral coefficients (MFCCs) for each frame. We used the complete training set of 818,837 training samples (462 speakers), 89,319 validation samples (50 speakers), and 212,093 test samples (118 speakers). We used a phoneme set of size 39 as defined in [44].

CIFAR-100 images

Our setup for CIFAR-100 consists of a CNN followed by an MLP. The CNN has 3 blocks and each block has 2 convolutional layers with window size 3x3 followed by a max pooling layer of pool size 2x2. The number of filters for the six convolutional layers is (60,60, 125,125, 250,250). This results in a total of approximately one million trainable parameters in the convolutional portion of the network. Batch normalization is applied before activations. The output from the 3rd block, after flattening into a vector, has 4000 features. Typically dropout is applied in the MLP portion; however, we omitted it there since pre-defined sparsity is an alternate form of parameter reduction. Instead we found that a dropout probability of half applied to the convolutional blocks improved performance. No data augmentation was applied.
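The parameter count and the 4000-feature output quoted above can be reproduced with simple arithmetic. The sketch below assumes 3 input color channels, one bias per filter, and convolutions that preserve the spatial size so that only the three 2x2 poolings shrink the 32x32 image. Batch normalization parameters are not counted, and the helper name is hypothetical.

```python
def conv_params(in_channels, out_channels, kernel=3, bias=True):
    """Trainable parameters of one 2D convolutional layer."""
    return in_channels * out_channels * kernel * kernel + (out_channels if bias else 0)


filters = (60, 60, 125, 125, 250, 250)
channels = (3,) + filters  # input channels of each layer, starting from RGB
total = sum(conv_params(c_in, c_out) for c_in, c_out in zip(channels, filters))

# Three 2x2 poolings reduce 32x32 to 4x4 (assuming size-preserving convolutions).
side = 32 // 2 // 2 // 2
flattened = side * side * filters[-1]
print(total, flattened)
```

Under these assumptions the convolutional portion has a little under 1.1 million parameters, consistent with the "approximately one million" stated in the text.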

For each dataset, we performed classification using one-hot labels and measured accuracy on the test set as a performance metric. (Footnote 9: The NN in a complete ASR system would be a ‘soft’ classifier and feed the phoneme distribution outputs to a decoder to perform ‘hard’ final classification decisions. Therefore for TIMIT, we computed another performance metric called TPC, measured as KL divergence between predicted test output probability distributions of sparse vs the respective FC case. Performance results obtained using TPC were qualitatively very similar to test accuracy and not shown here.) We also calculated the top-5 test set classification accuracy for CIFAR-100.

We found the optimal training configuration for each FC setup by doing a grid search using validation performance as a metric. This resulted in choosing ReLU activations for all layers except for the final softmax layer. The initialization proposed by He et al. [45] worked best for the weights, while for biases, we found that an initial value of $0.1$ worked best in all cases except for Reuters, for which zeroes worked better. The Adam optimizer [46] was used with all parameters set to default, except that we set the decay parameter to $10^{-5}$ for best results. We used a batch size of 1024 for TIMIT and Reuters since the number of training samples is large, and 256 for MNIST and CIFAR.

All experiments were run for 50 epochs of training and regularization was applied as an L2 penalty to the weights. To maintain consistency, we kept most hyperparameters the same when sparsifying the network, but reduced the L2 penalty coefficient with increasing sparsity. This was done because sparse NNs have fewer trainable parameters and are less prone to overfitting. We ran each experiment at least five times to average out randomness and we show the 90% confidence intervals (CIs) for each metric as shaded regions (this also holds for the results in Fig. 1(c,h)). In addition to the results shown, we developed a data set of Morse code symbol sequences and investigated pre-defined sparse NNs on it. While these results are excluded for brevity, they are consistent with the trends described in this Section, and can be found in [47].

IV-B Comparison of Pre-Defined Sparse Methods

Table II shows performance on different datasets for three methods of pre-defined sparsity: a) the most restrictive and hardware-friendly clash-freedom, b) structured, and c) random. For the clash-free case, we experimented with different $\bm{z}_{\mathrm{net}}$ settings to simulate different hardware environments:

• Reuters: One junction cycle is 50 cycles for all the different densities. This is because we scale $\bm{z}_{\mathrm{net}}$ accordingly, *i.e.*, a more powerful hardware device is used for each NN as $\rho_{\mathrm{net}}$ increases.

• CIFAR-100 and MNIST: These simulate cases where hardware choice is limited, such as a high-end, a mid-range and a low-end device being available. Thus three different $\bm{z}_{\mathrm{net}}$ values are used for CIFAR-100 depending on $\rho_{\mathrm{net}}$.

• TIMIT: We keep $\bm{z}_{\mathrm{net}}$ constant for different densities. Junction cycle length varies from 90 cycles for $\rho_{\mathrm{net}}=7.69\%$ to 810 for $\rho_{\mathrm{net}}=69.23\%$. This shows that when limited to a single low-end hardware device, denser NNs can be processed in longer time by simply lengthening the junction cycle, with $\bm{z}_{\mathrm{net}}$ unchanged.

Table II confirms that hardware-friendly clash-free pre-defined sparse architectures do not lead to any statistically significant performance degradation. We also observed that random pre-defined sparsity performs poorly for very low density networks, as shown by the blue values. This is possibly because there is non-negligible probability of neurons getting completely disconnected, leading to irrecoverable loss of information.

IV-C Dataset Redundancy

Many machine learning datasets have considerable redundancy in their input features. For example, one may not need information from all $\sim$800 input features of MNIST to infer the correct image class. We hypothesize that pre-defined sparsity takes advantage of this redundancy, and will be less effective when the redundancy is reduced. To test this, we changed the feature vector for each dataset as follows. For MNIST, principal component analysis (PCA) was used to reduce the feature count to the least redundant 200. For Reuters, the number of most frequent tokens considered as features was reduced from 2000 to 400. For TIMIT, we both reduced and increased the number of MFCCs by 3X to 13 and 117, respectively. Note that the latter increases redundancy. For CIFAR-100, a source of redundancy is the depth of the CNN, which extracts features and discriminates between classes before the MLP performs final classification. In other words, the CNN eases the burden of the MLP. So a way to reduce redundancy and increase the classification burden of the MLP is to lessen the effectiveness of the CNN by reducing its depth. Accordingly, we used a single convolutional layer with 250 filters of window size $5\times 5$ followed by an $8\times 8$ max pooling layer. This results in the same number of features, 4000, at the input of the MLP as the original network, but has reduced redundancy for the MLP.

Classification performance results are shown in Fig. 6 as a function of $\rho_{\mathrm{net}}$. For MNIST and CIFAR-100, the performance degrades more sharply with reducing $\rho_{\mathrm{net}}$ for the nets using the reduced redundancy datasets. To explore this further, we recreated the histograms from Fig. 1 for the reduced redundancy datasets, *i.e.*, an FC NN with $\bm{N}_{\mathrm{net}}=(200,100,10)$ training on MNIST after PCA. We observed a wider spread of weight values, implying less opportunity for sparsification (*i.e.*, fewer weights were close to zero). Similar trends are less discernible for Reuters and TIMIT; however, reducing redundancy led to worse performance overall.

The results in Fig. 6 further demonstrate the effectiveness of pre-defined sparsity in greatly reducing network complexity with negligible performance degradation. For example, even the reduced redundancy problems perform well when operating with half the number of connections. For CIFAR in particular, FC performs worse than an overall MLP density of around 20%. Thus, in addition to reducing complexity, structured pre-defined sparsity may be viewed as an alternative to dropout in the MLP for the purpose of improving classification performance.

IV-D Individual junction densities

The weight histograms in Fig. 1 indicate that later junctions, particularly junction $L$ closest to the output, have a wide spread of weight values. This suggests that a good strategy for reducing $\rho_{\mathrm{net}}$ would be to use lower densities in earlier junctions – *i.e.*, $\rho_{1}<\rho_{L}$. This is demonstrated in Fig. 7 for the cases of MNIST, CIFAR-100 and Reuters, each with $L=2$ junctions in their MLPs. Each curve in each subfigure is for a fixed $\rho_{2}$, *i.e.*, reducing $\rho_{\mathrm{net}}$ across a curve is done solely by reducing $\rho_{1}$. For a fixed $\rho_{\mathrm{net}}$, the performance improves as $\rho_{2}$ increases. For example, the circled points in Reuters both have $\rho_{\mathrm{net}}=4\%$, but the starred point with $\rho_{2}=100\%$ has approximately $40\%$ better test accuracy than the pentagonal point with $\rho_{2}=2\%$. The trend clearly holds for MNIST and is also discernible for CIFAR-100.

We further observed that this trend (*i.e.*, $\rho_{i+1}>\rho_{i}$ should hold) is related to the redundancy inherent in the dataset and may not hold for datasets with very low levels of redundancy. To explore this, results analogous to those in Fig. 7 are presented in Fig. 8 for TIMIT, but with varying sized MFCC feature vectors – *i.e.*, datasets corresponding to larger feature vectors will contain more redundancy. The results in Fig. 8(c) are for 117 dimensional MFCCs and are consistent with the trend in Fig. 7. However, for an MFCC dimension of 13, this trend actually reverses – *i.e.*, junction 1 should have higher density. This is shown in Fig. 8(b), where each curve is for a fixed $\rho_{1}$. This reversed trend is also observed for the case of 39 dimensional feature vectors, considered in Fig. 8(a), where $\bm{N}_{\mathrm{net}}=(39,390,39)$. Due to this symmetric neuronal configuration, for each value of $\rho_{\mathrm{net}}$ on the x-axis in Fig. 8(a), the two curves have complementary values of $\rho_{1}$ and $\rho_{2}$ ($\rho_{1}\neq\rho_{2}$) – *e.g.*, the two curves at $\rho_{\mathrm{net}}=7.69\%$ have $(\rho_{1},\rho_{2})$ pairs of $(2.56\%,12.82\%)$ and $(12.82\%,2.56\%)$. We observe that the curve for $\rho_{1}<\rho_{2}$ is generally worse than the curve for $\rho_{2}<\rho_{1}$, which indicates that junction 1 should have higher density in this case.

Fig. 8(d) depicts the results for Reuters with the feature vector size reduced to 400 tokens. While junction 2 is still more important (as in Fig. 7(c) for the original Reuters dataset), notice the circled star-point at the very left of the $\rho_{2}=100\%$ curve. This point has very low $\rho_{1}$. Unlike Fig. 7(c), it crosses below the other curves, indicating that it is more important to have higher density in the first junction with this less redundant set of features. We observed a similar, but less prominent, trend in MNIST PCA when the feature dimension was reduced to 200.

In summary, if an individual junction density falls below a certain value, referred to as the critical junction density, it will adversely affect performance regardless of the density of other junctions. This explains why some of the curves cross in Fig. 8. The critical junction density is much smaller for earlier junctions than for later junctions in most datasets with sufficient redundancy. However, the critical density for earlier junctions increases for datasets with low redundancy.

IV-E ‘Large and sparse’ vs ‘small and dense’ networks

We observed that when keeping the total number of trainable parameters the same, sparser NNs with larger hidden layers (*i.e.*, more neurons) generally performed better than denser networks with smaller hidden layers. This is true as long as the larger NN is not so sparse that individual junction densities fall below the critical density, as explained in Sec. IV-D. While the critical density is problem-dependent, it is usually low enough to obtain significant complexity savings above it. Thus, ‘large and sparse’ is better than ‘small and dense’ for many practical cases, including NNs with more than one hidden layer (*i.e.*, $L>2$).

Fig. 9 shows this for networks having one and three hidden layers trained on MNIST. For the network with three hidden layers, all hidden layers have the same number of neurons. Each solid curve shows classification performance vs $\rho_{\mathrm{net}}$ for a particular $\bm{N}_{\mathrm{net}}$, while the black dashed curves with identical markers are configurations that have approximately the same number of trainable parameters. As an example, the points with circular markers (with a big blue ellipse around them) in Fig. 9(b) all have approximately the same number of trainable parameters and indicate that the larger, more sparse NNs perform better. Specifically, the network with $\bm{N}_{\mathrm{net}}=(784,112,112,112,10)$ and $\bm{d}^{\mathrm{out}}_{\mathrm{net}}=(10,10,10,10)$ corresponding to $\rho_{\mathrm{net}}=9.82\%$ performs significantly better than the FC network with $\bm{N}_{\mathrm{net}}=(784,14,14,14,10)$, and other smaller and denser networks, despite each having about $11500$ trainable parameters. Increasing the network size further to $\bm{N}_{\mathrm{net}}=(784,224,224,224,10)$, and reducing $\rho_{\mathrm{net}}$ to $4\%$ to fix the number of trainable parameters at about $11500$, leads to performance degradation. This is because this $\rho_{\mathrm{net}}$ was achieved by setting $\rho_{2}=\rho_{3}=2.68\%$, which appears to be below the critical density.
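The matched-parameter comparison above can be checked by counting weights and biases for the two configurations named in the text. This is a sketch under the assumption that every non-input neuron has one bias; the helper names are hypothetical.

```python
def trainable_parameters(n_net, d_out=None):
    """Return (weights, biases) of an MLP; d_out=None means fully connected."""
    weights = 0
    for i in range(1, len(n_net)):
        degree = n_net[i] if d_out is None else d_out[i - 1]
        weights += n_net[i - 1] * degree
    biases = sum(n_net[1:])  # assumed: one bias per non-input neuron
    return weights, biases


def net_density(n_net, d_out):
    """Overall density: sparse weight count divided by the FC weight count."""
    sparse, _ = trainable_parameters(n_net, d_out)
    full, _ = trainable_parameters(n_net)
    return sparse / full


small_dense = (784, 14, 14, 14, 10)
large_sparse = (784, 112, 112, 112, 10)
print(trainable_parameters(small_dense))
print(trainable_parameters(large_sparse, (10, 10, 10, 10)))
print(f"{net_density(large_sparse, (10, 10, 10, 10)):.2%}")
```

Both configurations land within a few dozen parameters of each other (roughly 11,500 in total), while the larger network spreads them over eight times as many hidden neurons.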

Fig. 10 summarizes the analogous experiment on Reuters with similar conclusions. Both subfigures are for the same results with the x-axis split into higher and lower density range (on log scale), to show more detail. Observe that the trend of ‘large and sparse’ being better than ‘small and dense’ holds for subfigure (a), but reverses for (b) since densities are very low (the black dashed curves have positive slope instead of negative). This is due to the critical density effect.

Fig. 11(a) shows the result for the same experiment on TIMIT with four hidden layers. (Footnote 11: We also performed experiments on TIMIT with one hidden layer ($L=2$) and Reuters with 2 hidden layers ($L=3$). Results were similar to those shown, so are not shown for brevity’s sake.) The trend is less clearly discernible, but it exists. Notice how the black dashed curves have negative slopes at appreciable levels of $\rho_{\mathrm{net}}$, indicating ‘large and sparse’ being better than ‘small and dense’, but high positive slopes at low $\rho_{\mathrm{net}}$, indicating the rapid degradation in performance as density is reduced beyond the critical density. This is exacerbated by the fact that TIMIT with 39 MFCCs is a dataset with low redundancy, so the effects of very low $\rho_{\mathrm{net}}$ are better observed.

Fig. 11(b) for the MLP portion of CIFAR-100 shows similar results as TIMIT, but on a log x-scale for more clarity. As noted in Sec. IV-C, the best performance for a given $\bm{N}_{\mathrm{net}}$ occurs at an overall density less than $100\%$. It appears that for any $\bm{N}_{\mathrm{net}}$ for CIFAR-100, peak performance occurs at around $10$–$20\%$ overall MLP density. In experiments not shown here, we obtained similar results for the reduced redundancy net with a single convolutional layer.

V Comparison to Other Sparse NN Methods

Numerical results in Sec. IV showed that hardware-compatible clash-free connection patterns performed as well as structured and random pre-defined sparse connections. In this section, we compare clash-free patterns against two sparsity approaches that are less constrained than the structured pre-defined sparsity considered in Sec. IV. In particular, both approaches remove the constraint of regular degree – *i.e.*, these approaches yield sparse NNs that have varying $d^{\mathrm{out}}_{i}$ and $d^{\mathrm{in}}_{i}$ selected to optimize classification performance.

V-A Attention-based Preprocessed Sparsity

Previous works [48, 49] have applied the concept of attention on object recognition and image captioning to achieve better performance with fewer parameters and less computation. We simplify this idea by computing the variance of input features as attention and setting the out-degree of the neurons of the input layer based on this value. Specifically, the feature variances are quantized into three levels, and input neurons with higher attention are assigned more connections than those with lower attention. For the neurons in later layers, we use uniform out-degree and in-degree.

V-B Learning Structured Sparsity during Training

While the method in Sec. V-A obtains a non-uniform neuron out-degree for the first layer, it only considers the properties of the dataset and not the learning process. We also compared against the method of learning structured sparsity (LSS), which learns a good sparse connection pattern during training. This method was proposed in [14] and prunes the connections during training by using a sparse-promoting penalty function as part of the objective function. Example penalty functions include L1 and L1/L2 used in Lasso [50] and group-Lasso [51], respectively. During training, the optimizer minimizes a balancing objective comprising the loss function $l(\cdot)$ (footnote 12: here we emphasize that the loss function depends on all of the trainable parameters in the network, as opposed to the output layer activations and ground truth labels as done in Sec. II-A, and therefore can promote sparsity by driving some edge weights to zero), the regularizer $r(\cdot)$, and a sparse-promoting penalty function $p(\cdot)$,

[TABLE]

where the penalty coefficients $\{\gamma_{i}\}_{i=1}^{L}$ control the density of each junction. Increasing $\gamma_{i}$ decreases $\rho_{i}$; however, obtaining a specific value of $\rho_{i}$ requires experimental tuning of $\gamma_{i}$. In the results presented in this section, we used L1 as the element-wise sparse-promoting penalty function and L2 as the regularizer. Note that, in contrast to the attention-based method and the structured pre-defined sparsity approach, LSS is not a pre-defined sparsity method. Instead training in LSS begins with an FC network, which means that training complexity is similar to that of an FC NN. At the end of the LSS training process, weights with absolute value below a threshold are set to zero to achieve the target density.
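The two ingredients described above, a penalized objective and a final magnitude threshold, can be sketched as follows. The exact objective is not reproduced in this copy of the text, so the form used here (loss plus an L2 term with a single hypothetical coefficient `l2_coeff`, plus a per-junction L1 term weighted by $\gamma_{i}$) is an assumption, as is picking the threshold directly from the target density. Function names are illustrative and are not taken from [14].

```python
def lss_objective(loss, weights_per_junction, gammas, l2_coeff):
    """Loss + L2 regularizer + per-junction L1 penalty (assumed form)."""
    l2 = sum(w * w for junction in weights_per_junction for w in junction)
    l1 = sum(gamma * sum(abs(w) for w in junction)
             for gamma, junction in zip(gammas, weights_per_junction))
    return loss + l2_coeff * l2 + l1


def prune_to_density(weights, target_density):
    """Zero the smallest-magnitude weights so about target_density of them remain."""
    keep = round(target_density * len(weights))
    if keep == 0:
        return [0.0] * len(weights)
    # Threshold = magnitude of the keep-th largest weight; ties may keep a few extra.
    threshold = sorted((abs(w) for w in weights), reverse=True)[keep - 1]
    return [w if abs(w) >= threshold else 0.0 for w in weights]


# Toy junction weights, for illustration only.
junction = [0.9, -0.02, 0.4, 0.01, -0.7, 0.03, 0.0, -0.05]
pruned = prune_to_density(junction, target_density=0.25)
print(pruned)
print(lss_objective(loss=1.0, weights_per_junction=[junction], gammas=[0.1], l2_coeff=0.0))
```

A larger $\gamma_{i}$ raises the cost of every nonzero weight in junction $i$, which is why it pushes $\rho_{i}$ down, but the resulting density is only known after training, hence the experimental tuning mentioned in the text.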

V-C Performance comparison

Fig. 12 compares performance versus $\rho_{\mathrm{net}}$ of different sparse NNs on MNIST, Reuters, and TIMIT. The individual density of each junction with the attention-based preprocessed sparse method is set to be identical to the density of each junction using the clash-free pre-defined sparse method. However, the density of the nets using the LSS method can be tuned only with the penalty coefficients. We tuned these to approximately match the density of the other methods. (Footnote 13: This is why $\rho_{\mathrm{net}}$ values of the green curves do not perfectly align with the pre-defined sparsity curves.)

The LSS method performs best among all sparse methods, which is to be expected as it is the least constrained and also discovers a good sparse connection pattern during training. However, the performance with clash-free pre-defined sparsity is near that of the attention-based and LSS methods – *i.e.*, within $2\%$ in terms of test accuracy at $\rho_{\mathrm{net}}=20\%$. We conclude that even though the clash-free patterns are highly structured and pre-defined, there is no significant performance degradation when compared to advanced methods for producing sparse models by exploiting specific properties of the dataset or learning sparse patterns during training.

VI Conclusions and Future Work

In this work we proposed a new technique for complexity reduction of neural networks – pre-defined sparsity – in which a fixed sparse connection pattern is enforced prior to training and held fixed during both training and inference. We presented a hardware architecture suited to leverage the benefits of structured pre-defined sparsity, capable of parallel and pipelined processing. The architecture can be used for both training and inference modes, and supports networks of arbitrary density, including conventional fully-connected ones. Flexibility is afforded by the degree of parallelism $\bm{z}_{\mathrm{net}}$, which trades hardware complexity for speed. Simple methods for clash-free memory access are presented and these methods are shown to achieve performance on par with the best known methods for obtaining sparse MLPs.

Using extensive numerical experiments, we identified trends which help in designing pre-defined sparse networks. Firstly, it is better to allocate connections in a structured manner rather than randomly. Secondly, for most datasets with high redundancy, earlier junctions can be made more sparse. Thirdly, it is better to have more neurons in the hidden layers, and then sparsify aggressively to keep the number of edges low and reduce complexity.

As motivated in the Introduction, the rapidly growing complexity associated with modern NNs is a major challenge. Pre-defined sparsity is a simple method to help address this challenge, as is acceleration with custom hardware. Interesting areas for future research include analytical approaches to justify the trends observed in this work and improving our initial hardware implementation in [40]. It is also interesting to consider extending the methods introduced herein to convolutional layers and recurrent architectures. Finally, truly speeding the training process by orders of magnitude would allow more extensive search over NN architectures and therefore a better understanding of the largely empirical process of NN design.

Appendix A Structured Pre-Defined Sparsity Constraints

In our structured pre-defined sparse network, $\rho_{i}$ , the density of junction $i$ , cannot be arbitrary, since $\rho_{i}=d^{\mathrm{out}}_{i}/N_{i}=d^{\mathrm{in}}_{i}/N_{i-1}$ , where $d^{\mathrm{out}}_{i}$ and $d^{\mathrm{in}}_{i}$ are natural numbers satisfying the equation $N_{i-1}d^{\mathrm{out}}_{i}=N_{i}d^{\mathrm{in}}_{i}$ . Therefore, the number of possible $\rho_{i}$ values is the same as the number of $\left(d^{\mathrm{out}}_{i},d^{\mathrm{in}}_{i}\right)$ values satisfying the structured pre-defined sparsity constraints:

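$$d^{\mathrm{out}}_{i}=\frac{N_{i}\,d^{\mathrm{in}}_{i}}{N_{i-1}},\qquad d^{\mathrm{out}}_{i}\in\mathbb{N},\qquad d^{\mathrm{in}}_{i}\in\{1,2,\cdots,N_{i-1}\},$$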

where $\mathbb{N}$ denotes the set of natural numbers.

The smallest value of $d^{\mathrm{in}}_{i}$ which satisfies $d^{\mathrm{out}}_{i}\in\mathbb{N}$ is $N_{i-1}/\mathrm{gcd}(N_{i-1},N_{i})$ , and other values are its integer multiples. Since $d^{\mathrm{in}}_{i}$ is upper bounded by $N_{i-1}$ , the total number of possible $\left(d^{\mathrm{out}}_{i},d^{\mathrm{in}}_{i}\right)$ is $\mathrm{gcd}(N_{i-1},N_{i})$ . Thus, the set of possible $\rho_{i}$ is

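$$\rho_{i}\in\left\{\frac{k}{\mathrm{gcd}(N_{i-1},N_{i})}\;:\;k\in\{1,2,\cdots,\mathrm{gcd}(N_{i-1},N_{i})\}\right\}.$$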

As a concrete example, consider a NN with $\bm{N}_{\mathrm{net}}=(117,390,13)$ . The number of possible densities of each junction is determined by $\mathrm{gcd}(117,390)=39$ and $\mathrm{gcd}(390,13)=13$ , respectively. Therefore, the sets of junction densities are

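$$\rho_{1}\in\left\{\frac{1}{39},\frac{2}{39},\cdots,\frac{38}{39},1\right\},\qquad\rho_{2}\in\left\{\frac{1}{13},\frac{2}{13},\cdots,\frac{12}{13},1\right\}.$$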
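The enumeration above can be checked numerically. The following is a minimal sketch, assuming only the relation $N_{i-1}d^{\mathrm{out}}_{i}=N_{i}d^{\mathrm{in}}_{i}$ stated in this appendix; the function name `admissible_junctions` is hypothetical and not part of the original work.

```python
from fractions import Fraction
from math import gcd


def admissible_junctions(n_prev: int, n_next: int):
    """List every (d_out, d_in, rho) allowed for a junction whose left layer
    has n_prev neurons and whose right layer has n_next neurons.

    Hypothetical helper: it only applies N_{i-1} * d_out = N_i * d_in.
    """
    g = gcd(n_prev, n_next)
    d_in_step = n_prev // g   # smallest in-degree giving an integer out-degree
    d_out_step = n_next // g  # the matching out-degree
    return [(k * d_out_step, k * d_in_step, Fraction(k, g))
            for k in range(1, g + 1)]


if __name__ == "__main__":
    n_net = (117, 390, 13)
    for i in range(1, len(n_net)):
        options = admissible_junctions(n_net[i - 1], n_net[i])
        d_out, d_in, rho = options[0]
        print(f"junction {i}: {len(options)} densities, sparsest rho = {rho} "
              f"(d_out = {d_out}, d_in = {d_in})")
```

For the network in the example this reports 39 admissible densities for junction 1 and 13 for junction 2, in agreement with the two gcd values.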

Appendix B Hardware Architecture Constraints

The depth of left memories in our hardware architecture is $D_{i}=N_{i-1}/z_{i}$ . Thus $N_{i-1}$ should preferably be an integral multiple of $z_{i}$ . This is not a burdensome constraint since the choice of $z_{i}$ is independent of network parameters and depends on the capacity of the device. In the unusual case that this constraint cannot be met, the extra cells in memories can be filled with dummy values such as 0.
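As a small illustration of the padding remark, the sketch below computes the memory depth and the number of dummy cells when $N_{i-1}$ is not a multiple of $z_{i}$ . Rounding the depth up to the next integer is an assumption about how the padding would be realized, and `left_memory_layout` is a hypothetical name.

```python
def left_memory_layout(n_left: int, z: int):
    """Return (depth, padding) for a bank of z left memories holding n_left values.

    Assumption: the depth is rounded up and unused cells hold dummy zeros.
    """
    if n_left <= 0 or z <= 0:
        raise ValueError("n_left and z must be positive")
    depth = (n_left + z - 1) // z   # ceil(n_left / z) using integers only
    padding = depth * z - n_left    # cells that would hold dummy values
    return depth, padding


if __name__ == "__main__":
    print(left_memory_layout(800, 160))  # z divides N_{i-1}: no padding needed
    print(left_memory_layout(100, 8))    # z does not divide N_{i-1}
```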

There are also two conditions placed on the $z$ values to eliminate stalls in processing: for all layers $i\in\{1,\cdots,L\}$ , (i) $\left|\bm{W}_{i}\right|/z_{i}=C$ for a constant $C$ common to all junctions, and (ii) $z_{i+1}\geq\left\lceil z_{i}/d^{\mathrm{in}}_{i}\right\rceil$ . Using the definitions from Sec. II-A, (i) is equivalent to $z_{i+1}=z_{i}d^{\mathrm{out}}_{i+1}/d^{\mathrm{in}}_{i}$ . Then, (ii) can be equivalently written as

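$$d^{\mathrm{out}}_{i+1}\geq\frac{d^{\mathrm{in}}_{i}}{z_{i}}\left\lceil\frac{z_{i}}{d^{\mathrm{in}}_{i}}\right\rceil \tag{13}$$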

which needs to be satisfied $\forall\,i\in\{1,\cdots,L-1\}$ . In practice, it is desirable to design $z_{i}/d^{\mathrm{in}}_{i}$ to be an integer so that an integral number of right neurons finish processing every cycle. This simplifies hardware implementation by eliminating the need for additional storage, for example, of the intermediate activation values during FF. In this case, (13) reduces to $d^{\mathrm{out}}_{i+1}\geq 1$ , which is always true.

For non-integral $z_{i}/d^{\mathrm{in}}_{i}$ , there are two cases. If $z_{i}>d^{\mathrm{in}}_{i}$ , (13) reduces to $d^{\mathrm{out}}_{i+1}\geq 2$ . On the other hand, if $z_{i}<d^{\mathrm{in}}_{i}$ , there is no bound on the right hand side of (13). In general, note that (13) becomes a burdensome constraint only if $d^{\mathrm{in}}_{i}$ is large, and $d^{\mathrm{out}}_{i+1}$ and $z_{i}$ are both desired to be small. This corresponds to earlier junctions being denser than later ones, which is typically not desirable according to the observations in Sec. IV-D, or to very limited hardware resources. We thus conclude that (13) is not a limiting constraint in most practical cases.
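The two conditions can be checked mechanically for a candidate design. The sketch below is illustrative only: it assumes $\left|\bm{W}_{i}\right|=N_{i-1}d^{\mathrm{out}}_{i}$ , derives every $z_{i+1}$ from $z_{1}$ through condition (i), and then tests condition (ii). The function name `check_parallelism` and the choice $z_{1}=160$ are hypothetical; the network sizes are those of Table I.

```python
from fractions import Fraction
from math import ceil


def check_parallelism(n_net, d_out, z1):
    """Derive z_2..z_L from z_1 via condition (i) and test condition (ii).

    n_net = (N_0, ..., N_L); d_out[j] is the out-degree of junction j + 1.
    Hypothetical helper, assuming |W_i| = N_{i-1} * d_out_i.
    """
    num_junctions = len(d_out)
    d_in, weights = [], []
    for j in range(num_junctions):
        edges = n_net[j] * d_out[j]
        if edges % n_net[j + 1]:
            raise ValueError(f"junction {j + 1}: in-degree is not an integer")
        d_in.append(edges // n_net[j + 1])
        weights.append(edges)

    z = [Fraction(z1)]
    for j in range(num_junctions - 1):
        # Condition (i): every junction takes the same number of cycles C.
        z.append(z[j] * d_out[j + 1] / d_in[j])

    return {
        "d_in": d_in,
        "z": z,
        "cycles": [Fraction(w) / zi for w, zi in zip(weights, z)],
        "z_integral": all(zi.denominator == 1 for zi in z),
        # Condition (ii): the next junction keeps up with finished right neurons.
        "no_stall": all(z[j + 1] >= ceil(z[j] / d_in[j])
                        for j in range(num_junctions - 1)),
    }


if __name__ == "__main__":
    report = check_parallelism((800, 100, 10), (20, 10), z1=160)
    for key, value in report.items():
        print(key, value)
```

Exact rational arithmetic is used so that a non-integral $z_{i+1}$ is reported rather than silently rounded.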

Appendix C Clash-Free Patterns

Specifying $N_{i-1}$ , $N_{i}$ , $d^{\mathrm{in}}_{i}$ and $z_{i}$ for junction $i$ in a clash-free structured pre-defined sparse NN does not uniquely define a connection pattern (unless it is FC). This section discusses the number of possible left memory access patterns $S_{M_{i}}$ for such a junction $i$ . Note that the total number of possible memory access patterns for the complete NN is $S_{M}=\prod_{i=1}^{L}{S_{M_{i}}}$ .

When $z_{i}\geq d^{\mathrm{in}}_{i}$ , which is expected to be true for practical cases of implementing sparse NNs on powerful hardware devices, $S_{M_{i}}$ is also equal to the number of possible connection patterns $S_{C_{i}}$ , which is the key quantity of interest. This is because if $z_{i}\geq d^{\mathrm{in}}_{i}$ , at least one right neuron is completely processed in some cycle. Thus, changing the left memory access pattern will change the left neurons to which that right neuron connects, thereby changing the connection pattern. This one-to-one correspondence results in $S_{M_{i}}=S_{C_{i}}$ .

For the case of $z_{i}<d^{\mathrm{in}}_{i}$ , a FC junction provides an example where $S_{M_{i}}\neq S_{C_{i}}$ . Specifically, in this case $S_{C_{i}}=1$ as there is only one way to fully connect all neurons, but there are many clash-free memory access patterns, as shown in the following equations (14)-(16).

We now discuss various types of clash-freedom, and $S_{M_{i}}$ arising from each:

• Type 1: This is as described in Sec. III-C, and recapitulated in Fig. 13(a). $S_{M_{i}}$ is the number of ways of designing $\bm{\phi}_{i}$ , i.e.,

$$S_{M_{i}}=D_{i}^{z_{i}} \tag{14}$$

• Type 2 (implemented in our earlier work [40]): In this technique, a new $\bm{\phi}_{i}$ is defined for every sweep. Considering the example in Fig. 13(b), $\bm{\phi}_{i}=(1,0,2,2)$ for sweep 0, but $(2,0,0,0)$ for sweep 1. There will be $d^{\mathrm{out}}_{i}$ different $\bm{\phi}_{i}$ vectors for each junction, resulting in:

$$S_{M_{i}}=D_{i}^{z_{i}d^{\mathrm{out}}_{i}} \tag{15}$$

• Type 3: In this technique, the constraint of cyclically accessing the left memories is also eliminated. Instead, any cycle can access any cell from each of the memories. This means that storing $\bm{\phi}_{i}$ is not enough; the entire sequence of memory accesses needs to be stored as a matrix $\bm{\Phi}_{i}\in\{0,1,\cdots,D_{i}-1\}^{D_{i}\times z_{i}}$ . In Fig. 13(c) for example, $\bm{\Phi}_{i}=((1,0,2,2),(0,2,1,0),(2,1,0,1))$ for sweep 0. Every sweep would also have a different $\bm{\Phi}_{i}$ , resulting in:

$$S_{M_{i}}=\left(D_{i}!\right)^{z_{i}d^{\mathrm{out}}_{i}} \tag{16}$$
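The difference between the cyclic accesses of types 1 and 2 and the free-form accesses of type 3 can be made concrete with a short sketch. It assumes that "cyclic" means the address of each memory advances by one (modulo $D_{i}$ ) every cycle, starting from the entry of $\bm{\phi}_{i}$ for that memory; both function names are hypothetical. Each row of an access matrix reads one cell from each of the $z_{i}$ memories, and the column check confirms that every cell of every memory is read exactly once per sweep.

```python
def cyclic_accesses(phi, depth):
    """Type 1/2 sweep: in cycle c, memory m is read at (phi[m] + c) mod depth.

    Assumption: cyclic access advances each address by one per cycle.
    """
    return [tuple((start + c) % depth for start in phi) for c in range(depth)]


def reads_every_cell_once(accesses, depth):
    """True if each memory (column) sees every address 0..depth-1 exactly once."""
    if len(accesses) != depth:
        return False
    num_memories = len(accesses[0])
    return all(sorted(row[m] for row in accesses) == list(range(depth))
               for m in range(num_memories))


if __name__ == "__main__":
    sweep_type1 = cyclic_accesses((1, 0, 2, 2), depth=3)
    print(sweep_type1, reads_every_cell_once(sweep_type1, 3))

    # The type 3 matrix quoted in the text for sweep 0 of Fig. 13(c).
    sweep_type3 = [(1, 0, 2, 2), (0, 2, 1, 0), (2, 1, 0, 1)]
    print(reads_every_cell_once(sweep_type3, 3))
```

Under this reading, a type 1 or type 2 sweep is fully determined by its starting vector, whereas a type 3 sweep needs the whole matrix to be stored.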

A technique that can be applied to all the types of clash-freedom is memory dithering, which is a permutation of the $z_{i}$ memories (i.e., the columns) in a bank. This permutation can change every sweep, as shown in Fig. 13(d). Memory dithering incurs an additional address computation storage cost because of the $z_{i}$ permutation, but increases $S_{M_{i}}$ by a factor $K_{i}$ . If $d^{\mathrm{in}}_{i}/z_{i}$ is an integer, an integral number of cycles are required to process each right neuron. Since a cycle accesses all memories, dithering has no effect and $K_{i}=1$ . On the other hand, if $z_{i}/d^{\mathrm{in}}_{i}$ is an integer greater than 1, the effects of dithering on connectivity patterns are only observed when switching from one right neuron to the next within a cycle. This results in

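$$K_{i}=\left(\frac{z_{i}!}{\left(d^{\mathrm{in}}_{i}!\right)^{z_{i}/d^{\mathrm{in}}_{i}}}\right)^{d^{\mathrm{out}}_{i}}$$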

for types 2 and 3, and the $d^{\mathrm{out}}_{i}$ exponent is omitted for type 1 since the access pattern does not change across sweeps.

When neither $z_{i}$ nor $d^{\mathrm{in}}_{i}$ perfectly divides the other, an exact value of $K_{i}$ is hard to arrive at since some proper or improper fraction of right neurons is processed every cycle. In such cases, $K_{i}$ is upper-bounded by ${\left(z_{i}!\right)}^{d^{\mathrm{out}}_{i}}$ .

Table III compares the count of possible left memory access patterns and associated storage cost for computing memory addresses for types 1–3, with and without memory dither. The junction used is the same as in Fig. 4, except $N_{i}$ is raised to 12 such that $d^{\mathrm{in}}_{i}$ becomes 2 and allows us to better show the effects of memory dithering.
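The counts discussed in this appendix grow very quickly, which a short calculation makes tangible. The sketch below assumes the closed forms $D_{i}^{z_{i}}$ , $D_{i}^{z_{i}d^{\mathrm{out}}_{i}}$ and $(D_{i}!)^{z_{i}d^{\mathrm{out}}_{i}}$ for types 1, 2 and 3, as implied by the descriptions above, and brute-forces the per-sweep dither factor for the case where $z_{i}/d^{\mathrm{in}}_{i}$ is an integer. The function names and the values $D_{i}=3$ , $z_{i}=4$ , $d^{\mathrm{in}}_{i}=d^{\mathrm{out}}_{i}=2$ are illustrative assumptions, not figures quoted from Table III.

```python
from itertools import permutations
from math import factorial


def access_pattern_counts(depth, z, d_out):
    """Assumed closed forms for the number of left memory access patterns."""
    return {
        "type1": depth ** z,                        # one start vector phi
        "type2": depth ** (z * d_out),              # one phi per sweep
        "type3": factorial(depth) ** (z * d_out),   # one full matrix per sweep
    }


def dither_factor_per_sweep(z, d_in):
    """Count, by brute force, how many distinct ways a permutation of the z
    memories splits them into consecutive groups of d_in, one group per right
    neuron processed in a cycle. Requires d_in to divide z."""
    if z % d_in:
        raise ValueError("this sketch only covers integer z / d_in")
    groupings = set()
    for perm in permutations(range(z)):
        groups = tuple(frozenset(perm[k:k + d_in]) for k in range(0, z, d_in))
        groupings.add(groups)
    return len(groupings)


if __name__ == "__main__":
    print(access_pattern_counts(depth=3, z=4, d_out=2))
    per_sweep = dither_factor_per_sweep(z=4, d_in=2)
    print(per_sweep, factorial(4) // factorial(2) ** (4 // 2))
```

The brute-force count agrees with the multinomial expression $z_{i}!/(d^{\mathrm{in}}_{i}!)^{z_{i}/d^{\mathrm{in}}_{i}}$ for these values, since reordering memories within the group feeding a single right neuron does not change which left neurons it connects to.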

Bibliography


[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Advances in Neural Information Processing Systems 25 (NIPS), 2012, pp. 1097–1105.
[2] A. Coates, B. Huval, T. Wang, D. J. Wu, A. Y. Ng, and B. Catanzaro, “Deep learning with COTS HPC systems,” in Proc. 30th Int. Conf. Machine Learning (ICML), vol. 28, 2013, pp. III–1337–III–1345.
[3] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural networks,” in Proc. Advances in Neural Information Processing Systems 28 (NIPS), 2015, pp. 1135–1143.
[4] N. P. Jouppi, C. Young, N. Patil et al., “In-datacenter performance analysis of a tensor processing unit,” in 2017 ACM/IEEE 44th Annu. Int. Symp. Computer Architecture (ISCA), June 2017.
[5] C. Szegedy, W. Liu, Y. Jia et al., “Going deeper with convolutions,” in IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.
[6] Y. Gong, L. Liu, M. Yang, and L. D. Bourdev, “Compressing deep convolutional networks using vector quantization,” in arXiv preprint arXiv:1412.6115, 2014.
[7] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, “Compressing neural networks with the hashing trick,” in Proc. 32nd Int. Conf. Machine Learning (ICML), 2015.
[8] S. Han, H. Mao, and W. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” in Proc. Int. Conf. Learning Representations (ICLR), 2016.