RadiX-Net: Structured Sparse Matrices for Deep Neural Networks

Ryan A. Robinett; Jeremy Kepner

arXiv:1905.00416·cs.LG·December 3, 2019


TL;DR

This paper introduces RadiX-Nets, a new class of structured sparse neural network topologies that are more diverse than previous X-Net designs, aiming to match the expressive power of dense networks with lower resource requirements.

Contribution

The paper proposes a deterministic algorithm for generating RadiX-Net topologies, enhancing diversity while maintaining the properties of sparse neural networks like X-Nets.

Findings

1. RadiX-Nets are more diverse than X-Net topologies.

2. They can potentially match the expressive power of dense networks.

3. The paper presents a conjecture on the expressive capacity of sparse topologies.



Equations

$$\big\{(n_{1},\ldots,n_{L})\mid n_{i}\in\{0,\ldots,N_{i}-1\}\big\}$$

$$(n_{1},\ldots,n_{L})\longleftrightarrow\sum_{i=1}^{L}\left(n_{i}\prod_{j=1}^{i-1}N_{j}\right).$$

$$\left(\begin{array}{c|c}\mathbf{0}_{m_{i},m_{i}}&\mathbf{W}_{i}\\ \hline\mathbf{0}_{n_{i},m_{i}}&\mathbf{0}_{n_{i},n_{i}}\end{array}\right)$$

$$A^{n}=\left(\begin{array}{c|c}\mathbf{0}_{n,M-n}&m\mathbf{1}_{n,n}\\ \hline\mathbf{0}_{M-n,M-n}&\mathbf{0}_{M-n,n}\end{array}\right),$$

$$\mathbf{W}_{i}=\sum_{j=0}^{N_{i}-1}\mathbf{P}^{j\nu_{i}},$$

$$\left(\begin{array}{ccc|c}0&\ldots&0&1\\ \hline&&&0\\ &\mathbf{I}_{N^{\prime}-1}&&\vdots\\ &&&0\end{array}\right),$$

$$\overline{W}=(\mathbf{W}_{1}^{*}\otimes\mathbf{W}_{1},\ldots,\mathbf{W}_{\overline{M}}^{*}\otimes\mathbf{W}_{\overline{M}})$$

$$(\overline{N}_{1},\ldots,\overline{N}_{\overline{M}}):=(N_{1,1},\ldots,N_{1,L_{1}},N_{2,1},\ldots,N_{M,L_{M}}),$$

$$\Delta_{G}=\left(\frac{1}{N^{\prime}}\right)\left(\frac{\sum_{i=1}^{\overline{M}}N_{i}D_{i-1}D_{i}}{\sum_{i=1}^{\overline{M}}D_{i-1}D_{i}}\right).$$

$$\Delta_{G}\approx\frac{\mu}{N^{\prime}}.$$

$$\Delta_{G}\approx\frac{1}{\mu^{d-1}}.$$

$$G(\vec{x})=\sum_{j=1}^{N}\alpha_{j}\sigma(\vec{y}_{j}^{T}\vec{x}+\theta_{j}),$$

$$\varphi_{v}(x)=\sigma\left(\Theta(v)+\sum_{(u,v)\in\tilde{E}(v)}W(u,v)\,\varphi_{u}(x)\right);$$

$$\varphi(x)=(\varphi_{u_{1}}(x),\ldots,\varphi_{u_{\lvert U_{m}\rvert}}(x)).$$

$$\delta(\mathbb{X})=\sup_{f\in\mathcal{C}_{n}}\left[\inf_{g\in\mathbb{X}}\big(d(f,g)\big)\right].$$

$$\left(\begin{array}{c|ccc}&\mathbf{W}^{*}_{1}\otimes\mathbf{W}_{1}&&\\ \mathbf{0}_{\kappa-\beta,\alpha}&&\ddots&\\ &&&\mathbf{W}^{*}_{\overline{M}}\otimes\mathbf{W}_{\overline{M}}\\ \hline\mathbf{0}_{\beta,\alpha}&&\mathbf{0}_{\beta,\kappa-\alpha}&\end{array}\right).$$

$$\mathbf{A}^{\overline{M}}=\left(\begin{array}{c|c}\mathbf{0}_{\alpha,\kappa-\beta}&\left(\prod_{i=1}^{\overline{M}}\mathbf{W}^{*}_{i}\right)\otimes\left(\prod_{i=1}^{\overline{M}}\mathbf{W}_{i}\right)\\ \hline\mathbf{0}_{\kappa-\alpha,\kappa-\beta}&\mathbf{0}_{\kappa-\alpha,\beta}\end{array}\right)$$

$$\prod_{i=1}^{\overline{M}}\mathbf{W}_{i}^{*}=\left(\prod_{i=1}^{\overline{M}-1}D_{i}\right)\left(\mathbf{1}_{D_{0},D_{\overline{M}}}\right),$$

$$\prod_{i=1}^{\overline{M}}\mathbf{W}_{i}=(N^{\prime})^{\overline{M}-1}\left(\mathbf{1}_{N^{\prime},N^{\prime}}\right).$$

$$\mathbf{A}^{\overline{M}}=\left(\begin{array}{c|c}\mathbf{0}_{\alpha,\kappa-\beta}&\left(N^{\prime}\right)^{\overline{M}-1}\left(\prod_{i=1}^{\overline{M}-1}D_{i}\right)\left(\mathbf{1}_{\alpha,\beta}\right)\\ \hline\mathbf{0}_{\kappa-\alpha,\kappa-\beta}&\mathbf{0}_{\kappa-\alpha,\beta}\end{array}\right).$$


Ryan A. Robinett$^{1}$ and Jeremy Kepner$^{1,2}$

$^{1}$MIT Department of Mathematics, $^{2}$MIT Lincoln Laboratory Supercomputing Center

Abstract

The sizes of deep neural networks (DNNs) are rapidly outgrowing the capacity of hardware to store and train them. Research over the past few decades has explored the prospect of sparsifying DNNs before, during, and after training by pruning edges from the underlying topology. The resulting neural network is known as a sparse neural network. More recent work has demonstrated the remarkable result that certain sparse DNNs can train to the same precision as dense DNNs at lower runtime and storage cost. An intriguing class of these sparse DNNs is the X-Nets, which are initialized and trained upon a sparse topology with neither reference to a parent dense DNN nor subsequent pruning. We present an algorithm that deterministically generates RadiX-Nets: sparse DNN topologies that, as a whole, are much more diverse than X-Net topologies, while preserving X-Nets’ desired characteristics. We further present a functional-analytic conjecture based on the longstanding observation that sparse neural network topologies can attain the same expressive power as dense counterparts.

Index Terms:

sparse neural networks, sparse matrices, artificial intelligence

I Introduction

This material is based in part upon work supported by the NSF under grant number DMS-1312831. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

As research in artificial neural networks progresses, the sizes of state-of-the-art deep neural network (DNN) architectures put increasing strain on the hardware needed to implement them [1, 2]. In the interest of reduced storage and runtime costs, much research over the past decade has focused on the sparsification of artificial neural networks [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]. In the listed resources alone, the methodology of sparsification includes Hessian-based pruning [3, 4], Hebbian pruning [5], matrix decomposition [9], and graph techniques [12, 10, 11, 13]. Yet all of these implementations are alike in that a DNN is initialized and trained, and then edges deemed unnecessary by certain criteria are pruned.

Unlike most strategies for creating sparse DNNs, the X-Net strategy presented in [14] is sparse “de novo”—that is, X-Nets are neural networks initialized upon sparse topologies. X-Nets are observed to train as well on various data sets as their dense counterparts, while exhibiting reduced memory usage [14, 15]. Further, by offering sparse alternatives to fully-connected and convolutional layers—X-Linear and X-Conv layers, respectively—X-Nets exhibit such performance on not only generalized DNN tasks, but also image recognition tasks canonically reserved for convolutional neural networks [9].

X-Net layers are constructed using properties of expander graphs, which give X-Nets the properties of sparsity and path-connectedness (see Mathematical Preliminaries) [14, 16]. Random X-Linear layers achieve path-connectedness probabilistically, while explicit X-Linear layers, constructed from Cayley graphs, aim to achieve path-connectedness deterministically [14]. As an artifact of their construction from Cayley graphs, explicit X-Linear layers are required to have the same number of nodes as adjacent layers. This constrains the kinds of X-Net topologies that may be constructed deterministically.

We propose RadiX-Nets as a new family of de novo sparse DNNs that deterministically achieve path-connectedness while allowing for diverse layer architectures. Instead of emulating Cayley graphs, RadiX-Nets achieve sparsity using properties of mixed-radix numeral systems, while allowing for diversity in network topology through the Kronecker product [17]. Additionally, RadiX-Nets satisfy symmetry, a property which both guarantees path connectedness and precludes inherent training bias in the underlying sparse DNN architecture.

II Mathematical Preliminaries

Understanding RadiX-Nets’ graph-theoretic construction and underlying mathematical properties requires defining a few concepts. RadiX-Nets are composed of sub-nets that are herein referred to as mixed-radix topologies. Mixed-radix topologies are based on properties of mixed-radix number systems, and can be constructed from overlapping decision trees (see Figure 1). A mixed-radix numeral system is the sole parameter used to uniquely specify a mixed-radix topology. Mixed-radix topologies are a kind of feedforward neural net topology (FNNT), which is a layered graph wherein all vertices in one layer point only to some number of vertices in the next. The adjacency matrix of an FNNT is uniquely defined by the adjacency submatrices corresponding to each of its layers. Essentially, RadiX-Net topologies are constructed from Kronecker products of mixed-radix adjacency submatrices and dense DNN adjacency submatrices (see Figure 5). The main properties of interest in RadiX-Nets are path-connectedness—which ensures each output depends upon all inputs—and symmetry, which ensures that there is the same number of paths between each input and output.

Mixed-Radix Numeral System: Let $\mathcal{N}=(N_{1},\ldots,N_{L})$ be an ordered set of $L$ integers greater than 1. Let $N^{\prime}=\prod_{i=1}^{L}N_{i}$ . All such $\mathcal{N}$ implicitly define a numeral system which bijectively represents all integers in $\{0,\ldots,N^{\prime}-1\}$ . That is, the set of ordered sets

$$\big\{(n_{1},\ldots,n_{L})\mid n_{i}\in\{0,\ldots,N_{i}-1\}\big\}$$

maps bijectively to $\{0,\ldots,N^{\prime}-1\}$ by the map

$$(n_{1},\ldots,n_{L})\longleftrightarrow\sum_{i=1}^{L}\left(n_{i}\prod_{j=1}^{i-1}N_{j}\right).$$

Mixed-radix numeral systems arise naturally in numerous graph-theoretic constructions, such as decision trees (see Figure 1).
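The bijection above can be checked directly on a small case. The following is a minimal sketch, assuming the least-significant digit comes first as in the map above; the helper names `to_int` and `to_digits` are illustrative and do not come from the paper.

```python
from itertools import product

def to_int(digits, radices):
    """Map (n_1, ..., n_L) to sum_i n_i * prod_{j<i} N_j."""
    value, place = 0, 1
    for n, N in zip(digits, radices):
        if not 0 <= n < N:
            raise ValueError("digit out of range for its radix")
        value += n * place
        place *= N  # place is now prod_{j<=i} N_j
    return value

def to_digits(value, radices):
    """Inverse map: recover (n_1, ..., n_L) from an integer in {0, ..., N'-1}."""
    digits = []
    for N in radices:
        value, n = divmod(value, N)
        digits.append(n)
    return tuple(digits)

radices = (2, 3, 4)  # N = (N_1, N_2, N_3), so N' = 24
all_digits = product(*(range(N) for N in radices))
values = sorted(to_int(d, radices) for d in all_digits)
```

Enumerating every digit tuple and sorting the resulting integers recovers each value in $\{0,\ldots,N^{\prime}-1\}$ exactly once, which is the bijectivity claim for this choice of radices.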

Feedforward Neural Net Topology (FNNT): An FNNT $G$ with $n+1$ layers of nodes—including input and output layers—is an $(n+1)$ -partite directed graph with independent components $U_{0},\ldots,U_{n}$ satisfying the constraints that

• if there exists an edge from $u\in U_{i}$ to $v\in U_{j}$, then $j=i+1$, and

• the out-degree of $u\in U_{i}$ is nonzero for all $i<n$.

Adjacency Submatrix of an FNNT: Say $G$ is an FNNT. Let $G_{i}$ be the restriction of $G$ to the set of nodes $U_{i-1}\cup U_{i}$ and the set of edges from $U_{i-1}$ to $U_{i}$ in $G$ . We define $m_{i}=\lvert U_{i-1}\rvert$ and $n_{i}=\lvert U_{i}\rvert$ for all $i$ . Up to a permutation of indices, the adjacency matrix of $G_{i}$ is of the form

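$$\left(\begin{array}{c|c}\mathbf{0}_{m_{i},m_{i}}&\mathbf{W}_{i}\\ \hline\mathbf{0}_{n_{i},m_{i}}&\mathbf{0}_{n_{i},n_{i}}\end{array}\right)$$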

for some $\mathbf{W}_{i}$ , where $\mathbf{0}_{a,b}$ is the $a\times b$ matrix of zeros. We refer to $\mathbf{W}_{i}$ as the adjacency submatrix of the restriction $G_{i}$ .

Conversely, say that an ordered set $\mathcal{W}=(\mathbf{W}_{1},\ldots,\mathbf{W}_{n})$ of matrices is such that

• the only nonzero entries of $\mathbf{W}_{i}$ are ones for all $i$, and

• no column of $\mathbf{W}_{i}$ is the zero vector.

If the number of columns in $\mathbf{W}_{i-1}$ equals the number of rows in $\mathbf{W}_{i}$ for all $i\in\{1,\ldots,n\}$ , then $\mathcal{W}$ defines a unique FNNT with $n+1$ layers of nodes.

Path-Connectedness: We define path-connectedness as follows: let $G$ be an FNNT with $n+1$ layers of nodes. $G$ is path-connected if, for every $u\in U_{0}$ and every $v\in U_{n}$ , there exists a path from $u$ to $v$ .

Symmetry: We define symmetry as follows: let $G$ be an FNNT with $n+1$ layers of nodes. $G$ is symmetric if there exists a positive integer $m$ such that, for all $u\in U_{0}$ and all $v\in U_{n}$ , there exist exactly $m$ paths from $u$ to $v$ . If $G$ is symmetric, it is path-connected. If $G$ has adjacency matrix $A$ , then $G$ satisfies symmetry if and only if, up to some permutation of $A$ ,

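$$A^{n}=\left(\begin{array}{c|c}\mathbf{0}_{n,M-n}&m\mathbf{1}_{n,n}\\ \hline\mathbf{0}_{M-n,M-n}&\mathbf{0}_{M-n,n}\end{array}\right),$$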

where $M$ is the number of nodes in $G$ , $\mathbf{1}_{a,b}$ is the $a\times b$ matrix of ones, and $m$ is some positive integer.
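Path-connectedness and symmetry can both be read off the product of the adjacency submatrices, since entry $(u,v)$ of $\mathbf{W}_{1}\cdots\mathbf{W}_{n}$ counts the paths from input node $u$ to output node $v$. The sketch below assumes rows index source nodes and columns index target nodes; the function names and the three toy topologies are hypothetical illustrations.

```python
import numpy as np

def path_counts(submatrices):
    """Multiply W_1 ... W_n; entry (u, v) is the number of paths from u to v."""
    counts = np.asarray(submatrices[0], dtype=np.int64)
    for W in submatrices[1:]:
        counts = counts @ np.asarray(W, dtype=np.int64)
    return counts

def is_path_connected(submatrices):
    """Every input reaches every output through at least one path."""
    return bool(np.all(path_counts(submatrices) > 0))

def is_symmetric(submatrices):
    """Every input-output pair is joined by the same positive number m of paths."""
    counts = path_counts(submatrices)
    return bool(counts.min() > 0 and counts.min() == counts.max())

# Three toy FNNTs with layer sizes 2, 2, 2.
dense = [np.ones((2, 2), dtype=int), np.ones((2, 2), dtype=int)]
lopsided = [np.array([[1, 1], [1, 1]]), np.array([[1, 1], [0, 1]])]
disconnected = [np.eye(2, dtype=int), np.eye(2, dtype=int)]
```

Here `dense` is symmetric with $m=2$, `lopsided` is path-connected but not symmetric, and `disconnected` fails both properties, which illustrates that symmetry is strictly stronger than path-connectedness.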

Density of an FNNT: An ordered collection $(U_{0},\ldots,U_{n})$ of sets of nodes implicitly defines a unique, fully-connected DNN topology, namely the FNNT such that, for all $i\in\{1,\ldots,n\}$, there exists an edge from $u$ to $v$ for all $u\in U_{i-1}$ and all $v\in U_{i}$. The number of edges in this DNN topology is equal to $\sum_{i=1}^{n}\lvert U_{i-1}\rvert\lvert U_{i}\rvert$. We define the density of an FNNT $G$ as the ratio of the number of edges in $G$ to the number of edges in the DNN topology defined by the ordered set of independent components of $G$. By this construction, the highest possible density of an FNNT is one, while the lowest is $\frac{\sum_{i=1}^{n}\lvert U_{i-1}\rvert}{\sum_{i=1}^{n}\lvert U_{i-1}\rvert\lvert U_{i}\rvert}$, attained when every non-output node has exactly one outgoing edge.
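As a small numerical illustration of this definition, the sketch below computes the density of an FNNT from its adjacency submatrices and compares it with the lower bound. The function names and the example topology (three layers of four nodes, two outgoing edges per node) are assumptions made for illustration.

```python
import numpy as np

def density(submatrices):
    """Edges present divided by edges in the dense topology on the same layers."""
    edges = sum(int(np.count_nonzero(W)) for W in submatrices)
    dense_edges = sum(W.shape[0] * W.shape[1] for W in submatrices)
    return edges / dense_edges

def min_density(layer_sizes):
    """Lower bound: each non-output node keeps exactly one outgoing edge."""
    numerator = sum(layer_sizes[:-1])
    denominator = sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))
    return numerator / denominator

# Each node points to the node with the same label and to its cyclic neighbour.
W = (np.eye(4, dtype=int) + np.roll(np.eye(4, dtype=int), 1, axis=1))
example = [W, W]
example_density = density(example)       # 16 edges out of 32 possible
lower_bound = min_density((4, 4, 4))     # 8 edges out of 32 possible
```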

III RadiX-Net Topologies

III-A Constructing RadiX-Net Topologies

We construct RadiX-Net topologies using mixed-radix topologies as building blocks, as motivated by Figure 2.

Mixed-Radix Topologies: Let $L$ be a positive integer, and let $\mathcal{N}=(N_{1},\ldots,N_{L})$, where $N_{i}$ is an integer greater than 1 for all $i$. Let $N^{\prime}=\prod_{N\in\mathcal{N}}N$, and let $U_{i}$ be a set of $N^{\prime}$ nodes, with labels $0,\ldots,N^{\prime}-1$, for all $i\in\{0,\ldots,L\}$. For all $i\in\{1,\ldots,L\}$, we create edges from node $j$ in $U_{i-1}$ to node $j+n\prod_{k=1}^{i-1}N_{k}\pmod{N^{\prime}}$ in $U_{i}$ for all $n\in\{0,\ldots,N_{i}-1\}$. Let $\mathbf{W}_{i}$ be the adjacency submatrix defining the edges from $U_{i-1}$ to $U_{i}$. By construction, we have that

$$\mathbf{W}_{i}=\sum_{j=0}^{N_{i}-1}\mathbf{P}^{j\nu_{i}},\tag{1}$$

where $\nu_{i}=\prod_{k=1}^{i-1}N_{k}$ and $\mathbf{P}$ is the permutation matrix

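$$\left(\begin{array}{ccc|c}0&\ldots&0&1\\ \hline&&&0\\ &\mathbf{I}_{N^{\prime}-1}&&\vdots\\ &&&0\end{array}\right),$$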

$\mathbf{I}_{n}$ being the $n\times n$ identity matrix. We refer to the resulting graph as the mixed-radix topology induced by $\mathcal{N}$ .
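The edge rule above translates almost line for line into code. The following is a minimal sketch that builds the adjacency submatrices directly from the rule "node $j$ connects to node $j+n\nu_{i}$ modulo $N^{\prime}$", assuming rows index source nodes; the function name `mixed_radix_submatrices` is hypothetical.

```python
import numpy as np
from math import prod

def mixed_radix_submatrices(radices):
    """Adjacency submatrices W_1, ..., W_L of the mixed-radix topology."""
    n_prime = prod(radices)
    submatrices = []
    nu = 1  # nu_i = N_1 * ... * N_{i-1}
    for N_i in radices:
        W = np.zeros((n_prime, n_prime), dtype=np.int64)
        for j in range(n_prime):
            for n in range(N_i):
                W[j, (j + n * nu) % n_prime] = 1
        submatrices.append(W)
        nu *= N_i
    return submatrices

Ws = mixed_radix_submatrices((2, 3, 4))   # N' = 24
paths = Ws[0] @ Ws[1] @ Ws[2]             # path counts from inputs to outputs
out_degrees = [int(W.sum(axis=1)[0]) for W in Ws]
```

For this choice of radices the product of the submatrices is the all-ones matrix, so every input reaches every output through exactly one path, while each node in layer $i-1$ has only $N_{i}$ outgoing edges.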

RadiX-Net Topologies: Here, we formally construct RadiX-Net topologies using mixed-radix topologies, adjacency submatrices, and the Kronecker product, as motivated by Figure 5. For an informal programmatic construction, see Figure 6.

RadiX-Net topologies are uniquely defined by an ordered set $\mathcal{N}^{*}=(\mathcal{N}_{1},\ldots,\mathcal{N}_{M})$ of mixed-radix numeral systems $\mathcal{N}_{i}=(N_{1}^{i},\ldots,N_{L_{i}}^{i})$ together with an ordered set $\mathcal{D}$ of positive integers. We require that

1. there exists a positive integer $N^{\prime}$ such that $N^{\prime}=\prod_{N\in\mathcal{N}_{i}}N$ for all $i\in\{1,\ldots,M-1\}$, and

2. $\prod_{N\in\mathcal{N}_{M}}N$ divides $N^{\prime}$.

Let $\overline{M}=\sum_{i=1}^{M}L_{i}$ , the total number of radices in $\mathcal{N}^{*}$ ; we further require that $\mathcal{D}=(D_{0},\ldots,D_{\overline{M}})$ consist of $\overline{M}+1$ integers satisfying $D_{i}\ll N^{\prime}$ for all $i$ .

We construct a RadiX-Net $G$ using $\mathcal{N}^{*}$ and $\mathcal{D}$ as follows: let $G_{i}$ be the mixed-radix topology induced by $\mathcal{N}_{i}$. Identifying the output nodes of $G_{i}$ with the input nodes of $G_{i+1}$ creates an $\overline{M}$-layer FNNT with ordered set $\mathcal{W}=(\mathbf{W}_{1},\ldots,\mathbf{W}_{\overline{M}})$ of adjacency submatrices of the form (1). (Footnote $\dagger$: We refer to such an FNNT as an extended mixed-radix topology; see Appendix.) Similarly, $\mathcal{D}$ implicitly defines a unique dense DNN topology $H$ on an ordered collection $U_{0},\ldots,U_{\overline{M}}$ of nodes satisfying $\lvert U_{i}\rvert=D_{i}$. The ordered set of adjacency submatrices of $H$ is $\mathcal{W}^{*}=(\mathbf{W}_{1}^{*},\ldots,\mathbf{W}_{\overline{M}}^{*})$, where $\mathbf{W}_{i}^{*}$ is the $D_{i-1}\times D_{i}$ matrix of ones. We define $G$ as the unique FNNT defined by

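$$\overline{W}=(\mathbf{W}_{1}^{*}\otimes\mathbf{W}_{1},\ldots,\mathbf{W}_{\overline{M}}^{*}\otimes\mathbf{W}_{\overline{M}})\tag{3}$$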

(see Mathematical Preliminaries).

Mixed-radix and RadiX-Net topologies satisfy symmetry, and therefore path-connectedness. Proofs for this assertion, as well as the number of paths from any node $u$ in the input layer to a node $v$ in the output layer for each family of topologies, can be found in the Appendix.
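To make the construction concrete, here is a minimal sketch that assembles a small RadiX-Net with `numpy.kron` and checks symmetry numerically. It assumes rows index source nodes and that every numeral system in the example has the same product $N^{\prime}$; the function names `mixed_radix_submatrices` and `radix_net` and the chosen parameters are illustrative, not the authors' implementation.

```python
import numpy as np
from math import prod

def mixed_radix_submatrices(radices, n_prime):
    """Adjacency submatrices of a mixed-radix topology on n_prime nodes per layer."""
    submatrices, nu = [], 1
    for N_i in radices:
        W = np.zeros((n_prime, n_prime), dtype=np.int64)
        for j in range(n_prime):
            for n in range(N_i):
                W[j, (j + n * nu) % n_prime] = 1
        submatrices.append(W)
        nu *= N_i
    return submatrices

def radix_net(systems, widths):
    """Layer i is the Kronecker product of a D_{i-1} x D_i ones matrix with W_i."""
    n_prime = prod(systems[0])
    sparse = [W for radices in systems
              for W in mixed_radix_submatrices(radices, n_prime)]
    if len(widths) != len(sparse) + 1:
        raise ValueError("need one more width than there are radices")
    return [np.kron(np.ones((widths[i], widths[i + 1]), dtype=np.int64), W)
            for i, W in enumerate(sparse)]

# N* = ((2, 2), (2, 2)) gives N' = 4; D = (1, 2, 2, 2, 1).
layers = radix_net(systems=((2, 2), (2, 2)), widths=(1, 2, 2, 2, 1))
paths = layers[0]
for W in layers[1:]:
    paths = paths @ W
```

In this instance every input-output pair is joined by the same number of paths, $32=N^{\prime}\cdot D_{1}D_{2}D_{3}$, so the topology is symmetric even though each layer keeps only half of the edges of its dense counterpart.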

III-B Asymptotic Sparsity of RadiX-Nets

Say $G$ is the RadiX-Net topology generated by $\mathcal{N}^{*}=(\mathcal{N}_{1},\ldots,\mathcal{N}_{M}),\mathcal{D}=(D_{0},\ldots,D_{\overline{M}})$ . Further say $\mathcal{N}_{i}=(N_{i,1},\ldots,N_{i,L_{i}})$ for all $i$ , and let $N^{\prime}$ be the integer satisfying $N^{\prime}=\prod_{N\in\mathcal{N}_{i}}N$ for all $i\in\{1,\ldots,M-1\}$ . If we define

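$$(\overline{N}_{1},\ldots,\overline{N}_{\overline{M}}):=(N_{1,1},\ldots,N_{1,L_{1}},N_{2,1},\ldots,N_{M,L_{M}}),$$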

then the density $\Delta_{G}$ of $G$ is given by

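$$\Delta_{G}=\left(\frac{1}{N^{\prime}}\right)\left(\frac{\sum_{i=1}^{\overline{M}}N_{i}D_{i-1}D_{i}}{\sum_{i=1}^{\overline{M}}D_{i-1}D_{i}}\right).\tag{4}$$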

Let $\mu$ be the mean value of $\{\overline{N}_{i}\}$ . When $\{\overline{N}_{i}\}$ has sufficiently small variance, it follows immediately from (4) that

$$\Delta_{G}\approx\frac{\mu}{N^{\prime}}.$$

This implies that when $\{\overline{N}_{i}\}$ has small variance, the sparsity of $G$ is negligibly affected by $\{D_{i}\}$ .

We define $d=\log_{\mu}N^{\prime}$ . For sufficiently small variance of the $\overline{N}_{i}$ , we can assume that $d$ is approximately equal to some integer, with which we can write

$$\Delta_{G}\approx\frac{1}{\mu^{d-1}}.$$

Concretely, $\mu$ corresponds to the average radix of each mixed-radix numeral system used to construct $G$, and $d$ corresponds to the number of radices used to construct each mixed-radix numeral system. (Footnote $\ddagger$: Per bullet 2) in Section III.A, this excludes the last mixed-radix numeral system. Footnote $*$: Note that this assumption is contingent on $\{N_{i}\}$ having sufficiently small variance.) The effect of $\mu$ and $d$ on the sparsity of $G$ is shown in Figure 7.
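The density formula and its two approximations can be compared numerically. The sketch below evaluates (4), reading the radix in its numerator as the flattened sequence $\overline{N}_{i}$ defined above, for a case where every radix equals 2 so that the variance is zero and both approximations are exact. The function name `radix_net_density` and the chosen widths are assumptions for illustration.

```python
from math import log, prod

def radix_net_density(systems, widths):
    """Density of a RadiX-Net from its radices and dense widths, following (4)."""
    n_prime = prod(systems[0])
    radices = [N for system in systems for N in system]   # flattened radices
    if len(widths) != len(radices) + 1:
        raise ValueError("need one more width than there are radices")
    pairs = [widths[i] * widths[i + 1] for i in range(len(radices))]
    return sum(N * p for N, p in zip(radices, pairs)) / (n_prime * sum(pairs))

systems = ((2, 2, 2), (2, 2, 2))      # every radix is 2, so N' = 8
widths = (1, 3, 5, 2, 4, 3, 1)        # D_0, ..., D_6, chosen arbitrarily
exact = radix_net_density(systems, widths)

radices = [N for system in systems for N in system]
mu = sum(radices) / len(radices)      # mean radix
n_prime = prod(systems[0])
d = log(n_prime) / log(mu)            # d = log_mu N'
approx_mu = mu / n_prime              # Delta_G ~ mu / N'
approx_d = 1.0 / mu ** (d - 1)        # Delta_G ~ 1 / mu^(d-1)
```

With zero variance in the radices, the widths $D_{i}$ cancel out of the ratio entirely, which matches the remark that the sparsity of $G$ is negligibly affected by $\{D_{i}\}$ when the variance is small.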

IV Conclusions & Future Work

This paper presents the RadiX-Net algorithm, which deterministically generates sparse DNN topologies that, as a whole, are much more diverse than X-Net topologies while preserving X-Nets’ desired characteristics. In a related effort, benchmarking of RadiX-Net performance in comparison to X-Net, dense DNN, and other neural network implementations can be found in [15]. Furthermore, RadiX-Net is used in [18] to construct a neural net simulating the size and sparsity of the human brain.

Prabhu et al. and Alford et al. come at the end of a long history of sparse neural network research [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. This collective body mutually corroborates the following assertion: Sparse neural networks can train to the same arbitrary degree of precision as their dense counterparts. While the reduced training time of sparse neural nets can be attributed to having fewer parameters, there is no intuitive reason why sparse networks should demonstrate the same expressive power (as some have put it) as dense counterparts.

Naïvely, should sparse networks have the same expressive power as dense networks, dense and pruned networks would be obsolete, as de novo sparse networks achieve the expressive power of both while exceeding the training speed of both. Because the corpus of research in sparse networks seems unanimous on the subject, it would behoove the field to become more objective about what is meant when discussing expressive power, as is done in [19, 20, 21]. As demonstrated by [22], functional analysis provides a powerful language with which to describe the abilities and limitations of neural networks rigorously. In Section IV.B, we present a functional-analytic conjecture based on the mentioned experimental findings, which the authors intend to prove at a later date. Posing and proving such conjectures would direct future research in artificial neural networks more prudently than would experimental results alone.

IV-A Preliminaries for Conjecture

The most sturdy theoretical ground upon which artificial neural nets stand is Cybenko’s Universality Theorem. Though the original statement of the theorem is stronger than the corollary below, this corollary captures the significance of the Universality Theorem in the field of artificial neural networks.

Corollary.

Let $\sigma:\mathbb{R}\to\mathbb{R}$ be a continuous function such that $\lim_{t\to\infty}\sigma(t)=1$ and $\lim_{t\to-\infty}\sigma(t)=0$ (let us call this function sigmoidal). Further, let $\mathcal{C}_{n}$ be the space of continuous functions on $I_{n}=[0,1]^{n}$ with metric topology defined by supremum norm $d(f,g)=\sup_{\vec{x}\in I_{n}}\lvert f(\vec{x})-g(\vec{x})\rvert$ . Lastly, let $S$ be the set of functions of the form

$$G(\vec{x})=\sum_{j=1}^{N}\alpha_{j}\sigma(\vec{y}_{j}^{T}\vec{x}+\theta_{j}),$$

where $N$ is a natural number, $\alpha_{j}$ and $\theta_{j}$ are real numbers, and each $\vec{y}_{j}$ is an element of $\mathbb{R}^{n}$. The set $S$ is dense in $\mathcal{C}_{n}$. ∎
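A function in $S$ is simply a one-hidden-layer network with a linear read-out. The sketch below evaluates such a sum with the logistic function as one admissible sigmoidal $\sigma$, and combines two steep units into a bump on $[0,1]$; the specific coefficients are hypothetical and chosen only to make the behaviour easy to see.

```python
import numpy as np

def sigmoid(t):
    """Logistic function: continuous, tends to 1 at +inf and to 0 at -inf."""
    return 1.0 / (1.0 + np.exp(-t))

def G(x, alpha, Y, theta):
    """G(x) = sum_j alpha_j * sigma(y_j^T x + theta_j); row j of Y is y_j."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return float(alpha @ sigmoid(Y @ x + theta))

# n = 1, N = 2: the difference of two steep sigmoids is close to 1 on
# roughly [0.2, 0.8] and close to 0 near the endpoints of [0, 1].
alpha = np.array([1.0, -1.0])
Y = np.array([[50.0], [50.0]])
theta = np.array([-10.0, -40.0])
middle, left, right = G(0.5, alpha, Y, theta), G(0.0, alpha, Y, theta), G(1.0, alpha, Y, theta)
```

Sums of such bumps with more terms are the kind of building block that makes density of $S$ in $\mathcal{C}_{n}$ plausible, although the sketch itself proves nothing about approximation rates.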

We adopt some of the language of this corollary to make our conjecture connect more immediately to the literature.

Let $\sigma$ , $I_{n}$ , $\mathcal{C}_{n}$ , and $d$ be as defined above. We define a feedforward neural network (FNN) as an FNNT $G$ , with set of edges $E$ , together with a map $W:E\to\mathbb{R}$ assigning a weight $w$ to each edge and a map $\Theta:\bigcup_{i=1}^{m}U_{i}\to\mathbb{R}$ —where $m$ is the number of non-input layers in $G$ —assigning a bias $\theta$ to each non-input node. We associate with each FNN $\mathcal{G}$ the unique map $\varphi:\mathbb{R}^{\lvert U_{0}\rvert}\to\mathbb{R}^{\lvert U_{m}\rvert}$ defined by the following:

• let $\tilde{E}:\bigcup_{i=1}^{m}U_{i}\to E$ map each node $u$ to the set of edges going into $u$;

• for all $u_{i}\in U_{0}$, let $\varphi_{u_{i}}(x_{1},\ldots,x_{\lvert U_{0}\rvert})=x_{i}$;

• for all $i\in\{1,\ldots,m\}$ and for all $v\in U_{i}$, let

$$\varphi_{v}(x)=\sigma\left(\Theta(v)+\sum_{(u,v)\in\tilde{E}(v)}W(u,v)\,\varphi_{u}(x)\right);$$

• assuming $U_{m}=\{u_{1},\ldots,u_{\lvert U_{m}\rvert}\}$, we define

$$\varphi(x)=(\varphi_{u_{1}}(x),\ldots,\varphi_{u_{\lvert U_{m}\rvert}}(x)).$$
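The recursive definition of $\varphi$ amounts to a forward pass in which only edges present in the topology contribute. The sketch below is one way to evaluate it, assuming the FNNT is given as 0/1 masks with rows indexing source nodes and that $\sigma$ is the logistic function; the function name `fnn_forward` and the toy weights are hypothetical.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fnn_forward(x, masks, weights, biases):
    """Evaluate phi(x) layer by layer.

    masks[i]   : 0/1 adjacency submatrix from layer i to layer i+1
    weights[i] : real weights, same shape; entries off the mask are ignored
    biases[i]  : bias Theta(v) for every node v in layer i+1
    """
    phi = np.asarray(x, dtype=float)             # phi_u(x) = x_i on the input layer
    for mask, W, theta in zip(masks, weights, biases):
        phi = sigmoid(theta + (mask * W).T @ phi)  # sum over incoming edges only
    return phi

# Layer sizes 2 -> 2 -> 1; the edge from input 0 to hidden node 1 is absent.
masks = [np.array([[1, 0], [1, 1]]), np.array([[1], [1]])]
weights = [np.array([[1.0, 5.0], [2.0, 3.0]]), np.array([[1.0], [-1.0]])]
biases = [np.zeros(2), np.zeros(1)]
sparse_out = fnn_forward([1.0, 1.0], masks, weights, biases)
dense_out = fnn_forward([1.0, 1.0], [np.ones((2, 2)), np.ones((2, 1))], weights, biases)
```

With the masked edge removed, both hidden nodes receive the same pre-activation of 3, so the output node sees a difference of zero and returns $\sigma(0)=0.5$; restoring the edge changes the output, which shows that the mask, not only the weights, determines the represented function.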

Let $\mathcal{U}=(U_{0},U_{1},\ldots)$ be an infinite ordered collection of finite sets of nodes such that $\lvert U_{0}\rvert=n$. Let $\mathfrak{D}$ be the unique fully-connected FNNT on $\mathcal{U}$, and let $\mathcal{S}$ be some sparse FNNT on $\mathcal{U}$ satisfying symmetry. We define $\mathfrak{D}_{N}$ and $\mathcal{S}_{N}$ as the unique FNNTs constructed by restricting $\mathfrak{D}$ and $\mathcal{S}$, respectively, to the set of nodes $\bigcup_{i=0}^{N}U_{i}$, introducing a new node $v$, and creating an edge from $u$ to $v$ for all $u\in U_{N}$. Finally, let $\mathbb{D}_{N}$ and $\mathbb{S}_{N}$ be the sets of continuous functions which can be represented as FNNs on $\mathfrak{D}_{N}$ and $\mathcal{S}_{N}$, respectively.

IV-B Functional-Analytic Conjecture

Due to the findings of Prabhu et al., Alford et al., and others, we are convinced that de novo sparse neural network topologies exhibit the same expressive power as fully-connected DNN topologies in the following way.

Conjecture.

For all $\mathbb{X}\subset\mathcal{C}_{n}$ , we define

$$\delta(\mathbb{X})=\sup_{f\in\mathcal{C}_{n}}\left[\inf_{g\in\mathbb{X}}\big(d(f,g)\big)\right].$$

If $\delta(\mathbb{D}_{N})$ is in $O(N^{-p})$ for some $p$ , then $\delta(\mathbb{S}_{N})$ is also in $O(N^{-p})$ . ∎

Acknowledgment

The authors wish to acknowledge the following individuals for their contributions and support: Simon Alford, Alan Edelman, Vijay Gadepally, Chris Hill, Hayden Jananthan, Lauren Milechin, Richard Wang, and the MIT SuperCloud team.

Appendix

For purposes of simplifying Theorem 1, we use the following two lemmas. Lemma 2 discusses extended mixed-radix topologies, which we define as RadiX-Net topologies generated by $\mathcal{N}^{*},\mathcal{D}=(D_{0},\ldots,D_{\overline{M}})$ satisfying $D_{i}=1$ for all $i$ .

Lemma 1.

Mixed-radix topologies satisfy symmetry, and the number of paths from an input node $u$ to an output node $v$ is one.

Proof.

This follows directly from the definition of a mixed-radix numeral system. ∎
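The step from the numeral system to the path count can be spelled out as follows. This is a sketch of the omitted reasoning, using only the edge rule of Section III-A and the bijection of Section II, with $\nu_{i}=\prod_{k=1}^{i-1}N_{k}$ as before.

```latex
% A path from u in U_0 to U_L chooses one digit n_i at each layer i,
% so its endpoint is determined by the digit tuple (n_1, ..., n_L):
\[
  u \;\longmapsto\; u+\sum_{i=1}^{L} n_{i}\nu_{i} \pmod{N^{\prime}},
  \qquad n_{i}\in\{0,\ldots,N_{i}-1\}.
\]
% The mixed-radix map (n_1, ..., n_L) -> sum_i n_i nu_i is a bijection
% onto {0, ..., N'-1}, so exactly one digit tuple satisfies
\[
  \sum_{i=1}^{L} n_{i}\nu_{i} \;=\; (v-u) \bmod N^{\prime},
\]
% and therefore, for every input u and output v,
\[
  \#\{\text{paths from } u \text{ to } v\} \;=\; 1 .
\]
```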

Lemma 2.

Let $G$ be the extended mixed-radix (EMR) topology defined by some $\mathcal{N}^{*}=(\mathcal{N}_{1},\ldots,\mathcal{N}_{M})$ satisfying the RadiX-Net constraints (see Section III: RadiX-Net Topologies). $G$ satisfies symmetry, and the number of paths from an input node $u$ to an output node $v$ is $(N^{\prime})^{M-1}$ , where $N^{\prime}$ is the integer satisfying $N^{\prime}=\prod_{N\in\mathcal{N}_{i}}N$ for all $i\in\{1,\ldots,M-1\}$ .

Proof.

We show this by induction. Say that, for some positive integer $M$ , all EMR topologies $G$ defined by some $\mathcal{N}^{*}=(\mathcal{N}_{1},\ldots,\mathcal{N}_{M})$ satisfy symmetry. Let $\mathcal{N}_{+}^{*}=(\mathcal{N}_{1},\ldots,\mathcal{N}_{M},\mathcal{N}_{M+1})$ for some $\mathcal{N}_{M+1}$ satisfying the RadiX-Net constraints, and let $G_{+}$ be the EMR topology induced by $\mathcal{N}_{+}^{*}$ . Recall that $G_{+}$ is formed from the disjoint union of the MR topologies $G_{i}$ (generated by $\mathcal{N}_{i}$ ) by identifying $U_{i-1,L_{i-1}}$ and $U_{i,0}$ for all $i$ (here, $U_{i,L_{i}}$ and $U_{i,0}$ simply refer to the output and input layers, respectively, of $G_{i}$ ). Because $G_{M+1}$ is an MR topology, Lemma 1 guarantees that there exists exactly one path from $u$ to $v$ for all $u\in U_{M+1,0}$ and all $v\in U_{M+1,L_{M+1}}$ . By hypothesis, for some positive integer $m$ , there exist exactly $m$ paths from $\tilde{u}\in U_{1,0}$ to $\tilde{v}\in U_{M,L_{M}}$ for all such $\tilde{u},\tilde{v}$ . Because $U_{M,L_{M}}$ and $U_{M+1,0}$ are identified, this implies that for every path from $\tilde{u}\in U_{1,0}$ to $\tilde{v}\in U_{M,L_{M}}$ , there exists exactly one path from $\tilde{u}$ to $v\in U_{M+1,L_{M+1}}$ which passes through $\tilde{v}$ . Further, because there are $\lvert U_{M+1,0}\rvert$ such $\tilde{v}$ , there exist exactly $m\lvert U_{M+1,0}\rvert$ paths from $\tilde{u}$ to $v$ for all choices of $\tilde{u},v$ . By induction from the case $M=1$ (i.e. Lemma 1), $G_{+}$ satisfies symmetry, and $m=\prod_{i=2}^{M}\lvert U_{i,0}\rvert=(N^{\prime})^{M-1}$ . ∎

Theorem 1.

Let $G$ be the RadiX-Net topology defined by some $\mathcal{N}^{*}=(\mathcal{N}_{1},\ldots,\mathcal{N}_{M}),\mathcal{D}=(D_{0},\ldots,D_{\overline{M}})$ satisfying the RadiX-Net constraints. We order the layers $\overline{U}_{0},\ldots,\overline{U}_{\overline{M}}$ of $G$ in the natural way, where $\overline{U}_{0}$ and $\overline{U}_{\overline{M}}$ are the input and output layers, respectively, of $G$ . $G$ satisfies symmetry, and the number of paths from input node $u$ to output node $v$ is given by $(N^{\prime})^{\overline{M}-1}\left(\prod_{i=1}^{\overline{M}-1}D_{i}\right)$ , where $N^{\prime}$ is the integer satisfying $N^{\prime}=\prod_{N\in\mathcal{N}_{i}}N$ for all $i\in\{1,\ldots,M-1\}$ .

Proof.

Let $\mathbf{A}$ be the adjacency matrix of $G$ , and let $\mathbf{W}^{*}_{i},\mathbf{W}_{i}$ be as defined in (3). We define $\kappa=N^{\prime}\sum_{i=0}^{\overline{M}}D_{i}$ , $\alpha=N^{\prime}D_{0}$ , and $\beta=N^{\prime}D_{\overline{M}}$ . Up to a permutation, $\mathbf{A}$ is of the form

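$$\left(\begin{array}{c|ccc}&\mathbf{W}^{*}_{1}\otimes\mathbf{W}_{1}&&\\ \mathbf{0}_{\kappa-\beta,\alpha}&&\ddots&\\ &&&\mathbf{W}^{*}_{\overline{M}}\otimes\mathbf{W}_{\overline{M}}\\ \hline\mathbf{0}_{\beta,\alpha}&&\mathbf{0}_{\beta,\kappa-\alpha}&\end{array}\right).$$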

Therefore, the following statements hold.

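$$\mathbf{A}^{\overline{M}}=\left(\begin{array}{c|c}\mathbf{0}_{\alpha,\kappa-\beta}&\left(\prod_{i=1}^{\overline{M}}\mathbf{W}^{*}_{i}\right)\otimes\left(\prod_{i=1}^{\overline{M}}\mathbf{W}_{i}\right)\\ \hline\mathbf{0}_{\kappa-\alpha,\kappa-\beta}&\mathbf{0}_{\kappa-\alpha,\beta}\end{array}\right)$$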

The deduction above is a consequence of the mixed-product property of the Kronecker product [17]. It is easy to show that

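$$\prod_{i=1}^{\overline{M}}\mathbf{W}_{i}^{*}=\left(\prod_{i=1}^{\overline{M}-1}D_{i}\right)\left(\mathbf{1}_{D_{0},D_{\overline{M}}}\right),$$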

where $\mathbf{1}_{a,b}$ is the $a\times b$ matrix of ones. By Lemma 2, it holds that

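$$\prod_{i=1}^{\overline{M}}\mathbf{W}_{i}=(N^{\prime})^{\overline{M}-1}\left(\mathbf{1}_{N^{\prime},N^{\prime}}\right).$$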

Therefore,

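$$\mathbf{A}^{\overline{M}}=\left(\begin{array}{c|c}\mathbf{0}_{\alpha,\kappa-\beta}&\left(N^{\prime}\right)^{\overline{M}-1}\left(\prod_{i=1}^{\overline{M}-1}D_{i}\right)\left(\mathbf{1}_{\alpha,\beta}\right)\\ \hline\mathbf{0}_{\kappa-\alpha,\kappa-\beta}&\mathbf{0}_{\kappa-\alpha,\beta}\end{array}\right).$$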

So $G$ satisfies symmetry, and for all input nodes $u$ and output nodes $v$ , there exist exactly $\left(N^{\prime}\right)^{\overline{M}-1}\left(\prod_{i=1}^{\overline{M}-1}D_{i}\right)$ paths from $u$ to $v$ . ∎
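The proof leans on two algebraic facts: the mixed-product property $(\mathbf{A}\otimes\mathbf{B})(\mathbf{C}\otimes\mathbf{D})=(\mathbf{A}\mathbf{C})\otimes(\mathbf{B}\mathbf{D})$ and the rule for multiplying all-ones matrices. Both can be spot-checked numerically; the sketch below does so on arbitrary small integer matrices, whose shapes and random seed are chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Mixed-product property: (A kron B)(C kron D) = (AC) kron (BD),
# valid whenever the products AC and BD are defined.
A = rng.integers(0, 3, size=(2, 3))
C = rng.integers(0, 3, size=(3, 2))
B = rng.integers(0, 3, size=(4, 4))
D = rng.integers(0, 3, size=(4, 4))
lhs = np.kron(A, B) @ np.kron(C, D)
rhs = np.kron(A @ C, B @ D)

# Product of all-ones matrices: 1_{a,b} 1_{b,c} = b * 1_{a,c}, which is
# where the factor prod_i D_i in the path count comes from.
ones_product = np.ones((2, 3), dtype=int) @ np.ones((3, 4), dtype=int)
```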

Bibliography

[1] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–9, June 2015.
[2] J. Kepner, V. Gadepally, H. Jananthan, L. Milechin, and S. Samsi, “Sparse deep neural network exact solutions,” in High Performance Extreme Computing Conference (HPEC), IEEE, 2018.
[3] Y. Le Cun, J. S. Denker, and S. A. Solla, “Optimal brain damage,” in Advances in Neural Information Processing Systems, pp. 598–605, 1990.
[4] B. Hassibi and D. G. Stork, “Second order derivatives for network pruning: Optimal brain surgeon,” in Advances in Neural Information Processing Systems, pp. 164–171, 1993.
[5] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[6] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size,” arXiv preprint arXiv:1602.07360, 2016.
[7] S. Srinivas and R. V. Babu, “Data-free parameter pruning for deep neural networks,” CoRR, vol. abs/1507.06149, 2015.
[8] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural network with pruning, trained quantization and Huffman coding,” CoRR, vol. abs/1510.00149, 2015.
