RadiX-Net: Structured Sparse Matrices for Deep Neural Networks
Ryan A. Robinett, Jeremy Kepner

TL;DR
This paper introduces RadiX-Nets, a new class of structured sparse neural network topologies that are more diverse than previous X-Net designs, aiming to match the expressive power of dense networks with lower resource requirements.
Contribution
The paper proposes a deterministic algorithm for generating RadiX-Net topologies, enhancing diversity while maintaining the properties of sparse neural networks like X-Nets.
Findings
RadiX-Nets are more diverse than X-Net topologies.
They can potentially match the expressive power of dense networks.
The paper presents a conjecture on the expressive capacity of sparse topologies.
RadiX-Net: Structured Sparse Matrices for Deep Neural Networks
Ryan A. Robinett¹ and Jeremy Kepner¹,²
¹MIT Department of Mathematics, ²MIT Lincoln Laboratory Supercomputing Center
Abstract
The sizes of deep neural networks (DNNs) are rapidly outgrowing the capacity of hardware to store and train them. Research over the past few decades has explored the prospect of sparsifying DNNs before, during, and after training by pruning edges from the underlying topology. The resulting neural network is known as a sparse neural network. More recent work has demonstrated the remarkable result that certain sparse DNNs can train to the same precision as dense DNNs at lower runtime and storage cost. An intriguing class of these sparse DNNs is the X-Nets, which are initialized and trained upon a sparse topology with neither reference to a parent dense DNN nor subsequent pruning. We present an algorithm that deterministically generates RadiX-Nets: sparse DNN topologies that, as a whole, are much more diverse than X-Net topologies, while preserving X-Nets’ desired characteristics. We further present a functional-analytic conjecture based on the longstanding observation that sparse neural network topologies can attain the same expressive power as dense counterparts.
Index Terms:
sparse neural networks, sparse matrices, artificial intelligence
I Introduction
This material is based in part upon work supported by the NSF under grant number DMS-1312831. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
As research in artificial neural networks progresses, the sizes of state-of-the-art deep neural network (DNN) architectures put increasing strain on the hardware needed to implement them [1, 2]. In the interest of reduced storage and runtime costs, much research over the past decade has focused on the sparsification of artificial neural networks [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]. In the listed resources alone, the methodology of sparsification includes Hessian-based pruning [3, 4], Hebbian pruning [5], matrix decomposition [9], and graph techniques [12, 10, 11, 13]. Yet all of these implementations are alike in that a DNN is initialized and trained, and then edges deemed unnecessary by certain criteria are pruned.
Unlike most strategies for creating sparse DNNs, the X-Net strategy presented in [14] is sparse “de novo”—that is, X-Nets are neural networks initialized upon sparse topologies. X-Nets are observed to train as well on various data sets as their dense counterparts, while exhibiting reduced memory usage [14, 15]. Further, by offering sparse alternatives to fully-connected and convolutional layers—X-Linear and X-Conv layers, respectively—X-Nets exhibit such performance on not only generalized DNN tasks, but also image recognition tasks canonically reserved for convolutional neural networks [9].
X-Net layers are constructed using properties of expander graphs, which give X-Nets the properties of sparsity and path-connectedness (see Mathematical Preliminaries) [14, 16]. Random X-Linear layers achieve path-connectedness probabilistically, while explicit X-Linear layers, constructed from Cayley graphs, aim to achieve path-connectedness deterministically [14]. As an artifact of their construction from Cayley graphs, explicit X-Linear layers are required to have the same number of nodes as adjacent layers. This constrains the kinds of X-Net topologies which may be constructed deterministically.
We propose RadiX-Nets as a new family of de novo sparse DNNs that deterministically achieve path-connectedness while allowing for diverse layer architectures. Instead of emulating Cayley graphs, RadiX-Nets achieve sparsity using properties of mixed-radix numeral systems, while allowing for diversity in network topology through the Kronecker product [17]. Additionally, RadiX-Nets satisfy symmetry, a property which both guarantees path-connectedness and precludes inherent training bias in the underlying sparse DNN architecture.
II Mathematical Preliminaries
Understanding RadiX-Nets’ graph-theoretic construction and underlying mathematical properties requires defining a few concepts. RadiX-Nets are composed of sub-nets that are herein referred to as mixed-radix topologies. Mixed-radix topologies are based on properties of mixed-radix number systems, and can be constructed from overlapping decision trees (see Figure 1). A mixed-radix numeral system is the sole parameter used to uniquely specify a mixed-radix topology. Mixed-radix topologies are a kind of feedforward neural net topology (FNNT), which is a layered graph wherein all vertices in one layer point only to some number of vertices in the next. The adjacency matrix of an FNNT is uniquely defined by the adjacency submatrices corresponding to each of its layers. Essentially, RadiX-Net topologies are constructed from Kronecker products of mixed-radix adjacency submatrices and dense DNN adjacency submatrices (see Figure 5). The main properties of interest in RadiX-Nets are path-connectedness—which ensures each output depends upon all inputs—and symmetry, which ensures that there is the same number of paths between each input and output.
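Mixed-Radix Numeral System: Let $\mathcal{N} = (N_1, \ldots, N_L)$ be an ordered set of $L$ integers greater than 1. Let $N' = \prod_{i=1}^{L} N_i$. All such $\mathcal{N}$ implicitly define a numeral system which bijectively represents all integers in $\{0, \ldots, N' - 1\}$. That is, the set of ordered sets
$$\{(n_1, \ldots, n_L) \mid n_i \in \{0, \ldots, N_i - 1\}\}$$
maps bijectively to $\{0, \ldots, N' - 1\}$ by the map
$$(n_1, \ldots, n_L) \mapsto \sum_{i=1}^{L} n_i \prod_{j=1}^{i-1} N_j.$$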
Mixed-radix numeral systems arise naturally in numerous graph-theoretic constructions, such as decision trees (see Figure 1).
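The bijection described above can be checked numerically. The following is a minimal sketch, assuming the usual place-value convention in which the first radix is the least significant digit; the function names `to_mixed_radix` and `from_mixed_radix` are illustrative and do not come from the paper.

```python
from itertools import product


def to_mixed_radix(value, radices):
    """Return the digits of `value`, least-significant digit first."""
    digits = []
    for radix in radices:
        value, digit = divmod(value, radix)
        digits.append(digit)
    return tuple(digits)


def from_mixed_radix(digits, radices):
    """Inverse map: sum of digit_i times the product of all earlier radices."""
    value, place = 0, 1
    for digit, radix in zip(digits, radices):
        value += digit * place
        place *= radix
    return value


radices = (2, 3, 4)  # the product of the radices is 24
all_digits = list(product(*(range(r) for r in radices)))
values = sorted(from_mixed_radix(d, radices) for d in all_digits)
```

Every digit tuple maps to a distinct integer below the product of the radices, which is the property the mixed-radix topologies rely on.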
Feedforward Neural Net Topology (FNNT): An FNNT with $n+1$ layers of nodes (including input and output layers) is an $(n+1)$-partite directed graph $G$ with independent components $U_0, \ldots, U_n$ satisfying the constraints that
- if there exists an edge from $u \in U_i$ to $v \in U_j$, then $j = i + 1$, and
- the out-degree of $u$ is nonzero for all $u \in U_0 \cup \cdots \cup U_{n-1}$.
Adjacency Submatrix of an FNNT: Say is an FNNT. Let be the restriction of to the set of nodes and the set of edges from to in . We define and for all . Up to a permutation of indices, the adjacency matrix of is of the form
[TABLE]
for some , where is the matrix of zeros. We refer to as the adjacency submatrix of the restriction .
Conversely, say that an ordered set of matrices is such that
- •
the only nonzero entries of are ones for all , and
- •
no column of is the zero vector.
If the number of columns in equals the number of rows in for all , then defines a unique FNNT with layers of nodes.
Path-Connectedness: We define path-connectedness as follows: let $G$ be an FNNT with $n+1$ layers of nodes $U_0, \ldots, U_n$. $G$ is path-connected if, for every $u \in U_0$ and every $v \in U_n$, there exists a path from $u$ to $v$.
Symmetry: We define symmetry as follows: let be an FNNT with layers of nodes. is symmetric if there exists a positive integer such that, for all and all , there exist exactly paths from to . If is symmetric, it is path-connected. If has adjacency matrix , then satisfies symmetry if and only if, up to some permutation of ,
[TABLE]
where is the number of nodes in , is the matrix of ones, and is some positive integer.
Density of an FNNT: An ordered collection of sets of nodes $(U_0, \ldots, U_n)$ implicitly defines a unique, fully-connected DNN topology: namely, the FNNT such that, for all $i \in \{1, \ldots, n\}$, there exists an edge from $u$ to $v$ for all $u \in U_{i-1}$ and all $v \in U_i$. The number of edges in this DNN topology is equal to $\sum_{i=1}^{n} |U_{i-1}||U_i|$. We define the density of an FNNT $G$ as the ratio of the number of edges in $G$ to the number of edges in the DNN topology defined by the ordered set of independent components of $G$. By this construction, the highest possible density of an FNNT is one, while the lowest is $\sum_{i=0}^{n-1} |U_i| \,/\, \sum_{i=1}^{n} |U_{i-1}||U_i|$ (every non-output node must have at least one outgoing edge).
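The definitions in this section translate directly into matrix computations. The sketch below assumes each adjacency submatrix is stored with rows indexing the earlier layer and columns indexing the later layer, so that the product of the submatrices counts paths from inputs to outputs. The helper names (`path_counts`, `is_symmetric`, `density`) and the example topologies are hypothetical, not taken from the paper.

```python
from functools import reduce

import numpy as np


def path_counts(submatrices):
    """Entry (u, v) is the number of paths from input u to output v."""
    return reduce(np.matmul, submatrices)


def is_path_connected(submatrices):
    """True if every input reaches every output."""
    return bool((path_counts(submatrices) > 0).all())


def is_symmetric(submatrices):
    """True if every input-output pair has the same positive path count."""
    counts = path_counts(submatrices)
    return bool(counts[0, 0] > 0 and (counts == counts[0, 0]).all())


def density(submatrices):
    """Edges present divided by edges of the dense topology on the same layers."""
    edges = sum(int(w.sum()) for w in submatrices)
    dense_edges = sum(w.shape[0] * w.shape[1] for w in submatrices)
    return edges / dense_edges


# A dense 2-3-2 network, a sparse butterfly-like network, and a disconnected chain.
dense = [np.ones((2, 3), dtype=np.int64), np.ones((3, 2), dtype=np.int64)]
butterfly = [
    np.array([[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]),
    np.array([[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1]]),
]
chain = [np.eye(2, dtype=np.int64), np.eye(2, dtype=np.int64)]
```

The butterfly example has half the edges of a dense network on the same layers yet still connects every input to every output by exactly one path, which is the behaviour the mixed-radix construction generalizes.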
III RadiX-Net Topologies
III-A Constructing RadiX-Net Topologies
We construct RadiX-Net topologies using mixed-radix topologies as building blocks, as motivated by Figure 2.
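Mixed-Radix Topologies: Let $L$ be a positive integer, and let $\mathcal{N} = (N_1, \ldots, N_L)$, where $N_i$ is an integer greater than 1 for all $i \in \{1, \ldots, L\}$. Let $N' = \prod_{i=1}^{L} N_i$, and let $U_i$ be a set of $N'$ nodes, with labels $0, \ldots, N' - 1$, for all $i \in \{0, \ldots, L\}$. For all $i \in \{1, \ldots, L\}$, we create edges from node $j$ in $U_{i-1}$ to node $\left(j + n \prod_{k=1}^{i-1} N_k\right) \bmod N'$ in $U_i$ for all $n \in \{0, \ldots, N_i - 1\}$. Let $W_i$ be the adjacency submatrix defining the edges from $U_{i-1}$ to $U_i$. By construction, we have that
$$W_i = \sum_{n=0}^{N_i - 1} P^{\,n \nu_i} \tag{1}$$
where $\nu_i = \prod_{k=1}^{i-1} N_k$ and $P$ is the permutation matrix
$$P = \begin{pmatrix} 0 & I_{N'-1} \\ 1 & 0 \end{pmatrix}, \tag{2}$$
$I_{N'-1}$ being the $(N'-1) \times (N'-1)$ identity matrix. We refer to the resulting graph as the mixed-radix topology induced by $\mathcal{N}$.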
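As an illustration of this construction, the sketch below builds the adjacency submatrices of a mixed-radix topology as sums of powers of a cyclic shift. It assumes rows index source nodes and columns index target nodes, and that the offset used in layer i is the product of all earlier radices; the function name `mixed_radix_submatrices` is hypothetical.

```python
import numpy as np


def mixed_radix_submatrices(radices):
    """One adjacency submatrix per radix, each of size N' x N'."""
    n_nodes = int(np.prod(radices))
    # Cyclic shift: shift[j, (j + 1) % n_nodes] == 1.
    shift = np.roll(np.eye(n_nodes, dtype=np.int64), 1, axis=1)
    submatrices, place = [], 1
    for radix in radices:
        # Node j connects to nodes j + n * place (mod N') for n = 0..radix-1.
        w = sum(np.linalg.matrix_power(shift, n * place) for n in range(radix))
        submatrices.append(w)
        place *= radix
    return submatrices


layers = mixed_radix_submatrices((2, 3, 4))
paths = layers[0] @ layers[1] @ layers[2]
out_degrees = [int(w.sum(axis=1)[0]) for w in layers]
```

Under these assumptions, `paths` has every entry equal to one, matching the claim that a mixed-radix topology has exactly one path between each input and output node, while each node in layer i has out-degree equal to the i-th radix.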
RadiX-Net Topologies: Here, we formally construct RadiX-Net topologies using mixed-radix topologies, adjacency submatrices, and the Kronecker product, as motivated by Figure 5. For an informal programmatic construction, see Figure 6.
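RadiX-Net topologies are uniquely defined by an ordered set of mixed-radix numeral systems $N^* = (\mathcal{N}_1, \ldots, \mathcal{N}_M)$ together with an ordered set of positive integers $\mathcal{D}$. We require that
1. there exists a positive integer $N'$ such that $\prod_{N \in \mathcal{N}_i} N = N'$ for all $i \in \{1, \ldots, M-1\}$, and
2. $\prod_{N \in \mathcal{N}_M} N$ divides $N'$.
Let $\bar{M} = \sum_{i=1}^{M} |\mathcal{N}_i|$, the total number of radices in $N^*$; we further require that $\mathcal{D} = (D_0, \ldots, D_{\bar{M}})$ consist of integers satisfying $D_i \geq 1$ for all $i \in \{0, \ldots, \bar{M}\}$.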
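We construct a RadiX-Net using $N^* = (\mathcal{N}_1, \ldots, \mathcal{N}_M)$ and $\mathcal{D} = (D_0, \ldots, D_{\bar{M}})$ as follows, where $\bar{M}$ is the total number of radices in $N^*$: let $G_i$ be the mixed-radix topology induced by $\mathcal{N}_i$. Identifying the output nodes of $G_i$ with the input nodes of $G_{i+1}$ creates an $(\bar{M}+1)$-layer FNNT with ordered set of adjacency submatrices $(W_1, \ldots, W_{\bar{M}})$ of the form (1). We refer to such an FNNT as an extended mixed-radix topology (see Appendix). Similarly, $\mathcal{D}$ implicitly defines a unique dense DNN topology on an ordered collection of nodes $(U_0, \ldots, U_{\bar{M}})$ satisfying $|U_i| = D_i$. The ordered set of adjacency matrices of this dense topology is $(B_1, \ldots, B_{\bar{M}})$, where $B_i$ is the $D_{i-1} \times D_i$ matrix of ones. We define $G$ as the unique FNNT defined by
$$W_i^{*} = B_i \otimes W_i, \quad i \in \{1, \ldots, \bar{M}\} \tag{3}$$
(see Mathematical Preliminaries).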
Mixed-radix and RadiX-Net topologies satisfy symmetry, and therefore path-connectedness. Proofs for this assertion, as well as the number of paths from any node in the input layer to a node in the output layer for each family of topologies, can be found in the Appendix.
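The construction and the symmetry claim can be sketched numerically. The code below is an illustrative reading of the construction, not the authors' implementation: it assumes rows index source nodes, that every mixed-radix system is laid out on the same number of nodes per layer, and that each layer is the Kronecker product of a ones matrix (the dense part) with a mixed-radix submatrix, in that order. The names `radix_net_submatrices`, `systems`, and `widths` are hypothetical.

```python
from functools import reduce

import numpy as np


def mixed_radix_submatrices(radices, n_nodes):
    """Adjacency submatrices of one mixed-radix system on n_nodes nodes per layer."""
    shift = np.roll(np.eye(n_nodes, dtype=np.int64), 1, axis=1)
    submatrices, place = [], 1
    for radix in radices:
        submatrices.append(
            sum(np.linalg.matrix_power(shift, n * place) for n in range(radix))
        )
        place *= radix
    return submatrices


def radix_net_submatrices(systems, widths):
    """Kronecker product of dense ones blocks with the chained mixed-radix layers."""
    n_prime = int(np.prod(systems[0]))
    if any(int(np.prod(radices)) != n_prime for radices in systems[:-1]):
        raise ValueError("all systems but the last must share the same product")
    if n_prime % int(np.prod(systems[-1])) != 0:
        raise ValueError("the product of the last system must divide the others")
    extended = [
        w for radices in systems for w in mixed_radix_submatrices(radices, n_prime)
    ]
    if len(widths) != len(extended) + 1:
        raise ValueError("need one width per layer of nodes")
    return [
        np.kron(np.ones((widths[i], widths[i + 1]), dtype=np.int64), w)
        for i, w in enumerate(extended)
    ]


systems = [(2, 2), (4,), (2,)]  # products 4, 4, and 2 (which divides 4)
widths = (1, 2, 3, 2, 1)        # one entry per layer of nodes
layers = radix_net_submatrices(systems, widths)
paths = reduce(np.matmul, layers)  # path counts from inputs to outputs
```

In this example every entry of `paths` is the same positive integer, so the resulting topology is symmetric even though the layer widths vary from 4 to 12 nodes.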
III-B Asymptotic Sparsity of RadiX-Nets
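Say $G$ is the RadiX-Net topology generated by $N^* = (\mathcal{N}_1, \ldots, \mathcal{N}_M)$ and $\mathcal{D} = (D_0, \ldots, D_{\bar{M}})$. Further say $\mathcal{N}_i = (N^i_1, \ldots, N^i_{L_i})$ for all $i \in \{1, \ldots, M\}$, and let $N'$ be the integer satisfying $N' = \prod_{j=1}^{L_i} N^i_j$ for all $i \in \{1, \ldots, M-1\}$. If we define
$$(\tilde{N}_1, \ldots, \tilde{N}_{\bar{M}}) = (N^1_1, \ldots, N^1_{L_1}, \ldots, N^M_1, \ldots, N^M_{L_M}),$$
then the density of $G$ is given by
$$\Delta(G) = \frac{\sum_{i=1}^{\bar{M}} \tilde{N}_i D_{i-1} D_i}{N' \sum_{i=1}^{\bar{M}} D_{i-1} D_i}. \tag{4}$$
Let $\mu$ be the mean value of $(\tilde{N}_1, \ldots, \tilde{N}_{\bar{M}})$. When $\mathcal{D}$ has sufficiently small variance, it follows immediately from (4) that
$$\Delta(G) \approx \frac{\mu}{N'}.$$
This implies that when $\mathcal{D}$ has small variance, the sparsity of $G$ is negligibly affected by $\mathcal{D}$.
We define $d = \log_{\mu} N'$. For sufficiently small variance of the $\tilde{N}_i$, we can assume that $d$ is approximately equal to some integer, with which we can write
$$\Delta(G) \approx \mu^{1-d}.$$
Concretely, $\mu$ corresponds to the average radix of each mixed-radix numeral system used to construct $G$, and $d$ corresponds to the number of radices used to construct each mixed-radix numeral system. (Per condition 2 in Section III-A, this excludes the last mixed-radix numeral system. Note that this assumption is contingent on the radices having sufficiently small variance.) The effect of $\mu$ and $d$ on the sparsity of $G$ is shown in Figure 7.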
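The density estimate can be checked with a few lines of arithmetic. The sketch below assumes, following the Kronecker construction, that the layer built from radix r between dense widths a and b contributes a * b * N' * r edges, against a * b * N'^2 edges in the dense counterpart. The function name `radix_net_density` and the example values are hypothetical.

```python
import math


def radix_net_density(radices, widths, n_prime):
    """Exact density from per-layer edge counts (radices listed layer by layer)."""
    sparse_edges = sum(
        widths[i] * widths[i + 1] * n_prime * r for i, r in enumerate(radices)
    )
    dense_edges = sum(
        widths[i] * widths[i + 1] * n_prime**2 for i in range(len(radices))
    )
    return sparse_edges / dense_edges


radices = (2, 4, 4, 2)  # two systems, (2, 4) and (4, 2), each with product 8
n_prime = 8
mean_radix = sum(radices) / len(radices)
depth = math.log(n_prime, mean_radix)  # generally not an integer

uniform = radix_net_density(radices, (1, 1, 1, 1, 1), n_prime)
varied = radix_net_density(radices, (1, 2, 3, 4, 5), n_prime)
estimate = mean_radix / n_prime
```

With uniform widths the exact density coincides with the mean radix divided by N'. With varied widths it drifts only slightly, which is the sense in which the widths have a small effect on sparsity.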
IV Conclusions & Future Work
This paper presents the RadiX-Net algorithm, which deterministically generates sparse DNN topologies that, as a whole, are much more diverse than X-Net topologies while preserving X-Nets' desired characteristics. A related effort [15] benchmarks RadiX-Net performance against X-Net, dense DNN, and other neural network implementations. Furthermore, RadiX-Net is used in [18] to construct a neural net simulating the size and sparsity of the human brain.
Prabhu et al. and Alford et al. come at the end of a long history of sparse neural network research [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. This collective body mutually corroborates the following assertion: sparse neural networks can train to the same arbitrary degree of precision as their dense counterparts. While the reduced training time of sparse neural nets can be attributed to having fewer parameters, there is no intuitive reason why sparse networks should demonstrate the same expressive power, as some have put it, as their dense counterparts.
Naïvely, should sparse networks have the same expressive power as dense networks, dense and pruned networks would be obsolete, as de novo sparse networks achieve the expressive power of both while exceeding the training speed of both. Because the corpus of research in sparse networks seems unanimous on the subject, it would behoove the field to become more objective about what is meant when discussing expressive power, as is done in [19, 20, 21]. As demonstrated by [22], functional analysis provides a powerful language with which to describe the abilities and limitations of neural networks rigorously. In Section IV.B, we present a functional-analytic conjecture based on the mentioned experimental findings, which the authors intend to prove at a later date. Posing and proving such conjectures would direct future research in artificial neural networks more prudently than would experimental results alone.
IV-A Preliminaries for Conjecture
The most sturdy theoretical ground upon which artificial neural nets stand is Cybenko’s Universality Theorem. Though the original statement of the theorem is stronger than the corollary below, this corollary captures the significance of the Universality Theorem in the field of artificial neural networks.
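Corollary.
Let $\sigma : \mathbb{R} \to \mathbb{R}$ be a continuous function such that $\lim_{t \to -\infty} \sigma(t) = 0$ and $\lim_{t \to \infty} \sigma(t) = 1$ (let us call this function sigmoidal). Further, let $C(I_n)$ be the space of continuous functions on $I_n = [0, 1]^n$ with metric topology defined by supremum norm $\|\cdot\|$. Lastly, let $S$ be the set of functions of the form
$$G(x) = \sum_{j=1}^{N} \alpha_j \, \sigma(y_j^{T} x + \theta_j)$$
where $N$ is a natural number, $\alpha_j$ and $\theta_j$ are real numbers, and $y_j$ is an element of $\mathbb{R}^n$. The set $S$ is dense in $C(I_n)$. ∎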
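To make the form of these functions concrete, the sketch below evaluates a finite sum of scaled sigmoids and uses two steep terms to approximate the indicator of an interval on [0, 1]. The logistic function is one choice of sigmoidal function, picked here as an assumption; the names `cybenko_sum` and `steepness` and all numeric values are illustrative.

```python
import numpy as np


def sigmoid(t):
    """Logistic function: continuous, tends to 0 at -inf and 1 at +inf."""
    return 1.0 / (1.0 + np.exp(-t))


def cybenko_sum(x, alphas, ys, thetas):
    """Evaluate sum_j alphas[j] * sigmoid(ys[j] . x + thetas[j])."""
    x = np.asarray(x, dtype=float)
    pre_activations = np.asarray(ys) @ x + np.asarray(thetas)
    return float(np.dot(alphas, sigmoid(pre_activations)))


# Two steep sigmoids whose difference is close to 1 on (0.25, 0.75) and 0 elsewhere.
steepness = 100.0
alphas = np.array([1.0, -1.0])
ys = np.array([[steepness], [steepness]])
thetas = np.array([-0.25 * steepness, -0.75 * steepness])

inside = cybenko_sum([0.5], alphas, ys, thetas)
outside = cybenko_sum([0.0], alphas, ys, thetas)
```

Sums of such bumps can approximate continuous functions on the unit cube, which is the intuition behind the density statement.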
We adopt some of the language of this corollary to make our conjecture connect more immediately to the literature.
Let , , , and be as defined above. We define a feedforward neural network (FNN) as an FNNT , with set of edges , together with a map assigning a weight to each edge and a map —where is the number of non-input layers in —assigning a bias to each non-input node. We associate with each FNN the unique map defined by the following:
- •
let map each node to the set of edges going into ;
- •
for all , let ;
- •
for all and for all , let
[TABLE]
- •
assuming , we define
[TABLE]
Let be an infinite ordered collection of finite sets of nodes such that . Let be the unique fully-connected FNNT on , and let be some sparse FNNT on satisfying symmetry. We define and as the unique FNNTs constructed by restricting and , respectively, to the set of nodes , introducing a new node , and creating an edge from to for all . Finally, let and be the sets of continuous functions which can be represented as FNNs on and , respectively.
IV-B Functional-Analytic Conjecture
Due to the findings of Prabhu et al., Alford et al., and others, we are convinced that de novo sparse neural network topologies exhibit the same expressive power as fully-connected DNN topologies in the following way.
Conjecture.
For all , we define
[TABLE]
If is in for some , then is also in . ∎
Acknowledgment
The authors wish to acknowledge the following individuals for their contributions and support: Simon Alford, Alan Edelman, Vijay Gadepally, Chris Hill, Hayden Jananthan, Lauren Milechin, Richard Wang, and the MIT SuperCloud team.
Appendix
For purposes of simplifying Theorem 1, we use the following two lemmas. Lemma 2 discusses extended mixed-radix topologies, which we define as RadiX-Net topologies generated by satisfying for all .
Lemma 1.
Mixed-radix topologies satisfy symmetry, and the number of paths from an input node to an output node is one.
Proof.
This follows directly from the definition of a mixed-radix numeral system. ∎
Lemma 2.
Let be the extended mixed-radix (EMR) topology defined by some satisfying the RadiX-Net constraints (see Section III: RadiX-Net Topologies). satisfies symmetry, and the number of paths from an input node to an output node is , where is the integer satisfying for all .
Proof.
We show this by induction. Say that, for some positive integer , all EMR topologies defined by some satisfy symmetry. Let for some satisfying the RadiX-Net constraints, and let be the EMR topology induced by . Recall that is formed from the disjoint union of the MR topologies (generated by ) by identifying and for all (here, and simply refer to the output and input layers, respectively, of ). Because is an MR topology, Lemma 1 guarantees that there exists exactly one path from to for all and all . By hypothesis, for some positive integer , there exist exactly paths from to for all such . Because and are identified, this implies that for every path from to , there exists exactly one path from to which passes through . Further, because there are such , there exist exactly paths from to for all choices of . By induction from the case (i.e. Lemma 1), satisfies symmetry, and . ∎
Theorem 1.
Let be the RadiX-Net topology defined by some satisfying the RadiX-Net constraints. We order the layers of in the natural way, where and are the input and output layers, respectively, of . satisfies symmetry, and the number of paths from input node to output node is given by , where is the integer satisfying for all .
Proof.
Let be the adjacency matrix of , and let be as defined in (3). We define , , and . Up to a permutation, is of the form
[TABLE]
Therefore, the following statements hold.
[TABLE]
The deduction above is a consequence of the mixed-product property of the Kronecker product [17]. It is easy to show that
[TABLE]
where is the matrix of ones. By Lemma 2, it holds that
[TABLE]
Therefore,
[TABLE]
So satisfies symmetry, and for all input nodes and output nodes , there exist exactly paths from to . ∎
References
[1] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–9, June 2015.
[2] J. Kepner, V. Gadepally, H. Jananthan, L. Milechin, and S. Samsi, “Sparse deep neural network exact solutions,” in High Performance Extreme Computing Conference (HPEC), IEEE, 2018.
[3] Y. Le Cun, J. S. Denker, and S. A. Solla, “Optimal brain damage,” in Advances in neural information processing systems, pp. 598–605, 1990.
[4] B. Hassibi and D. G. Stork, “Second order derivatives for network pruning: Optimal brain surgeon,” in Advances in neural information processing systems, pp. 164–171, 1993.
[5] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[6] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5MB model size,” arXiv preprint arXiv:1602.07360, 2016.
[7] S. Srinivas and R. V. Babu, “Data-free parameter pruning for deep neural networks,” CoRR, vol. abs/1507.06149, 2015.
[8] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding,” CoRR, vol. abs/1510.00149, 2015.
