SCANN: Synthesis of Compact and Accurate Neural Networks
Shayan Hassantabar, Zeyu Wang, Niraj K. Jha

TL;DR
This paper introduces SCANN, a novel neural network synthesis method that creates compact, accurate models through growth and pruning operations, and combines it with dataset reduction to improve efficiency for various applications.
Contribution
The paper presents a new synthesis approach, SCANN, and its combination with dataset dimensionality reduction, enabling the design of efficient neural networks with minimal accuracy loss.
Findings
SCANN produces neural networks with improved accuracy and efficiency.
DR+SCANN effectively reduces dataset dimensionality, enhancing model compactness.
The methods outperform traditional architectures on multiple benchmarks.
Abstract
Deep neural networks (DNNs) have become the driving force behind recent artificial intelligence (AI) research. An important problem with implementing a neural network is the design of its architecture. Typically, such an architecture is obtained manually by exploring its hyperparameter space and kept fixed during training. This approach is time-consuming and inefficient. Another issue is that modern neural networks often contain millions of parameters, whereas many applications and devices require small inference models. However, efforts to migrate DNNs to such devices typically entail a significant loss of classification accuracy. To address these challenges, we propose a two-step neural network synthesis methodology, called DR+SCANN, that combines two complementary approaches to design compact and accurate DNNs. At the core of our framework is the SCANN methodology that uses three…
| Dataset | Training Set | Validation Set | Test Set | Features | Classes |
|---|---|---|---|---|---|
| Sensorless Drive Diagnosis | | | | | |
| Human Activity Recognition (HAR) | | | | | |
| Musk v | | | | | |
| Pen-Based Recognition of Handwritten Digits | | | | | |
| Landsat Satellite Image | | | | | |
| Letter Recognition | | | | | |
| Epileptic Seizure Recognition | | | | | |
| Smartphone Human Activity Recognition | | | | | |
| DNA | | | | | |
| Methods | Error rate | #Params | Compression ratio |
|---|---|---|---|
| Baseline | 0.72% | 430.5K | 1.0× |
| Network pruning [28] | 0.77% | 34.5K | 12.5× |
| Scheme A | 0.68% | 184.6K | 2.3× |
| Scheme B | 0.72% | 19.3K | 22.3× |
| Scheme C | 0.72% | 9.3K | 46.3× |
| Methods | Error rate | #Params | FLOPs |
|---|---|---|---|
| Ciresan et al. [41] | | 12.0M | 23.9M |
| Scheme C | | 0.6M (20.0×) | 1.2M (19.9×) |
| Dataset | MLP | DR (H.A.) | DR (M.C.) | SCANN (H.A.) | SCANN (M.C.) | DR+SCANN (H.A.) | DR+SCANN (M.C.) |
|---|---|---|---|---|---|---|---|
| SenDrive | | (FA) | (FA) | | | | |
| HAR | | (ICA) | (ICA) | | | | |
| Musk | | (FA) | (FA) | | | | |
| Pendigits | | (Isomap) | (Isomap) | | | | |
| SatIm | | (PCA) | (PCA) | | | | |
| Letter | | (PCA) | (PCA) | | | | |
| Seizure | | (FA) | (FA) | | | | |
| SHAR | | (RP) | (RP) | | | | |
| DNA | | (FA) | (FA) | | | | |
| Dataset | MLP | DR (H.A.) | DR (M.C.) | SCANN (H.A.) | SCANN (M.C.) | DR+SCANN (H.A.) | DR+SCANN (M.C.) |
|---|---|---|---|---|---|---|---|
| SenDrive | k () | k () | () | k () | () | k () | () |
| HAR | k () | k () | k () | k () | k () | k () | () |
| Musk | k () | k () | k () | k () | k () | () | () |
| Pendigits | k () | () | () | k () | k () | () | () |
| SatIm | k () | k () | k () | k () | k () | k () | k () |
| Letter | k () | k () | k () | k () | k () | k () | k () |
| Seizure | k () | k () | () | k () | k () | k () | () |
| SHAR | k () | k () | k () | k () | () | k () | () |
| DNA | k () | k () | k () | k () | () | () | () |
| Dataset | MLP | DR (H.A.) | DR (M.C.) | SCANN (H.A.) | SCANN (M.C.) | DR+SCANN (H.A.) | DR+SCANN (M.C.) |
|---|---|---|---|---|---|---|---|
| SenDrive | e- | e- | e- | e- | e- | e- | e- |
| HAR | e- | e- | e- | e- | e- | e- | e- |
| Musk | e- | e- | e- | e- | e- | e- | e- |
| Pendigits | e- | e- | e- | e- | e- | e- | e- |
| SatIm | e- | e- | e- | e- | e- | e- | e- |
| Letter | e- | e- | e- | e- | e- | e- | e- |
| Seizure | e- | e- | e- | e- | e- | e- | e- |
| SHAR | e- | e- | e- | e- | e- | e- | e- |
| DNA | e- | e- | e- | e- | e- | e- | e- |
This work was supported by IP Group and NSF Grant Nos. CNS-1617640 and CNS-1907381. Shayan Hassantabar, Zeyu Wang, and Niraj K. Jha are with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544 USA (e-mail: {seyedh, zeyuwang, jha}@princeton.edu).
Abstract
Deep neural networks (DNNs) have become the driving force behind recent artificial intelligence (AI) research. With the help of a vast amount of training data, neural networks can perform better than traditional machine learning algorithms in many applications. An important problem with implementing a neural network is the design of its architecture. Typically, such an architecture is obtained manually by exploring its hyperparameter space and kept fixed during training. This approach is both time-consuming and inefficient. Another issue is that modern neural networks often contain millions of parameters, whereas many applications require small inference models due to imposed resource constraints, such as energy constraints on battery-operated devices. However, efforts to migrate DNNs to such devices typically entail a significant loss of classification accuracy. To address these challenges, we propose a two-step neural network synthesis methodology, called DR+SCANN, that combines two complementary approaches to design compact and accurate DNNs. At the core of our framework is the SCANN methodology that uses three basic architecture-changing operations, namely connection growth, neuron growth, and connection pruning, to synthesize feed-forward architectures with arbitrary structure. These neural networks are not limited to the multilayer perceptron structure. SCANN encapsulates three synthesis methodologies that apply a repeated grow-and-prune paradigm to three architectural starting points. DR+SCANN combines the SCANN methodology with dataset dimensionality reduction to alleviate the curse of dimensionality. We demonstrate the efficacy of SCANN and DR+SCANN on various image and non-image datasets. We evaluate SCANN on MNIST and ImageNet benchmarks. Without any loss in accuracy, SCANN generates a 46.3× smaller network than the LeNet-5 Caffe model.
We also compare SCANN-synthesized networks with a state-of-the-art fully-connected feed-forward model for MNIST, and show a 20.0× (19.9×) reduction in the number of parameters (floating-point operations) with little drop in accuracy. On the ImageNet dataset, for the VGG-16 and MobileNetV2 architectures, we reduce the network parameters by and with a similar performance or improvement over their respective baselines. We also evaluate the efficacy of using dimensionality reduction alongside SCANN (DR+SCANN) on nine small to medium-size datasets. Using this methodology enables us to reduce the number of connections in the network by up to (geometric mean: 82.1×), with little to no drop in accuracy. We also show that our synthesis methodology yields neural networks that are much better at navigating the accuracy vs. energy efficiency space. This would enable neural network-based inference even on Internet-of-Things sensors.
Index Terms:
Architecture synthesis; compact network; compression; dimensionality reduction; energy efficiency; neural network.
1 Introduction
Artificial neural networks have a long history, dating back to the 1950s. However, interest in neural networks has waxed and waned over the years. The recent spurt in interest in deep neural networks (DNNs) is due to large datasets becoming available, enabling them to be trained to high accuracy. This trend is also due to a significant increase in computing power that speeds up the training process. DNNs demonstrate very high classification accuracies for many applications of interest, e.g., image recognition, speech recognition, and machine translation. They have become deeper, with tens to hundreds of layers. Thus, the phrase ‘deep learning’ is often associated with such DNNs. Deep learning refers to the ability of DNNs to learn hierarchically, with complex features built upon simple ones.
The DNN architecture trained on a specific dataset has a great impact on the final performance of the model. For example, Table I compares several well-known DNNs designed for the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012-2016 [1]. We show the architectures in terms of the number of parameters in the network (#params) and floating-point operations (FLOPs), as well as their performance on this task, i.e., the top-5 accuracy. Although all these well-known DNN architectures were obtained using the same training dataset and the same back-propagation (BP) algorithm for training weights, due to their architectural differences, their performance is vastly different in terms of classification accuracy, computational costs, and memory requirements.
Though critically important, how to derive an appropriate DNN architecture for small, medium, and large datasets has remained a vexing problem. Since the DNN architecture directly influences the learned representations and thus the performance of the model, this is an important challenge in deploying DNNs in practice and using their knowledge distillation power in various applications, such as smart healthcare [7, 8]. Typically, it takes researchers a huge amount of time through much trial-and-error to find a good architecture because the search space is exponentially large with respect to many of its hyperparameters. Furthermore, these architectures need to be trained on large datasets. This approach suffers from four major problems:
- Fixed network architecture: These methods use the BP algorithm to train the weights, and not to optimize the architecture. This means that the DNN architecture, including the depth and connections of the network, is kept fixed during the training process. This does not lead to better DNN architectures.
- Lengthy search process: Searching for an accurate DNN architecture through trial and error is inefficient. This problem is exacerbated when the DNN becomes larger. Each trial can easily take tens of hours on fast graphical processing units (GPUs). In addition, it takes months to design more efficient architectures for certain tasks, such as the architectures shown in Table I for image classification.
- Architectural redundancy: Most DNNs suffer from substantial storage and computation redundancy [9]. For example, Dai et al. [10] show that the number of parameters and the FLOPs of ResNet-50 can be reduced by 4.1× and 2.1×, respectively, without loss of accuracy.
- Need for large datasets: Collecting a large number of data instances and manually labeling them is a costly process, especially in domains where experts are needed to label the data instances, such as data collected for healthcare [11]. Using synthetic data generated from the same distribution as the real data, however, can reduce the need for large datasets [12].
As the number of features, i.e., dimension, of the dataset increases, in order to generalize accurately, we need exponentially more data. This is another challenge that is referred to as the curse of dimensionality. Hence, one way to reduce the need for large amounts of data is to reduce the dimensionality of the dataset. In addition, with the same amount of data, by reducing the number of features, the accuracy of the inference model may also improve to a degree. However, beyond a certain point, which is dataset-dependent, reducing the number of features may lead to loss of information, which may lead to inferior classification results.
To address the above problems, we propose a new DNN synthesis tool called DR+SCANN that combines two different approaches to synthesize very compact, yet accurate, DNNs. DR+SCANN starts the DNN synthesis from a seed DNN architecture. We refer to this architecture as the baseline model. The baseline model can be chosen from among the well-known architectures for certain datasets, e.g., ImageNet, or a well-performing fully-connected (FC) architecture. First, we use dimensionality reduction (DR) methods for the non-image datasets to reduce their feature size. This helps with network compression while improving its classification accuracy.
The second step of our methodology is its main part, called SCANN. SCANN starts DNN synthesis with a seed architecture. It uses three architecture-changing operations in multiple iterations to synthesize accurate and compact models. It focuses on the FC layers of the architecture and allows DNNs to grow connections and neurons based on the gradient information so that the model can be adapted to the task at hand. SCANN uses two different operations for network growth, namely connection growth and neuron growth. Then, SCANN prunes away insignificant connections in the architecture based on the magnitude information. Unlike previous grow-and-prune synthesis approaches [10], SCANN does not limit the architecture to the multi-layer perceptron (MLP) structure. By allowing any neuron to connect to any other neuron in the DNN architecture, SCANN allows skipped connections in the network. In addition, although previous grow-and-prune synthesis approaches allow the weights and connections to be learned during the training process, they do not allow a change in the number of architecture layers during training. SCANN removes this limitation, allowing it to derive better architectures.
We use SCANN and DR+SCANN to synthesize various compact DNNs for small, medium, and large datasets. We use the SCANN methodology to generate compact DNNs for the MNIST [13] and ImageNet [1] datasets. We use DR+SCANN to generate compact DNNs for several non-image datasets. As we show later, DR+SCANN leads to drastic reductions in the number of parameters and computational cost of the model relative to the FC DNN baselines while also improving classification performance.
The major contributions of this work can be summarized as follows:
- We present SCANN, a grow-and-prune synthesis methodology, that yields compact and accurate feed-forward neural networks for datasets spanning small to large sizes. SCANN addresses a limitation of prior work that fixes the number of layers in the architecture prior to the training process.
- We use DR methods to mitigate the curse of dimensionality and improve the performance while compressing the network architecture.
- We propose a two-step DNN synthesis process, DR+SCANN, that combines DR with SCANN to learn very compact and accurate neural network models.
- We evaluate the performance of SCANN on MNIST and ImageNet datasets with various seed architectures. SCANN targets the FC layers of image-based architectures since these layers contain a large fraction of all parameters.
- We evaluate the performance of DR+SCANN on nine small to medium datasets and demonstrate to compression in network parameters with little to no drop in accuracy. We demonstrate that DR+SCANN yields DNNs that are very energy-efficient, while offering a similar accuracy to other methods. This opens the door for such DNNs to even be used in Internet-of-Things (IoT) sensors.
The rest of the article is organized as follows. Section 2 describes related work. Section 3 describes the SCANN and DR+SCANN synthesis methodologies in detail. Section 4 provides results of our evaluations. Section 5 provides a short discussion. Finally, Section 6 concludes the article.
2 Related Work
In this section, we review some of the previous work in two related areas: DR and automatic architecture synthesis.
2.1 Dimensionality Reduction
The high dimensionality of many datasets used in various applications of machine learning leads to the curse of dimensionality problem. Therefore, researchers have explored DR methods to improve the performance of machine learning models by decreasing the number of features. Traditional DR methods include Principal Component Analysis (PCA), Kernel PCA, Factor Analysis (FA), Independent Component Analysis (ICA), as well as Spectral Embedding methods. Some graph-based methods include Isomap [14] and Maximum Variance Unfolding [15]. FeatureNet [16] uses community detection in small sample size datasets to map high-dimensional data to lower dimensions. Other DR methods include stochastic proximity embedding, linear discriminant analysis, and t-distributed stochastic neighbor embedding [17]. A detailed survey of DR methods can be found in [18].
2.2 Automatic Architecture Synthesis
There are three different categories of automatic architecture synthesis methods that have been proposed by researchers: evolutionary algorithm, reinforcement learning algorithm, and structure adaptation algorithm.
2.2.1 Reinforcement Learning Algorithm
In a recent trend, reinforcement learning (RL) has been used to search for architectures in an automated flow [19]. This approach is known as neural architecture search (NAS). A typical NAS framework uses a controller based on recurrent neural networks to iteratively generate candidate architectures in the search process. Based on the performance of the candidate architectures, the RL controller gets updated in the next iteration. Zoph and Le [19] use a recurrent neural network as a controller to generate a string that specifies the network architecture. They use the performance of the generated network on a validation set as the reward signal to compute the policy gradient and update the controller. NASNet [20] yields a new search space that is transferable. It uses RL to find the best convolutional layers on the CIFAR-10 dataset and then uses these layers for the ImageNet dataset by stacking multiple copies each with its own parameters. RL-based approaches can also be used to design efficient DNN architectures for mobile platforms. MNasNet [21] uses this approach to achieve top-1 accuracy of on the ImageNet classification task with very low latency on the mobile platforms. Although RL-based architecture search approaches have been successful, this process remains computationally intensive.
2.2.2 Evolutionary Algorithm
The use of an evolutionary algorithm to select a DNN architecture dates back to 1989 [22]. One of the seminal works in neuroevolution is the NEAT algorithm [23], which uses direct encoding of every neuron and connection to simultaneously evolve the network architecture and weights through weight mutation, connection mutation, node mutation, and crossover. Recent years have seen extensions of the evolutionary algorithm to generate convolutional neural networks (CNNs). For example, Xie and Yuille [24] use a concise binary representation of network connections and demonstrate a comparable classification accuracy to previous human-designed architectures. It is also beneficial to combine efficient evolutionary search with various performance predictors to optimize architectural hyperparameters [25, 26]. FBNetV [27] adds the training recipe (i.e., training hyperparameters) to the evolutionary search process. As a result, the search process can find higher accuracy-recipe combinations.
2.2.3 Structure Adaptation Algorithm
Several previous works achieve compact and accurate neural networks through structure adaptation algorithms. One such method is network pruning, which has been used in several works [28, 10, 29, 30]. Structure adaptation algorithms can be constructive or destructive. Constructive algorithms start with a small neural network and grow it into a larger more accurate neural network. Destructive algorithms start with a large neural network and prune connections and neurons to get rid of redundancy while maintaining accuracy. NeST [10] is a network synthesis tool that combines both the constructive and destructive approaches in a grow-and-prune synthesis paradigm. However, its limitation is that both growth and pruning are performed at a specific DNN layer. Thus, network depth cannot be adjusted and is fixed throughout training. In the next section, we show this problem can be solved by synthesizing a general feed-forward network instead of an MLP architecture, allowing the DNN depth to be learned dynamically during the training process.
Several works have also proposed more efficient building blocks for CNN architectures [31, 32, 6]. They result in compact networks, with much fewer parameters, while maintaining or improving performance. Platform-aware search for an optimized DNN architecture has also been used in this area. Yin et al. [33] combine the grow-prune synthesis methodology with hardware-guided training to achieve compact long short-term memory cells.
Orthogonal to the above works, quantization has also been used to reduce computations in a network with little to no accuracy drop [34].
3 Methodology
In this section, we describe various parts of the proposed DNN synthesis methodology. First, we give an overview of our two-step DNN synthesis approach. Then, we discuss the SCANN synthesis methodology to learn both the weights and an efficient DNN architecture. We then explain our DR pre-processing step to not only reduce the number of features, but to improve the classification accuracy as well.
3.1 Framework Overview
Our DNN synthesis methodology covers both non-image and image datasets. For the non-image datasets, we use a two-step sequential method, which we refer to as DR+SCANN. We illustrate the block diagram of DR+SCANN framework in Fig. 1. For the image dataset, we use just the SCANN DNN synthesis framework, as shown in Fig. 2. In the DR step, we first modify the dataset by conducting dataset normalization and performing DR on dataset features. DR is aimed at alleviating the curse of dimensionality and increasing classification accuracy. As a result, we also obtain a smaller DNN architecture.
For the non-image datasets, we compare the DNN models designed by the DR+SCANN methodology with FC baselines obtained by training various DNNs (with different numbers of layers and different numbers of neurons per layer) and verifying their performance on the validation set. The DR step chooses the best DR method and feature compression ratio for each dataset. We also demonstrate that we can use a smaller FC baseline architecture and still improve its classification accuracy when the number of features is reduced. The process of neural network compression with DR is explained in Section 3.2.
For the image datasets, MNIST and ImageNet in our experiments, we use well-known DNN models as the seed architectures. We use the SCANN synthesis methodology for designing efficient DNNs. SCANN combines gradient information to grow connections, activation information to grow neurons, and magnitude information to prune insignificant connections in the FC layers of the network. By allowing skipped connections between neurons in the architecture, SCANN addresses the limitation of prior work that requires the DNN depth to be fixed prior to the training phase. These architecture-changing operations are used in three different training schemes. This process is explained in Section 3.3.
3.2 Dimensionality reduction
In this step, we first normalize the data. Data normalization generally leads to higher accuracy and better noise tolerance. We use range normalization for this purpose.
$$x_{\text{normalized}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

This scales each input data instance into the [0,1] range. Next, we use DR to reduce the number of features in the dataset. An $m$-dimensional dataset is mapped onto a $k$-dimensional space, $k < m$, using various DR methods. We explore nine such methods, including four random projection (RP) methods.
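As a rough illustration of the range normalization step, the sketch below rescales every feature column into [0, 1]. The function name `range_normalize`, the per-feature convention, the reuse of training-set statistics, and the treatment of constant features are assumptions made for this example rather than details given in the text.

```python
def range_normalize(rows, mins=None, maxs=None):
    """Scale each feature (column) of `rows` into [0, 1].

    `mins`/`maxs` can be passed in so that validation and test data reuse
    the training-set statistics (an assumption of this sketch).
    """
    n_features = len(rows[0])
    if mins is None:
        mins = [min(r[j] for r in rows) for j in range(n_features)]
    if maxs is None:
        maxs = [max(r[j] for r in rows) for j in range(n_features)]
    scaled = []
    for r in rows:
        scaled.append([
            # A constant feature carries no information; map it to 0.0.
            (r[j] - mins[j]) / (maxs[j] - mins[j]) if maxs[j] > mins[j] else 0.0
            for j in range(n_features)
        ])
    return scaled, mins, maxs


train = [[2.0, 10.0], [4.0, 30.0], [6.0, 20.0]]
scaled, mins, maxs = range_normalize(train)
print(scaled)  # [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5]]
```

Passing the returned `mins` and `maxs` back in when normalizing held-out data keeps the scaling consistent across splits, although held-out values may then fall slightly outside [0, 1].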
Dimensionality reduction with RP is based on the Johnson-Lindenstrauss lemma [35, 36]. The essence of this lemma is that data points in a sufficiently high-dimensional space can be projected onto a suitable lower dimension, while approximately maintaining inter-point distances. More precisely, this lemma shows that the distance between the points changes only by a factor of $(1 \pm \varepsilon)$ when they are randomly projected onto a subspace of $O(\log n / \varepsilon^2)$ dimensions, for any $0 < \varepsilon < 1$, where $n$ is the number of points.
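RP uses a projection matrix $\Phi$ to compute the features in the lower dimension. The RP matrix $\Phi$ can be generated in several ways. Here, we discuss four RP matrices that we used in our implementation. One approach is to generate $\Phi$ using a Gaussian distribution. In this case, the entries $\phi_{ij}$ are i.i.d. samples drawn from a Gaussian distribution $\mathcal{N}(0, \frac{1}{k})$, where $k$ is the reduced dimension. Another RP matrix can be obtained by sampling entries from $\mathcal{N}(0, 1)$. These entries are shown below.

$$\phi_{ij} \sim \mathcal{N}\left(0, \tfrac{1}{k}\right) \quad \text{or} \quad \phi_{ij} \sim \mathcal{N}(0, 1)$$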
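Achlioptas [37] proposed several other sparse RP matrices. Two of these proposals are as follows, where the entries $\phi_{ij}$ of the projection matrix are independent random variables drawn from the following probability distributions:

$$\phi_{ij} = \begin{cases} +1 & \text{with probability } 1/2 \\ -1 & \text{with probability } 1/2 \end{cases}$$

$$\phi_{ij} = \sqrt{3} \times \begin{cases} +1 & \text{with probability } 1/6 \\ 0 & \text{with probability } 2/3 \\ -1 & \text{with probability } 1/6 \end{cases}$$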
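To make the four RP variants concrete, here is a minimal pure-Python sketch that builds each kind of projection matrix and applies it to one sample. The function names, the $k \times m$ orientation, and the Gaussian variance of $1/k$ are assumptions for illustration; the two discrete distributions follow Achlioptas's constructions, with any overall scaling constant left out.

```python
import math
import random


def gaussian_rp(m, k, rng, unit_variance=False):
    """k x m matrix of i.i.d. Gaussian entries.

    Variance is 1/k by default (assumed scaling) or 1 if unit_variance.
    """
    sigma = 1.0 if unit_variance else 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, sigma) for _ in range(m)] for _ in range(k)]


def achlioptas_dense(m, k, rng):
    """Entries are +1 or -1, each with probability 1/2."""
    return [[rng.choice((1.0, -1.0)) for _ in range(m)] for _ in range(k)]


def achlioptas_sparse(m, k, rng):
    """Entries are sqrt(3) * (+1, 0, -1) with probabilities 1/6, 2/3, 1/6."""
    s = math.sqrt(3.0)
    values = (s, 0.0, -s)
    weights = (1 / 6, 2 / 3, 1 / 6)
    return [rng.choices(values, weights=weights, k=m) for _ in range(k)]


def project(x, matrix):
    """Map an m-dimensional sample x to k dimensions (matrix times x)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]


rng = random.Random(0)
x = [0.1 * i for i in range(20)]        # one sample with 20 features
for make in (gaussian_rp, achlioptas_dense, achlioptas_sparse):
    y = project(x, make(20, 5, rng))
    print(make.__name__, len(y))        # each projection has 5 features
```

About two thirds of the entries in the sparse variant are zero, so the projection needs correspondingly fewer multiplications.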
The other DR methods that we use are PCA, FA, Isomap, ICA, and Spectral Embedding. Implementations of these methods are obtained from the Scikit-learn machine learning library [38].
DR maps the dataset into a vector space of lower dimension. As the number of features reduces, the number of neurons in the input layer of the neural network decreases accordingly. However, since the dataset dimension is reduced, one might expect the task of classification to become easier. This means that we may be able to use a smaller DNN architecture, in general. We show that we can indeed reduce the number of neurons in all layers, not just the input layer. In fact, we show that we can use a DNN architecture with the number of neurons in each layer reduced by the same feature compression ratio obtained in the DR step, except for the output layer. We use this ratio to show that DR can increase classification accuracy while enabling the use of a smaller DNN architecture. Fig. 3 shows an example of the process of reducing the number of neurons in the architecture. We refer to this model as the DR model.
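The uniform width reduction described above can be sketched in a few lines. The helper name `compress_layers`, the rounding rule, and the floor of one neuron per layer are assumptions of this example; the layer sizes are hypothetical.

```python
def compress_layers(layer_sizes, feature_ratio):
    """Shrink every layer except the output by `feature_ratio`.

    layer_sizes: neurons per layer, input layer first and output layer last.
    feature_ratio: original feature count divided by reduced feature count.
    """
    *shrinkable, output = layer_sizes
    # Keep at least one neuron per layer after rounding.
    reduced = [max(1, round(n / feature_ratio)) for n in shrinkable]
    return reduced + [output]


# Hypothetical baseline: 48 features, two hidden layers, 11 classes.
print(compress_layers([48, 96, 48, 11], feature_ratio=4))  # [12, 24, 12, 11]
```

The output layer is left untouched because its width is fixed by the number of classes.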
Algorithm 1 summarizes the process of dataset DR and architecture compression. We obtain the reduced-dimension dataset for all the DR methods and various feature compression ratios. Dimensionality reduction methods perform differently on various datasets. We also use early stopping to terminate DR methods that do not perform well on the validation set. Furthermore, we stop reducing the DR compression ratio when the performance drops significantly. As we reduce the number of features, we reduce the number of neurons in each layer of the initial architecture with the same ratio. Note that the number of neurons in each layer could be reduced using different ratios. However, the combinatorial explosion of choices makes navigation of this larger search space computationally prohibitive. We show later that uniform reduction across layers works very well in practice. Then, we train the new architecture using the reduced-dimension training set and evaluate it on the reduced-dimension validation set. Finally, we select the architecture with the highest validation accuracy and record its test accuracy. The output of this algorithm is the best performing architecture (on the validation dataset), its corresponding test accuracy, and the corresponding reduced-dimension dataset.
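The search described in this paragraph might look roughly like the loop below. It is a sketch under stated assumptions, not the paper's Algorithm 1: `train_and_eval` is a hypothetical user-supplied callable that reduces the data, trains the compressed architecture, and returns validation and test accuracy, and the `tolerance` rule for abandoning a DR method is an invented stand-in for the early-stopping criteria.

```python
def select_dr_model(dr_methods, ratios, baseline_layers, train_and_eval,
                    tolerance=0.02):
    """Return (val_acc, test_acc, method, ratio, layers) of the best DR model.

    dr_methods: names of the DR methods to try.
    ratios: feature compression ratios, in increasing order.
    train_and_eval(method, ratio, layers) -> (val_acc, test_acc).
    tolerance: validation-accuracy drop, relative to the best result seen
        for the current method, that stops further compression (assumed).
    """
    best = None
    for method in dr_methods:
        best_for_method = 0.0
        for ratio in ratios:
            # Shrink every layer except the output by the same ratio.
            *shrinkable, output = baseline_layers
            layers = [max(1, round(n / ratio)) for n in shrinkable] + [output]
            val_acc, test_acc = train_and_eval(method, ratio, layers)
            if best is None or val_acc > best[0]:
                best = (val_acc, test_acc, method, ratio, layers)
            if val_acc < best_for_method - tolerance:
                break  # performance dropped significantly: stop compressing
            best_for_method = max(best_for_method, val_acc)
    return best


def fake_eval(method, ratio, layers):
    """Stand-in for real training; returns made-up (val, test) accuracies."""
    table = {("PCA", 2): (0.90, 0.89), ("PCA", 4): (0.93, 0.92),
             ("PCA", 8): (0.80, 0.79), ("FA", 2): (0.91, 0.90),
             ("FA", 4): (0.95, 0.94), ("FA", 8): (0.94, 0.93)}
    return table[(method, ratio)]


print(select_dr_model(["PCA", "FA"], [2, 4, 8], [32, 64, 10], fake_eval))
# (0.95, 0.94, 'FA', 4, [8, 16, 10])
```

Selection uses validation accuracy only; the test accuracy is carried along and reported for the chosen model, mirroring the description above.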
3.3 SCANN Synthesis Methodology
Next, we explain the SCANN methodology that leverages both destructive and constructive architecture synthesis approaches through a grow-and-prune synthesis paradigm. As a result, the synthesis cost of this process is significantly reduced compared to RL-based approaches. SCANN can also be used in conjunction with the DR process, as explained in Section 3.2. DR+SCANN works on the reduced-dimension dataset whereas SCANN works on the original dataset.
We first propose a technique to address the limitation of prior work that requires the number of layers of DNN architecture to be fixed prior to the training process. Then, we introduce three basic architecture-changing techniques that enable the synthesis of an optimized feed-forward network architecture. Finally, we describe three training Schemes, A, B, and C, that can be used to learn the weights and connections in the network during the training process. Each of these training schemes uses a different approach to synthesizing efficient DNN architectures. Scheme A is a constructive approach that starts from a small network and iteratively grows the network to a larger one. On the other hand, Schemes B and C are based on destructive synthesis that starts from a larger network and iteratively prunes the architecture to a smaller one.
3.4 Depth Change
To address the problem of having to fix the depth of the DNN (number of layers in the architecture) prior to the training process, we adopt a general feed-forward architecture instead of an MLP structure. Specifically, in this setting, depth is determined by how hidden neurons are connected and thus can be changed through the rewiring of hidden neurons. As shown in Fig. 4, depending on how the hidden neurons are connected, they can form one, two, or three hidden layers. In addition, by allowing skipped connections in the architecture, we address the limitation of MLP structures in learning the architecture during the training process.
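To illustrate how depth follows from wiring alone, the sketch below computes the depth of each hidden neuron from a set of connections among hidden neurons. The helper `neuron_depths`, the (source, destination) encoding, and the convention that every hidden neuron is also fed by the input layer are assumptions of this example, and the connection graph must be acyclic.

```python
def neuron_depths(n_hidden, connections):
    """Depth of each hidden neuron in a feed-forward (acyclic) network.

    connections: set of (src, dst) pairs between hidden neurons; every
    hidden neuron is assumed to also receive the input layer (depth 0).
    Depth is 1 plus the longest chain of hidden neurons feeding the neuron.
    """
    preds = {j: [] for j in range(n_hidden)}
    for src, dst in connections:
        preds[dst].append(src)

    depth = {}

    def visit(j):
        if j not in depth:
            depth[j] = 1 + max((visit(i) for i in preds[j]), default=0)
        return depth[j]

    return [visit(j) for j in range(n_hidden)]


print(neuron_depths(3, set()))             # [1, 1, 1]: one hidden layer
print(neuron_depths(3, {(0, 2), (1, 2)}))  # [1, 1, 2]: two hidden layers
print(neuron_depths(3, {(0, 1), (1, 2)}))  # [1, 2, 3]: three hidden layers
```

The same three neurons form one, two, or three hidden layers depending only on which connections are active, which is why growing or pruning connections can change the depth.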
3.5 SCANN: Overall Workflow
The overall workflow for architecture synthesis is shown in Algorithm 2. The synthesis process iteratively alternates between architecture change and weight training. Thus, the network architecture evolves along the way. The growth phase uses gradient information to gradually grow connections and activation information to grow neurons, in order to achieve the desired accuracy. In the pruning phase, the magnitude information is used to remove the redundant connections. After a specified number of iterations, the checkpoint that achieves the best performance on the validation set is output as the final network.
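A skeleton of this alternating loop is sketched below. It is not the paper's Algorithm 2: the callables `change_architecture`, `train`, and `validate` are hypothetical placeholders, each assumed to return a new model object rather than mutate its input, and the toy usage replaces a real network with a single number.

```python
def synthesize(model, change_architecture, train, validate, iterations):
    """Alternate architecture change and weight training.

    Returns the checkpoint with the best validation score and that score.
    """
    best_model, best_acc = model, validate(model)
    for step in range(iterations):
        model = change_architecture(model, step)  # growth and/or pruning
        model = train(model)                      # weight training
        acc = validate(model)
        if acc > best_acc:                        # keep the best checkpoint
            best_model, best_acc = model, acc
    return best_model, best_acc


# Toy stand-ins: the "model" is just its number of connections.
grow_then_prune = lambda n, step: n + 40 if step % 2 == 0 else n - 30
toy_validate = lambda n: 1.0 - abs(n - 120) / 200   # pretend 120 is ideal
print(synthesize(100, grow_then_prune, lambda n: n, toy_validate, 4))
# (120, 1.0)
```

In the toy run the score dips after each growth step and recovers after pruning, and the loop returns the best checkpoint rather than the last one.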
Next, we first elaborate on the three basic architecture-changing operations and then introduce three different training schemes based on how the architectures evolve. The process of applying architecture-changing operations in the flow of Algorithm 2 differs in each training scheme. In general, we found that using iterative growth-and-pruning enables both higher accuracies and compression ratios. Fig. 5 shows the accuracy versus training epochs when applying SCANN to the MobileNetV2 architecture for the ImageNet dataset. As can be seen, pruning leads to a drop in accuracy after the growth operation. However, applying growth and pruning over multiple iterations enables the architecture to recover from the loss in performance.
3.6 Basic Architecture-changing Operations
Three basic operations, connection growth, neuron growth, and connection pruning, are used to evolve a feed-forward network architecture through multiple iterations. Fig. 6 shows a simple example in which an MLP architecture with one hidden layer evolves into a non-MLP architecture with two hidden layers, with a sequence of basic operations mentioned above.
Next, we describe these three operations. We denote the $i$-th hidden neuron as $n_i$, its activity as $x_i$, and its preactivity as $u_i$, where $x_i = f(u_i)$ and $f$ is the nonlinear activation function. We denote the depth of $n_i$ by $D_i$ and the loss function by $L$. Finally, we denote the connection between $n_i$ and $n_j$, where $D_i \le D_j$, as $w_{ij}$. In our implementation, we use a mask-based approach to ignore the dormant connections.
3.6.1 Connection Growth
The connection growth algorithm greedily adds connections between neurons that are unconnected. The initial weights of all newly added connections are set to 0. Depending on how connections are added, we use two different methods, as shown in Algorithm 3.
- Gradient-based growth: Gradient-based growth was proposed by Dai et al. [10]. It adds connections that tend to reduce the loss function significantly. Suppose two neurons $n_i$ and $n_j$ are not connected and $D_i \le D_j$; then gradient-based growth adds a new connection $w_{ij}$ if $|\partial L / \partial w_{ij}|$ is large. We evaluate $|\partial L / \partial w_{ij}|$ for all the dormant connections and activate the ones with the largest values. This is done based on a large data batch. We use a predefined threshold to activate the dormant connections. This threshold can be chosen based on a certain percentage of elements in the computed gradient matrix.
  The intuition behind this approach is the Hebbian theory, often summarized as “neurons that fire together wire together” [39]. Connections activated on this basis have a strong correlation between presynaptic and postsynaptic cells, which translates into large $|\partial L / \partial w_{ij}|$ values.
- Full growth: Full growth restores all possible connections to the network.
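As a rough illustration of gradient-based growth (not a reproduction of Algorithm 3), the sketch below averages the weight gradients over a batch and activates the dormant connections with the largest gradient magnitudes. It assumes that the per-sample derivatives of the loss with respect to each preactivity (`deltas`) are already available from a backward pass, that neurons are indexed by depth so only connections from i to j with i < j are candidates, and that the threshold is expressed as a fraction of the dormant candidates. All function and variable names are hypothetical.

```python
def batch_weight_gradients(deltas, activities):
    """Average dL/dw[j][i] = (dL/du_j) * x_i over a batch (chain rule)."""
    n = len(activities[0])
    batch = len(activities)
    return [[sum(d[j] * x[i] for d, x in zip(deltas, activities)) / batch
             for i in range(n)] for j in range(n)]

def gradient_based_growth(grad, weights, mask, fraction):
    """Activate the given fraction of dormant connections with largest |grad|."""
    candidates = [(abs(grad[j][i]), j, i)
                  for j in range(len(mask)) for i in range(j)
                  if mask[j][i] == 0]
    candidates.sort(reverse=True)                # largest magnitude first
    n_new = int(fraction * len(candidates))
    for _, j, i in candidates[:n_new]:
        mask[j][i] = 1
        weights[j][i] = 0.0                      # new connections start at zero
    return [(j, i) for _, j, i in candidates[:n_new]]

weights = [[0.0] * 3 for _ in range(3)]
mask = [[0] * 3 for _ in range(3)]
weights[1][0], mask[1][0] = 0.4, 1               # already active
grad = batch_weight_gradients(deltas=[[0.0, 0.1, 3.0]],
                              activities=[[1.0, 2.0, 0.0]])
grown = gradient_based_growth(grad, weights, mask, fraction=0.5)
```

In this toy case the two dormant candidates have gradient magnitudes 3.0 and 6.0, so only the connection from neuron 1 to neuron 2 is activated.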
3.6.2 Neuron Growth
Neuron growth adds new neurons to the network, thereby gradually increasing the network size. Algorithm 4 shows the process of neuron growth. By analogy with biological cell division, neuron growth can be achieved by duplicating an existing neuron. To do this, we process a large batch of data through the network and compute the activation values of the neurons in the architecture. We choose the neurons with the highest activation values for duplication. To break the symmetry, random noise is added to the weights of all the connections related to each newly added neuron.
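The duplication step can be sketched as follows. This is a simplified illustration rather than Algorithm 4: it assumes a single hidden layer stored as two weight matrices, takes the per-neuron mean activation as a precomputed input, and uses Gaussian noise with an arbitrary standard deviation. The names `neuron_growth`, `w_in`, `w_out`, and `noise_std` are hypothetical.

```python
import random

def neuron_growth(w_in, w_out, mean_activation, n_new, noise_std=0.01, seed=0):
    """Duplicate the n_new hidden neurons with the highest mean activation.

    w_in[h]  : incoming weights of hidden neuron h (one row per hidden neuron)
    w_out[o] : weights from every hidden neuron to output neuron o
    Gaussian noise on the copied weights breaks the symmetry between a neuron
    and its duplicate.
    """
    rng = random.Random(seed)
    ranked = sorted(range(len(w_in)), key=lambda h: mean_activation[h],
                    reverse=True)
    chosen = ranked[:n_new]
    for h in chosen:
        w_in.append([w + rng.gauss(0.0, noise_std) for w in w_in[h]])
        for row in w_out:
            row.append(row[h] + rng.gauss(0.0, noise_std))
    return chosen

w_in = [[0.2, -0.1], [0.7, 0.3], [-0.4, 0.5]]   # 3 hidden neurons, 2 inputs
w_out = [[0.1, 0.9, -0.2]]                       # 1 output neuron
chosen = neuron_growth(w_in, w_out, mean_activation=[0.1, 0.9, 0.5], n_new=1)
```

Here hidden neuron 1 has the highest activation, so a noisy copy of its incoming and outgoing weights is appended as a fourth hidden neuron.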
3.6.3 Connection Pruning
Connection pruning disconnects previously connected neurons and reduces the number of network parameters. If all connections associated with a neuron are pruned, then the neuron is removed from the network. We adopt a widely-used method [28, 10] to prune connections with small magnitude, as shown in Algorithm 5. The rationale behind it is that since small weights have a relatively small influence on the network, DNN performance can be restored through retraining after pruning.
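A minimal sketch of magnitude-based pruning with masks is given below. It is an illustration under stated assumptions, not Algorithm 5: the pruning amount is expressed as a fraction of the currently active connections, the network is stored as one square weight matrix with a matching mask, and a neuron is reported as removable when no active connection enters or leaves it. The function names are hypothetical.

```python
def magnitude_prune(weights, mask, fraction):
    """Deactivate the given fraction of active connections with smallest |w|."""
    active = sorted((abs(weights[j][i]), j, i)
                    for j in range(len(mask)) for i in range(len(mask[j]))
                    if mask[j][i] == 1)
    n_prune = int(fraction * len(active))
    for _, j, i in active[:n_prune]:
        mask[j][i] = 0
        weights[j][i] = 0.0
    return n_prune

def disconnected_neurons(mask):
    """Neurons with no active incoming or outgoing connection."""
    n = len(mask)
    return [k for k in range(n)
            if not any(mask[k]) and not any(mask[j][k] for j in range(n))]

weights = [[0.0] * 3 for _ in range(3)]
mask = [[0] * 3 for _ in range(3)]
weights[1][0], mask[1][0] = 0.05, 1
weights[2][0], mask[2][0] = 0.8, 1
weights[2][1], mask[2][1] = -0.01, 1
pruned = magnitude_prune(weights, mask, fraction=0.67)
removable = disconnected_neurons(mask)
```

After the two smallest weights are pruned, neuron 1 has no remaining connections and can be removed, while the large weight from neuron 0 to neuron 2 survives.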
3.7 Training Schemes
In practice, depending on how the initial network architecture and basic operations in Step (a) of Algorithm 2 are chosen, we adopt three training schemes in our experiments, as explained next. These training schemes enable us to synthesize different architectures with various structures, and yield compact and accurate models for each dataset.
3.7.1 Scheme A
Scheme A is a constructive approach, in which we start with a tiny network and gradually increase the network size. This can be achieved either by performing connection and neuron growth more often than connection pruning, or by carefully selecting the growth and pruning rates such that each growth operation adds a large number of connections and neurons while each pruning operation removes only a small number of connections.
To implement this scheme, we specify the initial (minimum) number of hidden neurons in the architecture, as well as the maximum number of hidden neurons allowed in the final model. The scheme starts from the initial small number of hidden neurons and applies connection growth, neuron growth, and connection pruning, in this order. Each neuron growth phase adds a certain number of neurons to the architecture (e.g., 5 or 10 neurons). In the connection growth process, we use gradient-based growth to add a certain top percentile of connections (e.g., top ) to the network. Connection pruning is used to prune the network after each growth phase.
3.7.2 Scheme B
Scheme B is a destructive approach, where we start with a large network and make the network smaller by iteratively pruning connections. One approach for accomplishing this [28, 10] is based on iteratively pruning a small number of connections and then training the weights. This gradually reduces the network size and finally results in a small network after many iterations. We use a different method in Scheme B. Rather than pruning the network gradually, we prune the network aggressively to a tiny size. However, to recover the performance, we repeatedly prune the network and then grow the network back, rather than just perform gradual pruning and retraining.
To implement this scheme, we start with a network architecture with a large number of hidden neurons. We consider the initial point as the maximum allowed number of hidden neurons in the architecture. We apply iterative gradient-based connection growth and magnitude-based connection pruning to train both the architecture and weights. For the connection growth process, we use the gradient-based growth to add a certain top percentile (e.g., to ) connections to the network. Subsequently, we use aggressive connection pruning to reduce the number of connections drastically. In addition, we train the architecture for to epochs after applying each architecture-changing operation. We perform these operations for several (-) iterations.
3.7.3 Scheme C
Like Scheme B, Scheme C is a destructive approach. The main difference is that Scheme C uses MLP architectures. This is achieved by restricting the connection growth algorithm to connections between adjacent layers, disallowing skip connections. Scheme C can be viewed as an iterative version of the dense-sparse-dense technique proposed in [40].
To implement Scheme C, we start with an FC MLP architecture and apply connection pruning to drastically reduce the number of connections in the network. Then, in several iterations, we apply full growth to recover all the connections in the network, followed by connection pruning to reduce network size.
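The alternation described for Scheme C can be sketched for a single fully connected layer as below. This is only an outline under several assumptions: the pruning target is given as a number of connections to keep, restored connections start at weight zero, and training after each operation is represented by a placeholder callback `train` (a no-op by default). None of these names come from the original implementation.

```python
def prune_smallest(weights, mask, keep):
    """Keep only the `keep` largest-magnitude active connections of a layer."""
    active = sorted(((abs(w), r, c)
                     for r, row in enumerate(weights)
                     for c, w in enumerate(row) if mask[r][c]),
                    reverse=True)
    for _, r, c in active[keep:]:
        mask[r][c] = 0
        weights[r][c] = 0.0

def full_growth(mask):
    """Restore every connection between the two adjacent layers."""
    for row in mask:
        for c in range(len(row)):
            row[c] = 1

def scheme_c(weights, mask, keep, n_iterations, train=lambda w, m: None):
    """Initial aggressive pruning, then repeated full growth and pruning."""
    prune_smallest(weights, mask, keep)
    train(weights, mask)
    for _ in range(n_iterations):
        full_growth(mask)
        train(weights, mask)          # placeholder for weight training
        prune_smallest(weights, mask, keep)
        train(weights, mask)
    return sum(map(sum, mask))        # number of active connections

weights = [[0.9, -0.1, 0.2], [0.05, -0.7, 0.3]]
mask = [[1, 1, 1], [1, 1, 1]]
calls = []
remaining = scheme_c(weights, mask, keep=2, n_iterations=3,
                     train=lambda w, m: calls.append(1))
```

With a real training callback, the weights restored by full growth would move away from zero between iterations, so the set of surviving connections can change from one pruning step to the next.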
Fig. 7 shows examples of the initial and final architectures for each scheme. Both Schemes A and B evolve general feed-forward architectures, thus allowing network depth to be learned during training. Scheme C evolves an MLP structure, thus keeping the number of layers fixed.
4 Experimental Results
In this section, we evaluate the performance of DR+SCANN and SCANN on nine small to medium size datasets, as well as on MNIST and ImageNet datasets. Table II shows the characteristics of the nine datasets. For such non-image datasets, we compare our synthesized DNN model with the FC DNN architecture that performs the best on the validation set. For the MNIST and ImageNet datasets, we compare our synthesized models with various well-known benchmark architectures.
The evaluation results are divided into two parts. Section 4.1 discusses results obtained by SCANN when applied to the image datasets MNIST and ImageNet. As we will see, SCANN generates neural networks whose classification accuracy is similar to that of well-known architectures, but with far fewer parameters and FLOPs.
In Section 4.2, we present experimental results for the DR, SCANN, and DR+SCANN methodologies on nine non-image datasets. We demonstrate that the DNNs generated by SCANN and DR+SCANN are very compact and energy-efficient while maintaining performance. These results open up opportunities to use SCANN-generated DNNs in energy-constrained edge devices and IoT sensors.
4.1 Experiments with MNIST and ImageNet
MNIST is a well-studied dataset of handwritten digits. It contains 60000 training images and 10000 test images. We set aside 10000 images from the training set as the validation set. We first target the LeNet-5 Caffe model [13]. The LeNet-5 Caffe model contains two convolutional layers with 20 and 50 filters, and one FC hidden layer with 500 neurons.
We use the stochastic gradient descent (SGD) optimizer with a learning rate of 0.03, momentum of 0.9, and weight decay of 1e-4. For Schemes A and B, the feed-forward part of the network is learnt by SCANN, while the convolutional part of the architecture is kept the same. For Scheme A, we start from 300 hidden neurons in the hidden layer and set the maximum number of neurons to 500. At the beginning, we use connection pruning to prune 95 percent of the connections in the network. Subsequently, we apply connection growth, neuron growth, and connection pruning over several iterations. The neuron growth operation duplicates the five neurons in the architecture with the highest activation values. Connection growth activates 35 percent of all connections and connection pruning prunes 25 percent of the existing connections. In Scheme B, the best results correspond to 400 hidden neurons in the feed-forward part. We iteratively perform a sequence of connection pruning, such that 19.3K connections are left in the architecture, and connection growth, such that 90 percent of all connections are restored. In Scheme C, we start the feed-forward part of the network with the FC part of the baseline architecture. We iteratively prune the network to its final number of parameters and then use connection growth to restore all connections.
Table III summarizes the results. The baseline error rate is 0.72% with 430.5K parameters. The most compressed model generated by SCANN contains only 9.3K parameters (a compression ratio of 46.3× over the baseline) and achieves the same 0.72% error rate when using Scheme C. Scheme A obtains the best error rate of 0.68%, but with a lower compression ratio of 2.3×. For a fair comparison, we implement the method given in [28] on the same data split.
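The compression ratio quoted for the most compressed LeNet-5 model follows directly from the two parameter counts given above. The short check below makes the arithmetic explicit; the helper name `compression_ratio` is introduced here only for illustration.

```python
def compression_ratio(baseline_params, compressed_params):
    """Ratio of baseline to compressed parameter counts."""
    return baseline_params / compressed_params

# LeNet-5 Caffe baseline versus the most compressed SCANN model (Scheme C).
ratio = compression_ratio(430.5e3, 9.3e3)
print(f"{ratio:.1f}x")  # prints "46.3x"
```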
Next, we study the impact of the seed architecture on GPU time (Nvidia Tesla P) for growth and pruning operations on the MNIST dataset. Fig. 8 demonstrates this trend for different numbers of maximum hidden neurons in the architecture (Scheme B was used in this case). The growth operation is more computationally intensive than the pruning operation. This is because while magnitude-based pruning only needs the forward pass, gradient-based growth needs both forward and backward passes on the network. In addition, as the number of hidden neurons in the architecture increases, the GPU time of both operations increases, as expected.
We now use the feed-forward architecture proposed by Ciresan et al. [41] as the baseline architecture for SCANN synthesis. This architecture has six layers with 2500, 2000, 1500, 1000, 500, and 10 neurons, respectively. As shown in Table IV, this baseline reduces the error rate to just 0.35% through a dramatic increase in the number of parameters to 12.0M. It represents the state-of-the-art test accuracy for an FC DNN on the MNIST benchmark. It consumes 23.9M FLOPs. Thus, this network is computationally intensive and has significant memory requirements. We use this architecture as the starting point for SCANN Scheme C. We use the SGD optimizer with an initial learning rate of e- and gradually decrease it to e-. We use connection pruning to remove percent of the connections in the network, and connection growth to restore all the connections. Through iterative growth and pruning, we are able to synthesize a much more compact architecture. We are able to reduce the number of parameters by and computational cost by with only a increase in error rate.
To demonstrate the applicability of SCANN to different architectures and to datasets of various sizes, we also use SCANN to synthesize DNNs for the ImageNet dataset [1]. Table V shows the results of our experiments. For the ImageNet experiments, we use the SGD optimizer with an initial learning rate of 0.05 and gradually decrease it to 1-4. The weight decay is set to e-. We initialize the grow-and-prune process with the VGG-16 [3] and MobileNetV2 [6] architectures. VGG-16 consists of 13 convolutional layers, 5 max-pooling layers, and 3 FC layers. We use SCANN to optimize the FC layers, where most of the parameters reside, learning both the connections and the weights during training. VGG-16 has 138.4M parameters, of which its FC layers contribute 123.6M. Thus, reducing the number of parameters in the FC layers can have a significant impact on the model size. Our best result is obtained using SCANN Scheme B. We set the number of hidden neurons in the architecture to 4000. Initially, we prune 95 percent of the connections. Next, we use connection growth to grow 30 percent of the connections, followed by connection pruning to leave only 2.5M connections in the feed-forward part of the architecture. As a result, SCANN reduces the number of parameters to M for an 8.0× reduction. In addition, SCANN reduces the top-1 error rate by 1.7% to . However, since most of the computational cost of a CNN architecture lies in its convolution operations, SCANN is not able to reduce the FLOPs much.
MobileNetV2 is an architecture optimized for mobile devices. Hence, it has reduced computational cost. Its FC layer contains of all its parameters. Keeping the rest of the architecture fixed, we use SCANN Scheme C to optimize the FC layer. We use connection pruning to remove 800K connections in the FC layer. Subsequently, we use connection growth to restore all the connections. Using iterative growth and pruning, we can reduce the number of parameters by 1.3× at the cost of a slight 0.2% increase in the error rate.
While RL-based architecture search approaches, such as NASNet [20], consume around 2000 GPU days for the ImageNet dataset, SCANN requires around 20 GPU days for optimizing the FC layers of a given architecture.
4.2 Experiments with Other Datasets
To demonstrate the capability of SCANN and DR+SCANN for synthesizing accurate and compact neural network models for various non-image datasets, we experiment with nine datasets from the UCI machine learning repository [42] and Statlog collection [43]. Next, we present evaluation results on these datasets.
SCANN experiments are based on the Adam optimizer with a learning rate of and weight decay of e-. We compare results obtained by DR+SCANN with those obtained by only applying SCANN and also DR without using SCANN in a secondary compression step. Table VI shows the classification test accuracy obtained. The MLP column shows the accuracy of the best MLP baseline found with the help of the validation set. For all the other methods, we present two columns, the left of which shows the highest accuracy (H.A.) achieved whereas the right one shows the result for the most compressed (M.C.) network. Furthermore, in the DR columns, the DR method employed is shown in parentheses. In the DR columns, whereas the M.C. and H.A. columns employ the same DR method, they may use different DR ratios. Table VII shows the number of parameters in the network for the corresponding columns in Table VI.
SCANN-generated networks show improved accuracy for six of the nine datasets, as compared to the MLP baseline. The accuracy increase is between to . These results correspond to networks that are to smaller than the baseline architecture. Furthermore, DR+SCANN shows improvements on the highest classification accuracy in five out of the nine datasets, as compared to SCANN-generated results. In addition, SCANN yields DNNs that achieve the baseline accuracy with fewer parameters in seven out of the nine datasets. For these datasets, the results show a parameter compression ratio between to . Moreover, as shown in Tables VI and VII, combining DR with SCANN helps achieve higher compression ratios. For these seven datasets, DR+SCANN can meet the baseline accuracy with a to smaller network. This shows a significant improvement over the compression ratio achievable by just using SCANN.
We also report the performance of applying DR without the benefit of the SCANN synthesis step. While these results show improvements, DR+SCANN can be seen to have much more compression power, relative to when DR and SCANN are used separately. This points to a synergy between DR and SCANN.
Although classification performance is of great importance, in applications where computing resources are limited, e.g., in battery-operated devices, energy efficiency can be a very important concern. Thus, the energy performance of the models should also be taken into consideration in such cases. To evaluate the energy performance, we use the energy analysis method proposed in [44], where the energy consumption for inference is calculated based on the number of multiply-accumulate (MAC) and comparison operations and the number of SRAM accesses. For example, a multiplication of two matrices of size and would require MAC operations and SRAM accesses. In this energy model, a single MAC operation, SRAM access, and comparison operation implemented in a 130 nm CMOS process (which may be an appropriate technology for many IoT sensors) consume 11.8 pJ, 34.6 pJ, and 6.16 pJ, respectively. Table VIII shows the energy consumption estimates per inference for the models presented in Tables VI and VII. Note that the energy consumption does not include dataset DR. However, some of the DR methods, like RP, just require a single matrix-vector multiplication. Hence, such methods do not have much energy overhead.
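The energy model described above is a weighted sum of operation counts, which the following sketch makes concrete. The per-operation costs are the values quoted in the text, treated here as picojoules; the operation counts in the example are hypothetical and are not taken from Table VIII, and the function name is introduced only for illustration.

```python
# Per-operation energy costs quoted in the text (assumed unit: picojoules).
ENERGY_MAC = 11.8
ENERGY_SRAM = 34.6
ENERGY_CMP = 6.16

def inference_energy(n_mac, n_sram, n_cmp):
    """Energy of one inference as a weighted sum of operation counts."""
    return n_mac * ENERGY_MAC + n_sram * ENERGY_SRAM + n_cmp * ENERGY_CMP

# Hypothetical operation counts for a dense baseline and a pruned network.
dense = inference_energy(n_mac=10_000, n_sram=20_000, n_cmp=100)
sparse = inference_energy(n_mac=500, n_sram=1_000, n_cmp=20)
saving = dense / sparse
```

Since SRAM accesses carry the largest per-operation cost in this model, removing connections lowers energy mainly by reducing memory traffic rather than arithmetic alone.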
As can be seen, SCANN models are significantly more energy-efficient compared to FC baselines. In addition, DR+SCANN can be seen to have the best overall energy performance. Except for the Letter dataset (for which the energy reduction is only 17 percent), the compact DNNs generated by DR+SCANN consume one to four orders of magnitude less energy than the baseline MLP models. Thus, SCANN and DR+SCANN synthesis methodologies are suitable for heavily energy-constrained devices, such as IoT sensors.
5 Discussion
The advantages of SCANN are derived from its core benefit: the network architecture is allowed to dynamically evolve during training. This benefit is not directly available in several other existing automatic DNN architecture synthesis techniques, such as the evolutionary and RL-based approaches. In those methods, a new architecture, whether generated through mutation and crossover in the evolutionary approach or from the controller in the RL approach, needs to be fixed during training and trained from scratch again when the architecture is changed. However, human learning is incremental. Our brain gradually changes based on the presented stimuli. For example, studies of the human neocortex have shown that up to 40 percent of the synapses are rewired every day [45]. Hence, from this perspective, SCANN takes inspiration from how the human brain evolves incrementally. SCANN’s dynamic rewiring is easily achieved through connection growth and pruning.
Comparisons between SCANN and DR+SCANN show that the latter results in a smaller network in nearly all the cases. This is due to preceding SCANN with DR. By mapping data instances into lower dimensions, it reduces the number of neurons in each layer of the DNN, without degrading performance. This enables SCANN to start with a significantly smaller DNN. However, a limitation of SCANN is that it can only evolve feed-forward networks. How to extend SCANN to the convolutional layers in CNNs and recurrent neural networks is the focus of our future work.
6 Conclusion
In this article, we proposed a synthesis methodology that can generate compact and accurate neural networks. It solves the problem of having to fix the depth of the network during training, from which prior synthesis methods suffer. It is able to evolve an arbitrary feed-forward network architecture with the help of three basic operations: connection growth, neuron growth, and connection pruning. Furthermore, by combining DR with SCANN synthesis, we showed significant improvements in the network compression power of this framework. Experiments on the MNIST and ImageNet image datasets, and on several other small to medium non-image datasets, showed that SCANN and DR+SCANN can provide a good tradeoff between accuracy, model compression, and energy efficiency in applications where computing resources are limited.
References
- [1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009, pp. 248–255.
- [2] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Technical report, University of Toronto, 2009.
- [3] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. Int. Conf. Learning Representations, Y. Bengio and Y. LeCun, Eds., 2015.
- [4] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2015, pp. 1–9.
- [5] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 770–778.
- [6] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted residuals and linear bottlenecks,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
- [7] S. Hassantabar, N. Stefano, V. Ghanakota, A. Ferrari, G. N. Nicola, R. Bruno, I. R. Marino, and N. K. Jha, “CovidDeep: SARS-CoV-2/COVID-19 test based on wearable medical sensors and efficient neural networks,” arXiv preprint arXiv:2007.10497, 2020.
- [8] S. Hassantabar, M. Ahmadi, and A. Sharifi, “Diagnosis and detection of infected tissue of COVID-19 patients based on lung X-ray image using convolutional neural network approaches,” Chaos, Solitons & Fractals, vol. 140, p. 110170, 2020.
