Learning Sparse Neural Networks via $\ell_0$ and T$\ell_1$ by a Relaxed Variable Splitting Method with Application to Multi-scale Curve Classification
Fanghui Xue, Jack Xin

TL;DR
This paper introduces a relaxed variable splitting method to sparsify convolutional neural networks using $\ell_0$ and T$\ell_1$ penalties, achieving high accuracy with significant weight reduction, especially in complex curve classification tasks.
Contribution
The paper presents a novel optimization approach for neural network sparsification using $\ell_0$ and T$\ell_1$ penalties, demonstrating effective pruning in CNNs for complex curve classification.
Findings
Achieved over 99% test accuracy with 86% sparsity in the fully connected layer.
Comparable sparsity and accuracy with both $\ell_0$ and T$\ell_1$ penalties.
Effective classification of shaky vs. regular fonts and handwriting.
Abstract
We study sparsification of convolutional neural networks (CNN) by a relaxed variable splitting method of $\ell_0$ and transformed-$\ell_1$ (T$\ell_1$) penalties, with application to complex curves such as texts written in different fonts, and words written with trembling hands simulating those of Parkinson's disease patients. The CNN contains 3 convolutional layers, each followed by a maximum pooling, and finally a fully connected layer which contains the largest number of network weights. With the $\ell_0$ penalty, we achieved over 99% test accuracy in distinguishing shaky vs. regular fonts or handwriting, with above 86% of the weights in the fully connected layer being zero. Comparable sparsity and test accuracy are also reached with a proper choice of T$\ell_1$ penalty.

| $\lambda$ | $\beta$ | a | Penalty | Sparsity (%) | Accuracy (%) |
|---|---|---|---|---|---|
| 0.0005 | 0.1 | | $\ell_0$ | 86.1 | 99.4 |
| | | 0.01 | TL1 | 87.6 | 99.0 |
| | | 0.1 | TL1 | 85.8 | 99.7 |
| | | 1 | TL1 | 78.1 | 99.3 |
| | | 100 | TL1 | 82.0 | 99.3 |
| | | | $\ell_1$ | 76.5 | 99.0 |

| $\lambda$ | $\beta$ | a | Penalty | Sparsity (%) | Accuracy (%) |
|---|---|---|---|---|---|
| 0.0005 | 0.1 | | $\ell_0$ | 90.2 | 99.9 |
| | | 0.01 | TL1 | 83.5 | 99.1 |
| | | 0.1 | TL1 | 87.6 | 99.8 |
| | | 1 | TL1 | 74.9 | 99.9 |
| | | 100 | TL1 | 75.0 | 99.9 |
| | | | $\ell_1$ | 74.6 | 99.6 |

| a | Algorithm | Sparsity (%) of scale | | | | | Accuracy (%) |
|---|---|---|---|---|---|---|---|
| 0.01 | RVSM | 99.7 | 96.0 | 91.3 | 88.6 | 84.9 | 99.5 |
| 0.01 | SGD | 99.9 | 99.9 | 45.9 | 5.44 | | 96.7 |
| 100 | RVSM | 99.9 | 97.5 | 92.7 | 88.5 | 80.3 | 99.3 |
| 100 | SGD | 99.9 | 99.7 | 48.1 | 6.68 | | 99.0 |
Department of Mathematics, UC Irvine, Irvine, CA 92697, U.S.A.
Email: {fanghuix,jack.xin}@uci.edu
Keywords:
Convolutional Neural Network, Sparsification, Multi-Scale Curves, Classification.
1 Introduction
Sparsification of neural networks is one of the effective complexity reduction methods to improve efficiency and generalizability [4, 3]. In this paper, we sparsify convolutional neural networks (CNN) for classifying curves with multi-scale structures. Such curves arise in the handwriting of people with neurological disorders, e.g. Parkinson's disease (PD) patients, and in neuropsychological exams. Distinguishing the handwriting of normal and PD subjects computationally will greatly help diagnosis and reduce physicians' workload in evaluations.
People with PD tend to lose control of their hands, and their writing or drawing shows oscillatory behavior as shown in Fig. 2, a century-old image available online. Such oscillatory features can be learned during CNN training. Since we do not have a large amount of PD handwriting, we generate on the computer a large number of oscillatory shapes that mimic the shaky writing of PD subjects. Indeed, we found that a CNN is quite successful for this task and can reach accuracy as high as 99% on our synthetic data set with three convolutional layers and one fully connected layer, as shown in Fig. 1. However, we also found that there is a lot of redundancy in the weights of the trained CNNs, especially in the fully connected layer, where we aim to significantly sparsify the network weights with minimal loss of accuracy.
Since the natural sparsity-promoting penalty $\ell_0$ is discontinuous, we adopt the relaxed variable splitting method (RVSM, [3]) for network sparsification. Even though Lipschitz continuous penalties such as $\ell_1$ and transformed-$\ell_1$ [2, 8] are almost everywhere differentiable, the splitting approach [3] is more effective for enforcing sparsity than directly placing a penalty function inside the stochastic gradient descent (SGD) algorithm. The RVSM is also much simpler than the statistical $\ell_0$ regularization approach in [4]. A systematic comparison with [4] will be conducted elsewhere.
The rest of the paper is organized as follows. In section 2, we review RVSM for the $\ell_0$, transformed-$\ell_1$, and $\ell_1$ penalties and present a convergence theorem. A new critical point condition is introduced for the limit. We apply RVSM to CNNs for multi-scale curve classification. In section 3, we describe our data set, the CNN architecture and training, and the CNN performance in terms of network accuracy and sparsity. We compare the weight distributions of sparse and non-sparse networks. Concluding remarks are in section 4.
2 Sparse Neural Network Training Algorithm
When training neural networks, one minimizes a penalized objective function of the form:
$$ f(w) + \lambda\, P(w), $$
where $f(w)$ is a standard loss function in neural network models such as cross entropy [7], and $P(w)$ is a penalty function. In SGD, the expected loss is replaced by an empirical loss over batches of training samples [7]. In this section, we consider the expected loss function, which has better regularity than the empirical loss functions [6] and is more conducive to analysis. In the actual training, SGD and the sample-averaged empirical loss function are used. The standard penalty is the $\ell_2$ norm, also known as weight decay. However, the $\ell_2$ penalty cannot reduce the number of redundant parameters, resulting in a network with on the order of millions of nonzero weights. Thus we turn to the $\ell_0$ penalty, which produces zero weights during training [4] but leads to a non-convex discontinuous optimization problem. In [4], a statistical approach is proposed to regularize $\ell_0$. In this paper, we utilize the Relaxed Variable Splitting Method (RVSM) studied in [3] for a neural network regression problem. RVSM is much simpler to state and implement than [4]. To this end, let us consider the following objective function for a parameter $\beta > 0$:
$$ L_\beta(u, w) = f(w) + \lambda\, P(u) + \frac{\beta}{2}\, \|w - u\|_2^2 . $$
Let $\eta$ be the learning rate. We minimize $L_\beta$ with the RVSM algorithm, where the $u$ step is thresholding and the $w$ step is gradient descent followed by a normalization.
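The two alternating steps can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes that the $w$ step is a gradient step on $f(w) + \frac{\beta}{2}\|w-u\|_2^2$ with learning rate $\eta$, that the normalization rescales $w$ to unit $\ell_2$ norm, and that the $u$ step is hard thresholding with threshold $\sqrt{2\lambda/\beta}$ (the $\ell_0$ case). The toy quadratic loss, the names `hard_threshold` and `rvsm`, and all numerical values are hypothetical.

```python
import math

def hard_threshold(v, gamma):
    """Componentwise hard thresholding: keep entries with |v_i| > gamma."""
    return [x if abs(x) > gamma else 0.0 for x in v]

def rvsm(grad_f, w0, lam, beta, eta, n_steps):
    """Alternate a normalized gradient step in w with a thresholding step in u."""
    gamma = math.sqrt(2.0 * lam / beta)  # l0 threshold implied by the splitting term
    w = list(w0)
    u = hard_threshold(w, gamma)
    for _ in range(n_steps):
        g = grad_f(w)
        # w step: gradient descent on f(w) + beta/2 * ||w - u||^2 ...
        w = [wi - eta * gi - eta * beta * (wi - ui) for wi, gi, ui in zip(w, g, u)]
        # ... followed by a normalization (unit l2 norm is an assumption here)
        norm = math.sqrt(sum(wi * wi for wi in w))
        w = [wi / norm for wi in w]
        # u step: thresholding of the current w
        u = hard_threshold(w, gamma)
    return u, w

# Toy loss f(w) = 0.5 * ||w - target||^2, a hypothetical stand-in for a network loss.
target = [0.8, 0.6, 0.02, -0.03]

def grad_f(w):
    return [wi - ti for wi, ti in zip(w, target)]

u, w = rvsm(grad_f, w0=[0.5, 0.5, 0.5, 0.5], lam=0.0005, beta=0.1, eta=0.1, n_steps=500)
print("u =", u)
print("w =", w)
```

In this toy run the two small components of `w` settle below the threshold, so the corresponding entries of `u` are exactly zero while `w` itself stays dense; the sparse network is read off from `u`.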
The main theorem of [3] guarantees the convergence of the RVSM algorithm under some conditions on the parameters and initial weights in the case of a one-convolution-layer network and Gaussian input data. The latter conditions are used to prove that the loss function has a Lipschitz gradient away from the origin. Assuming that the Lipschitz gradient condition holds for $f$, we adapt the main result of [3] into:
Theorem 2.1.
Suppose that $f$ is bounded from below, and satisfies the Lipschitz gradient inequalities: , and , with , for some positive constants , , and . Then there exists a positive constant $\eta_0$ so that if $\eta < \eta_0$, the Lagrangian function $L_\beta(u^t, w^t)$ is descending and converging in $t$, with $(u^t, w^t)$ of the RVSM algorithm satisfying $\|(u^{t+1}, w^{t+1}) - (u^t, w^t)\| \to 0$ as $t \to \infty$, and subsequentially approaching a limit point $(\bar{u}, \bar{w})$.
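For the $\ell_0$ penalty, our objective function (the Lagrangian) becomes
$$ L_\beta(u, w) = f(w) + \lambda\, \|u\|_0 + \frac{\beta}{2}\, \|w - u\|_2^2 . $$
In this case, we simply obtain
$$ u = \arg\min_u L_\beta(u, w) = H_{\sqrt{2\lambda/\beta}}(w), \tag{2.1} $$
where $H_\gamma(\cdot)$ is the hard-thresholding operator [1]. On each component
$$ H_\gamma(w_i) = \begin{cases} 0, & |w_i| \le \gamma, \\ w_i, & |w_i| > \gamma. \end{cases} $$
For the $\ell_1$ case, it is also clear that
$$ u = \arg\min_u L_\beta(u, w) = S_{\lambda/\beta}(w), \tag{2.2} $$
where $S_\gamma(\cdot)$ is the soft-thresholding operator [2]
$$ S_\gamma(w_i) = \operatorname{sgn}(w_i)\, \max(|w_i| - \gamma,\, 0). $$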
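As a small check on the two thresholding rules, the sketch below implements them per component and compares each against a brute-force grid minimization of $\lambda P(u) + \frac{\beta}{2}(w-u)^2$. It assumes this quadratic-splitting form of the $u$ subproblem, which gives the thresholds $\sqrt{2\lambda/\beta}$ (hard) and $\lambda/\beta$ (soft). The helper names and the values of `lam` and `beta` are illustrative only.

```python
import math

def hard_threshold(w, lam, beta):
    """Minimizer of lam * 1[u != 0] + beta/2 * (w - u)^2 over u."""
    return w if abs(w) > math.sqrt(2.0 * lam / beta) else 0.0

def soft_threshold(w, lam, beta):
    """Minimizer of lam * |u| + beta/2 * (w - u)^2 over u."""
    return math.copysign(max(abs(w) - lam / beta, 0.0), w)

def grid_argmin(w, lam, beta, penalty, half_width=2.0, n=40001):
    """Brute-force minimizer on a uniform grid that contains u = 0 exactly."""
    best_u, best_val = 0.0, float("inf")
    for k in range(n):
        u = -half_width + 2.0 * half_width * k / (n - 1)
        val = lam * penalty(u) + 0.5 * beta * (w - u) ** 2
        if val < best_val:
            best_u, best_val = u, val
    return best_u

def l0(u):
    return 0.0 if u == 0.0 else 1.0

def l1(u):
    return abs(u)

lam, beta = 0.0005, 0.1  # illustrative values
for w in (0.3, 0.05, -0.3, 0.003):
    print(w,
          hard_threshold(w, lam, beta), round(grid_argmin(w, lam, beta, l0), 4),
          soft_threshold(w, lam, beta), round(grid_argmin(w, lam, beta, l1), 4))
```

With these values the hard threshold is about 0.1 and the soft threshold 0.005, so an input such as 0.05 is zeroed by the hard rule but only shrunk by the soft rule.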
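We also consider the transformed-$\ell_1$ (TL1) penalty [8], which nicely interpolates the $\ell_0$ and $\ell_1$ penalties. It applies
$$ \rho_a(x) = \frac{(a+1)\,|x|}{a + |x|} $$
to each component of a vector, where $a$ is a positive parameter. It is clear that
$$ \lim_{a \to 0^+} \rho_a(x) = 1_{\{x \neq 0\}}, \qquad \lim_{a \to \infty} \rho_a(x) = |x| . $$
By solving the problem with TL1 penalty, we can also get a thresholding operator in closed form [8]:
$$ T_{\lambda, a}(x) = \begin{cases} 0, & |x| \le t, \\ g_\lambda(x), & |x| > t, \end{cases} \tag{2.3} $$
where
$$ g_\lambda(x) = \operatorname{sgn}(x) \left\{ \frac{2}{3}\,(a + |x|) \cos\!\left(\frac{\varphi(x)}{3}\right) - \frac{2a}{3} + \frac{|x|}{3} \right\} $$
and $\varphi(x) = \arccos\!\left(1 - \frac{27 \lambda a (a+1)}{2 (a + |x|)^3}\right)$. Here the parameter $t$ depends on $\lambda$ as:
$$ t = \begin{cases} \lambda\, \dfrac{a+1}{a}, & \lambda \le \dfrac{a^2}{2(a+1)}, \\[2mm] \sqrt{2 \lambda (a+1)} - \dfrac{a}{2}, & \lambda > \dfrac{a^2}{2(a+1)}. \end{cases} \tag{2.4} $$
In these formulas $\lambda$ is the penalty weight of the scalar problem $\min_y \frac{1}{2}(y - x)^2 + \lambda \rho_a(y)$ of [8]; in the $u$ step of RVSM the quadratic term carries the factor $\beta$, so $\lambda$ is replaced by $\lambda/\beta$.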
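The closed-form TL1 thresholding can be checked numerically. The sketch below is a plain-Python rendering of the formulas of [8] for the scalar problem $\min_y \frac{1}{2}(y-x)^2 + \lambda\rho_a(y)$ with $\rho_a(y) = (a+1)|y|/(a+|y|)$, compared against a brute-force grid search. The function names and test values are hypothetical, and the clamp inside `acos` is a numerical safeguard added here, not part of the formula.

```python
import math

def rho(y, a):
    """Transformed-l1 penalty on one component."""
    return (a + 1.0) * abs(y) / (a + abs(y))

def tl1_threshold_value(lam, a):
    """Threshold t: inputs with |x| <= t are mapped to zero."""
    if lam <= a * a / (2.0 * (a + 1.0)):
        return lam * (a + 1.0) / a
    return math.sqrt(2.0 * lam * (a + 1.0)) - a / 2.0

def tl1_prox(x, lam, a):
    """Closed-form minimizer of 0.5 * (y - x)^2 + lam * rho(y, a) over y."""
    if abs(x) <= tl1_threshold_value(lam, a):
        return 0.0
    arg = 1.0 - 27.0 * lam * a * (a + 1.0) / (2.0 * (a + abs(x)) ** 3)
    phi = math.acos(max(-1.0, arg))  # clamp guards against rounding just below -1
    g = (2.0 / 3.0) * (a + abs(x)) * math.cos(phi / 3.0) - 2.0 * a / 3.0 + abs(x) / 3.0
    return math.copysign(g, x)

def grid_argmin(x, lam, a, half_width=4.0, n=80001):
    """Brute-force minimizer of the same objective on a uniform grid."""
    best_y, best_val = 0.0, float("inf")
    for k in range(n):
        y = -half_width + 2.0 * half_width * k / (n - 1)
        val = 0.5 * (y - x) ** 2 + lam * rho(y, a)
        if val < best_val:
            best_y, best_val = y, val
    return best_y

for x, lam, a in [(2.0, 0.5, 1.0), (0.5, 0.5, 1.0), (1.0, 0.1, 1.0), (-1.3, 0.2, 0.5)]:
    print(x, lam, a, tl1_prox(x, lam, a), round(grid_argmin(x, lam, a), 4))
```

Taking `a` very large or very small in `tl1_prox` reproduces soft and hard thresholding respectively, which is the interpolation property stated above.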
Remark 1.
It follows from the theorem above that the limit point satisfies the equilibrium equations for the $\ell_0$, $\ell_1$ and transformed-$\ell_1$ penalties respectively as:
[TABLE]
The system (2.5) serves as a novel “critical point condition”. This is particularly useful in the $\ell_0$ case, where the Lagrangian function is discontinuous in $u$.
3 Experimental Results
We apply the RVSM algorithm to convolutional neural networks to see how it brings about a sparse network. In the following experiment, we consider a convolutional neural network with 3 convolutional layers and a data set of binary images. What we care about is the percentage of the weights that are zero after training the sparse network. Many of the algorithms yield high sparsity, which means that only a small fraction of the parameters contribute to the model. This makes our model far more efficient than the original one without regularization.
In order to find out how the weights are distributed in each layer, we go through the structure of the network. Figure 1 shows the number of nodes in each layer, from which we can simply calculate the number of weights needed to connect the nodes. (The figure was generated with a tool by Alex Lenail, available at http://alexlenail.me/NN-SVG/LeNet.html.) We apply 32 filters to the initial image to get the first convolutional layer, which results in weights. Similarly, each of the second and the third convolutional layers contains weights, if we apply 32 filters again. After each convolutional layer, we add one max pooling layer with a filter and a stride of 2. The dimension of each image is not changed after each convolution, since we have applied padding. But it is halved in both width and height after max pooling because of the stride of 2. Thus the dimension of the image is reduced from to , to and finally to . So this produces weights when constructing a dense layer of 128 nodes. Finally, $128 \times 2 = 256$ weights are used to connect the dense layer to the output layer of 2 nodes, since our goal is to classify the images into two categories. From the above discussion, we notice that of the weights are concentrated in the dense layer. We will see that most of them contribute nothing to the model after we train the sparse network.
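The counting argument can be reproduced with a short script. The input size and kernel size are not legible in this copy of the text, so the values below (a 104 x 104 single-channel image, 3 x 3 kernels, 2 x 2 pooling windows) are assumptions chosen only for illustration; the 32 filters per layer, the stride-2 pooling, the 128 dense nodes and the 2 output nodes come from the description above. Bias terms are ignored.

```python
def count_weights(image_size=104, kernel=3, filters=32, dense_nodes=128,
                  classes=2, n_conv=3):
    """Weight counts per layer for the CNN described in the text (biases ignored).

    image_size and kernel are hypothetical; the other defaults follow the text.
    """
    counts = {}
    size, in_channels = image_size, 1  # binary images: one input channel
    for i in range(1, n_conv + 1):
        counts[f"conv{i}"] = kernel * kernel * in_channels * filters
        size //= 2  # padded convolution keeps the size; stride-2 pooling halves it
        in_channels = filters
    counts["dense"] = size * size * filters * dense_nodes
    counts["output"] = dense_nodes * classes
    return counts

counts = count_weights()
total = sum(counts.values())
for name, n in counts.items():
    print(f"{name:7s} {n:8d}  ({100.0 * n / total:.1f}% of all weights)")
```

Under these assumed sizes the dense layer holds roughly 0.7 million weights and each later convolutional layer about 9 thousand, the same orders of magnitude as the figures of 700,000 and 10,000 quoted in the discussion of where thresholding is applied.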
The first data set we use consists of images of alphabet letters handwritten by Parkinson's disease (PD) patients and of normal handwritten letters. Many PD patients suffer from tremors in their daily life and work. One remarkable feature is that the words they write can be much shakier than normal, which can be used to identify a PD patient during diagnosis. Figure 2 (https://en.wikipedia.org/wiki/Micrographia_(handwriting)) shows a real example of a sentence handwritten by a PD patient.
From our point of view, these two writing styles (normal vs. shaky) can be treated as two fonts. There is one Parkinson's font available on the internet (https://www.dafont.com/parkinsons.font), which contains the whole alphabet of 52 uppercase and lowercase letters. We simulate a training set of 5,000 observations and a test set of 1,000 observations by adding some rotations, affine transformations and elastic distortions [5]. As we have mentioned, this is a data set of binary images, of which some samples are shown in Figure 3. Though our model is used to distinguish the letters written by a Parkinson's disease patient in this single experiment, it can simply be applied to classify any other fonts.
As most of the redundancy appears in the dense layer, we apply the thresholding step of the algorithm to the weights in the dense layer only. This is because if we use the same $\lambda$ and $\beta$ in all the layers, the proportion of zero weights in the convolutional layers might be high, and zero weights there can indeed degrade the model. Compared to the dense layer of 700,000 weights, there is not much freedom to modify a convolutional layer of 10,000 weights: too much sparsity there leads to a sizable loss of accuracy.
In our models, we have the freedom to set the thresholding parameters, namely $\lambda$, $\beta$ and $a$. A higher threshold usually means more sparsity, since more weights are forced to zero by the threshold. From formulas (2.1)-(2.2) for the $\ell_0$ and $\ell_1$ penalties, it is clear that the larger $\lambda$ is and the smaller $\beta$ is, the higher the threshold will be. Given the same thresholding parameters, the $\ell_0$ model may result in a sparser model than $\ell_1$, since its threshold $\sqrt{2\lambda/\beta}$ is a square root, which is higher than the $\ell_1$ threshold $\lambda/\beta$ whenever $\lambda/\beta < 2$. From formulas (2.3)-(2.4) for the TL1 penalty, the smaller $a$ is, the higher the threshold is. As discussed in the previous section, when $a$ goes to infinity, TL1 becomes $\ell_1$. When $a$ goes to $0$, it becomes $\ell_0$. So to achieve more sparsity, we may choose a small $a$.
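To make the ordering of the thresholds concrete, the sketch below evaluates them numerically. It assumes $\lambda = 0.0005$ and $\beta = 0.1$ (the two values that appear in the results tables; reading them as $\lambda$ and $\beta$ in that order is an assumption), uses the effective penalty weight $\lambda/\beta$ in the TL1 threshold of [8], and loops over the values of $a$ listed in the tables. The function name is hypothetical.

```python
import math

def tl1_threshold_value(mu, a):
    """TL1 threshold of [8] for penalty weight mu and parameter a > 0."""
    if mu <= a * a / (2.0 * (a + 1.0)):
        return mu * (a + 1.0) / a
    return math.sqrt(2.0 * mu * (a + 1.0)) - a / 2.0

lam, beta = 0.0005, 0.1  # assumed reading of the two parameter values in the tables
mu = lam / beta          # effective penalty weight in the u step

l0_threshold = math.sqrt(2.0 * mu)  # hard threshold
l1_threshold = mu                   # soft threshold
tl1_thresholds = {a: tl1_threshold_value(mu, a) for a in (0.01, 0.1, 1.0, 100.0)}

print(f"l0  threshold: {l0_threshold:.5f}")
for a, t in tl1_thresholds.items():
    print(f"TL1 threshold (a = {a:g}): {t:.5f}")
print(f"l1  threshold: {l1_threshold:.5f}")
```

Under these assumptions the TL1 thresholds decrease monotonically as `a` grows and stay between the $\ell_1$ and $\ell_0$ thresholds, in line with the statement that a small $a$ favors sparsity.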
Our algorithm converges quickly after a few iterations. In most of the cases, it obtains an accuracy of and a sparsity of after 10 epochs. The accuracy soon goes up to within 20 epochs, while some models achieve a sparsity of around eventually. Figure 4 shows the convergence of the training algorithm.
Table 1 shows our results on sparsity and testing accuracy. It verifies what we discussed about the thresholding parameters: when the threshold grows higher, the sparsity also grows correspondingly. When $a$ is less than 0.1, we achieve a sparsity above 85%, while the accuracy remains high. The key point to notice is that these sparse networks achieve almost the same, or even better, accuracy than the non-sparse model. Thus we affirm that most of the parameters are redundant, as they hardly contribute to the accuracy of the model.
Another data set we consider consists of images of normal vs. shaky planar shapes such as triangles and quadrangles (not necessarily convex). It can be viewed as another demonstration of PD patients' handwriting, as what they draw is somewhat shaky, like the letters they write. This data set of binary images is simulated by adding random noise to the normal planar shapes. Figure 6 shows some sample images of our shapes. The results on this data set are similar to those of the first data set, as shown in Table 2. So RVSM also achieves high accuracy and sparsity on multi-scale planar curve data.
More properties of our sparse networks are as follows. First, there is a remarkable difference in the distributions of the weights between the sparse and non-sparse models. For the sparse model, most of the weights are zero, while the rest are very close to zero. So its distribution looks like a vertical line plus some noise on an interval close to zero. Our example of a non-sparse model also has a peak at zero. However, very few of its weights are exactly zero. Many of them are merely close to zero, while a large proportion are far away from zero. Moreover, the distribution of this non-sparse model seems to be bell shaped. The distributions are shown in Figure 6, where the weights are normalized for better viewing.
We also notice that RVSM performs much better than applying SGD directly to the TL1 penalized loss functions. As shown in Table 4, most of the normalized weights in the SGD model are distributed between and . It seems there is no apparent criterion to judge if a weight of should be set to zero or it does contribute to the network. However, for the RVSM method when , it is clear that of the weights are greater than and of the weights are less than . There is a significant gap between the two scales of and , which makes it reasonable to set all the weights less than to zero. This leads to a network of sparsity. Another point worth mentioning is that applying SGD directly to the penalized loss function may hurt the accuracy a lot at $a = 0.01$, resulting in 96.7% accuracy for the model. This is because when $a$ is small, the penalized term behaves like $\ell_0$, which renders the objective function nearly singular. RVSM resolves this issue by making the penalty implicit in a thresholding process, which gives an accuracy of 99.5%.
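The kind of scale profile discussed here can be computed as follows. This is a hedged sketch: weights are normalized by the largest magnitude, and the percentage of weights below each of several scales is reported. The scale values `1e-1` to `1e-5` and both example weight lists are hypothetical, since the exact thresholds used in the paper are not legible in this copy.

```python
def scale_profile(weights, scales=(1e-1, 1e-2, 1e-3, 1e-4, 1e-5)):
    """Percentage of normalized weight magnitudes strictly below each scale."""
    largest = max(abs(w) for w in weights)
    normalized = [abs(w) / largest for w in weights]
    return {s: 100.0 * sum(v < s for v in normalized) / len(normalized)
            for s in scales}

# Hypothetical weights with a clear gap: a few large entries, the rest (near) zero.
gapped = [1.0, 0.5, 0.2, 1e-6, 1e-7, 0.0, 0.0, 0.0, 0.0, 0.0]
# Hypothetical weights spread over all scales, with no natural cutoff.
spread = [1.0, 0.3, 0.05, 0.02, 0.008, 0.003, 0.0009, 0.0004, 0.00007, 0.00002]

print("gapped:", scale_profile(gapped))
print("spread:", scale_profile(spread))
```

A profile that stays flat across scales indicates a gap between large and negligible weights, so any cutoff inside the gap gives the same sparsity; a profile that keeps dropping offers no such criterion.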
Table 4 shows another interesting phenomenon. Since the weights are randomly initialized with mean zero, there is roughly even split of plus/minus signs in all layers. At the end of training, we counted the number of sign changes in the kernel of each convolutional layer, and found that more weights changed signs in the first convolutional layer than in the next two layers. This is consistent with the network filters structured towards low pass in depth after training.
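Counting sign changes between initialization and the end of training is straightforward. The sketch below shows one way to do it for a flattened kernel; the function name and the example values are hypothetical, and weights that end at exactly zero are not counted as sign changes here, which is a choice made for this illustration.

```python
def sign_change_fraction(initial, trained):
    """Fraction of weights whose sign at the end of training differs from the initial sign."""
    flips = sum(1 for w0, w1 in zip(initial, trained) if w0 * w1 < 0.0)
    return flips / len(initial)

# Hypothetical flattened kernels of one convolutional layer.
initial_kernel = [0.2, -0.1, 0.3, -0.4]
trained_kernel = [-0.1, 0.2, 0.3, -0.5]

fraction = sign_change_fraction(initial_kernel, trained_kernel)
print(f"{100.0 * fraction:.0f}% of the weights changed sign")
```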
4 Conclusions
In this paper, we have applied the RVSM algorithm to learn sparse neural networks. We have achieved an accuracy above 99% and a sparsity above 86% when training CNNs on a data set consisting of synthetic handwritten letters and planar curves mimicking those of PD patients, together with normal handwriting. We have also discussed the tuning of the thresholding parameters, and verified that a higher threshold can produce higher sparsity. Moreover, our experiments show that RVSM outperforms the direct application of SGD to the penalized loss function, in both sparsity and accuracy. RVSM generates a significant gap between the weights of large scale and small scale, which acts as an indicator of sparsity.
Acknowledgements.
The work was partially supported by NSF grant IIS-1632935. The authors would like to thank Profs. Xiang Gao and Wenrui Hao at Penn State University for helpful discussions of handwriting and drawings in neuropsychological exams and diagnosis.
References
- [1] Blumensath, T., Davies, M.: Iterative thresholding for sparse approximations. Journal of Fourier Analysis and Applications 14(5-6), 629-654 (2008)
- [2] Daubechies, I., Defrise, M., De Mol, C.: An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics 57(11), 1413-1457 (2004)
- [3] Dinh, T., Xin, J.: Convergence of a Relaxed Variable Splitting Method for Learning Sparse Neural Networks via $\ell_1$, $\ell_0$, and transformed-$\ell_1$ Penalties. arXiv preprint arXiv:1812.05719 (2018)
- [4] Louizos, C., Welling, M., Kingma, D.: Learning Sparse Neural Networks Through $\ell_0$ Regularization. arXiv preprint arXiv:1712.01312v2, ICLR (2018)
- [5] Simard, P., Steinkraus, D., Platt, J.: Best practices for convolutional neural networks applied to visual document analysis. Proceedings of the Seventh International Conference on Document Analysis and Recognition, ICDAR (2003)
- [6] Yin, P., Zhang, S., Lyu, J., Osher, S., Qi, Y-Y., Xin, J.: Blended Coarse Gradient Descent for Full Quantization of Deep Neural Networks. Research in the Mathematical Sciences, DOI:10.1007/s40687-018-0177-6, online Jan 2, 2019; arXiv preprint arXiv:1808.05240 (2018)
- [7] Yu, D., Deng, L.: Automatic Speech Recognition: A Deep Learning Approach. Signals and Communication Technology, Springer, New York (2015)
- [8] Zhang, S., Xin, J.: Minimization of transformed $l_1$ penalty: Closed form representation and iterative thresholding algorithms. Comm. Math Sci, 15(2), pp. 511-537 (2017)
