TL;DR
Sparseout is a novel regularization technique that explicitly controls activation sparsity in deep neural networks, improving performance in language modeling and image classification by adjusting sparsity levels.
Contribution
It introduces Sparseout, a simple, efficient method that generalizes Dropout to regulate sparsity, with theoretical proof linking it to $L_q$ penalties and empirical validation across tasks.
Findings
Sparsity benefits language modeling performance.
Denser activations improve image classification.
Sparseout effectively controls activation sparsity.
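Table 1: Execution time of Backprop, Dropout, Sparseout and Bridgeout for two hidden layer sizes (see Section 4.2).

| Hidden layer size | Backprop | Dropout | Sparseout | Bridgeout |
|---|---|---|---|---|
| 1024 | 5.2 | 5.3 | 5.8 | 31.6 |
| 2048 | 5.6 | 5.6 | 6.0 | 57.2 |

Table 3: Perplexity on Penn Treebank and WikiText-2 (lower is better; see Section 4.4.1).

| Model | Penn Treebank | WikiText-2 |
|---|---|---|
| LSTM-Sparseout ($q > 2$) | 62.7 | 70.18 |
| LSTM-Dropout | 62.13 | 68.34 |
| LSTM-Sparseout ($q < 2$) | 60.57 | 67.17 |
| AWD-LSTM-Dropout [35] | 57.3 | 65.8 |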
Sparseout: Controlling Sparsity in Deep Networks
Najeeb Khan, Ian Stavness
Dept. of Computer Science, University of Saskatchewan, Canada
Abstract
Dropout is commonly used to help reduce overfitting in deep neural networks. Sparsity is a potentially important property of neural networks, but is not explicitly controlled by Dropout-based regularization. In this work, we propose Sparseout, a simple and efficient variant of Dropout that can be used to control the sparsity of the activations in a neural network. We theoretically prove that Sparseout is equivalent to an $L_q$ penalty on the features of a generalized linear model and that Dropout is a special case of Sparseout for neural networks. We empirically demonstrate that Sparseout is computationally inexpensive and is able to control the desired level of sparsity in the activations. We evaluated Sparseout on image classification and language modelling tasks to see the effect of sparsity on these tasks. We found that sparsity of the activations is favorable for language modelling performance while image classification benefits from denser activations. Sparseout provides a way to investigate sparsity in state-of-the-art deep learning models. Source code for Sparseout can be found at https://github.com/najeebkhan/sparseout.
1 Introduction
Sparsity is often thought to be a desirable property for artificial neural networks (ANNs). This is likely rooted in early neuroscience studies that discovered sparse coding in the visual cortex [1], which led to the hypothesis that, at any given time, only a small number of neurons are used to encode sensory information. Sparsity has been observed both in connectivity [2] and representation [1]. To mimic sparse coding from brain studies, researchers have devised approaches to encourage sparsity when training ANNs.
Sparsity has been used to regularize models by imposing a sparsity constraint on the activations of the neural network [3]. Many useful properties are ascribed to sparsity in the literature. It has been hypothesized that neurons that are rarely active are more interpretable than those that are active most of the time [4]; images of natural world objects can be described in terms of sparse statistically independent events; neural networks with sparsity constraints learn filters that resemble the mammalian visual cortex area V1 [1] and area V2 [5]; and sparsity allows faster learning [6].
One of the main motivations behind sparsity-based training methods is the biological plausibility of these algorithms. However, recent studies have questioned the pervasiveness of neural sparsity. The biological studies that provide evidence for sparsity are performed when the subject is passive, and in reality sparsity might not be the mechanism that the brain uses in active tasks [7]. Empirically, for deep neural networks (DNNs), new methods that may discourage sparsity, such as Maxout [8] and DARC1 [9], have achieved better performance than sparse methods in certain domains such as computer vision. It is therefore not clear whether sparsity is a generally desirable property for DNNs. We hypothesize that sparsity will benefit some learning tasks and hinder others. New DNN training approaches with the flexibility to encourage sparsity where it helps and to discourage it otherwise could therefore provide improved task performance.
There are many approaches for affecting the sparsity of a DNN during training including the use of certain activation functions such as rectified linear units [10]. One of the main ways is through regularization and several deterministic regularization algorithms have been proposed to train deep neural networks with sparse weights [11, 12, 13] and sparse activations [14, 15, 16, 10, 17].
Training deep neural networks with deterministic regularization and backpropagation results in correlated activities of the neurons. To prevent such co-adaptations as well as regularize the models, stochastic regularization methods are used. Stochastic methods such as Dropout, Bridgeout and Shakeout have been shown to be equivalent to ridge, bridge and elastic-net penalties on the model weights. Previous stochastic regularization methods that explicitly encourage sparsity, i.e., Shakeout and Bridgeout require a new set of masked weights per training example in a mini-batch making them computationally expensive. Therefore, these existing methods cannot be applied to large fully connected architectures. Likewise, Shakeout and Bridgeout cannot be easily applied to other convolutional architectures such as DenseNet and Wide-ResNet that provide current state-of-the-art performance for image classification, because they cannot be used with highly optimized black-box implementations such as cuDNN [18].
In this paper, we propose Sparseout, a stochastic regularization method that is capable of either encouraging or discouraging sparsity in deep neural networks. It provides an $L_q$-norm penalty on the network's activations and can therefore vary activation sparsity through its parameter $q$. The computational cost of Sparseout is comparable to Dropout and it can be applied to existing optimized CNN and LSTM blocks, making it applicable to state-of-the-art architectures. We provide theoretical and empirical results demonstrating the bridge-regularization capability of Sparseout. We use Sparseout to evaluate whether or not sparsity is beneficial for two distinct learning tasks: image classification and language modeling.
2 Related work
Because deep neural networks are over-parameterized, they suffer from large generalization error, especially when the dataset is relatively small. This phenomenon is known as overfitting. Generalization error is upper bounded by the model complexity [19]; thus overfitting can be reduced by controlling the complexity of the model.
One way to control the complexity of a model is to impose constraints on the parameters of the model such as the weights in the neural networks. Such model regularization methods can be classified into deterministic and stochastic methods. Deterministic methods either remove redundant weights [13] or penalize large magnitude weights. Weight penalties are imposed by adding a regularization term to the loss function consisting of a norm of the weight matrix [20].
Stochastic methods randomly perturb the weights so as to minimize co-dependency between neurons [21, 22, 23] while regularizing the model at the same time. Stochastic regularization has become standard practice in training deep learning models and has outperformed deterministic regularization methods on many tasks. Stochastic regularization techniques have a Bayesian model averaging interpretation, and they are also equivalent to weight penalties for linear models. In terms of Bayesian estimation, weight penalties are equivalent to imposing a prior distribution on the model weights.
Besides the weight penalty interpretation, a reason for the effectiveness of stochastic regularization methods could be the prevention of correlated activations. It has been shown that high correlation between the activations of the neurons results in overfitting. DeCov [24] reduces overfitting by adding a penalty term to the cost function consisting of the covariances among the activations of the neurons over a mini-batch.
Another approach to control model complexity, inspired by sparse coding [1], is to impose a sparsity constraint on the activations of the neural network [3]. To encourage sparsity of the activations, an $L_1$ norm of the activations is added to the cost function [5]. Another form of penalty is to add the KL-divergence between the expected activations and a preset target sparsity value [4]. Liao et al. have used a clustering approach to obtain sparse representations by encouraging activations to form clusters [17].
Another related technique, Batchnorm [25], normalizes activations in the network so as to have zero mean and unit variance. Although the primary purpose of Batchnorm is accelerating the training of the neural network rather than regularization, Batchnorm has reduced the need for stochastic regularization in certain domains. The sparsity-inducing methods mentioned above are deterministic and thus may result in correlated activations. In this paper we propose Sparseout, which implicitly imposes an $L_q$ penalty on the activations, allowing us to choose the level of sparsity in the activations while its stochasticity prevents correlated neural activities. Sparseout differs from Bridgeout [23] in that it is applied to activations rather than the weights. Sparseout is therefore orders of magnitude faster and more practical than Bridgeout. We believe that Sparseout is the first theoretically motivated technique that is capable of simultaneously controlling sparsity in activations and reducing correlations between them, besides being equivalent to Dropout for $q = 2$.
3 Sparseout
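Consider the $l$-th layer of a feedforward neural network. The output of the $l$-th layer is given by
$$\mathbf{a}^{l} = \sigma\left(\mathbf{W}^{l}\mathbf{a}^{l-1} + \mathbf{b}^{l}\right) \tag{1}$$
where $\mathbf{W}^{l}$ and $\mathbf{b}^{l}$ are the weight matrix and bias vector for the $l$-th layer, $\sigma$ is a non-linear activation function and $\mathbf{a}^{l-1}$ is the output of the previous layer.
The Sparseout perturbed output of the $l$-th layer is given by
$$\tilde{\mathbf{a}}^{l} = \mathbf{a}^{l} + |\mathbf{a}^{l}|^{q/2} \odot (\mathbf{r}^{l} - 1) \tag{2}$$
where the absolute value and power are taken elementwise, $\mathbf{r}^{l}$ is a random mask vector sampled from a Bernoulli distribution with probability $p$ and scaled by $1/p$, and $q$ specifies the normed space to which the activations are restricted. Since the random mask is scaled by $1/p$ during training, no changes to the neural network are needed during testing. Since the neural network is trained using the back-propagated gradients of the error, the gradient of the Sparseout perturbed output is given by
$$\frac{\partial \tilde{\mathbf{a}}^{l}}{\partial \mathbf{a}^{l}} = 1 + \frac{q}{2}\,|\mathbf{a}^{l}|^{q/2 - 1} \odot \operatorname{sign}(\mathbf{a}^{l}) \odot (\mathbf{r}^{l} - 1) \tag{3}$$
where $\operatorname{sign}(\cdot)$ is the sign function.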
Since Sparseout operates on the activations of the neural networks similar to Dropout, Sparseout can be implemented with minimal changes to the existing Dropout implementation. Sparseout can be used with the highly optimized black-box implementations such as cuDNN [18]. The above Sparseout formulation is applicable to any feedforward network layer such as convolutional or fully connected layers as well as layers in recurrent neural networks.
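As a rough illustration of how little changes relative to Dropout, the following is a minimal pure-Python sketch. It assumes the perturbation has the form $\tilde{a} = a + |a|^{q/2}(r - 1)$ with $r$ a Bernoulli($p$) mask value scaled by $1/p$. The function names `perturb`, `perturb_grad`, `sample_mask`, `sparseout` and `dropout` are illustrative and are not taken from the authors' repository.

```python
import random


def perturb(a, r, q):
    """Sparseout-style perturbation of one activation a given its scaled mask value r.

    Assumed form: a + |a|**(q/2) * (r - 1), where r is either 0 or 1/p.
    """
    return a + abs(a) ** (q / 2.0) * (r - 1.0)


def perturb_grad(a, r, q):
    """Derivative of perturb() with respect to a.

    For q < 2 the derivative is unbounded at a == 0; returning 1.0 there is a
    convention chosen for this sketch, not something stated in the paper.
    """
    if a == 0.0:
        return 1.0
    sign = 1.0 if a > 0 else -1.0
    return 1.0 + (q / 2.0) * abs(a) ** (q / 2.0 - 1.0) * sign * (r - 1.0)


def sample_mask(n, p, rng):
    """Draw n mask values: 1/p with probability p, otherwise 0, so that E[r] = 1."""
    return [1.0 / p if rng.random() < p else 0.0 for _ in range(n)]


def sparseout(activations, p=0.5, q=2.0, rng=None):
    """Perturb a list of activations; intended for training time only."""
    rng = rng if rng is not None else random.Random()
    mask = sample_mask(len(activations), p, rng)
    return [perturb(a, r, q) for a, r in zip(activations, mask)]


def dropout(activations, p=0.5, rng=None):
    """Inverted Dropout with keep probability p, for comparison."""
    rng = rng if rng is not None else random.Random()
    mask = sample_mask(len(activations), p, rng)
    return [a * r for a, r in zip(activations, mask)]
```

With `q=2.0` and non-negative inputs, `sparseout` and `dropout` return the same values when given the same seeded generator. Other values of `q` leave the mask untouched and change only the size of the perturbation applied to each activation.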
Theorem 3.1
Sparseout is equivalent to an $L_q$ penalty on the features of a generalized linear model.
For a generalized linear model with parameters , log partition function and the perturbed design matrix of dimension , the negative log likelihood function could be split into a mean squared error term and a penalty term [26, eq. 6] given by
[TABLE]
For the Sparseout perturbation , the variance of is given by
[TABLE]
Substituting in Equation 4 we have
[TABLE]
where and .
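The variance step of this argument can be sketched as follows. This is an illustrative derivation under stated assumptions, not the paper's exact equations: it assumes the perturbation acts on each feature as $\tilde{x}_j = x_j + |x_j|^{q/2}(r_j - 1)$ with independent masks $r_j$ drawn from Bernoulli($p$) and scaled by $1/p$, and the symbols $x_j$, $\beta_j$ and $r_j$ are names chosen here for a single row of the design matrix, the model parameters and the mask.

```latex
% Assumed per-feature perturbation with a scaled Bernoulli mask:
\tilde{x}_j = x_j + |x_j|^{q/2}\,(r_j - 1), \qquad
\mathbb{E}[r_j] = 1, \qquad
\mathrm{Var}[r_j] = \frac{p(1-p)}{p^{2}} = \frac{1-p}{p}.

% The perturbation is unbiased, and independent masks give a variance
% that is a weighted sum of |x_j|^q:
\mathbb{E}\big[\tilde{x}\cdot\beta\big] = x\cdot\beta, \qquad
\mathrm{Var}\big[\tilde{x}\cdot\beta\big]
  = \sum_{j=1}^{d} \beta_j^{2}\,|x_j|^{q}\,\mathrm{Var}[r_j]
  = \frac{1-p}{p}\sum_{j=1}^{d} \beta_j^{2}\,|x_j|^{q}.
```

Any penalty term that is proportional to this variance is therefore a weighted sum of $|x_j|^{q}$, which is the sense in which the perturbation acts as an $L_q$ penalty on the features. For $q = 2$ the sum reduces to $\sum_j \beta_j^2 x_j^2$, the form obtained for Dropout.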
Theorem 3.2
For non-negative activation functions, Dropout is equivalent to Sparseout when $q = 2$.
Setting $q = 2$ in Equation 2 and considering the fact that the activation vector $\mathbf{a}^{l}$ is non-negative, we have $\tilde{\mathbf{a}}^{l} = \mathbf{a}^{l} + \mathbf{a}^{l} \odot (\mathbf{r}^{l} - 1) = \mathbf{a}^{l} \odot \mathbf{r}^{l}$, where $\mathbf{r}^{l}$ is the scaled random mask. This is identical to the Dropout perturbation [21].
$L_q$-normed spaces with different values of $q$ exhibit different sparsity characteristics. For $q < 2$ the norm space is sparse, while for $q > 2$ the norm space is dense [27]. With Sparseout we can select the norm space of the activations by choosing the value of the hyper-parameter $q$. Thus, Sparseout allows us to control the level of sparsity in the activations of the neural networks.
4 Experimental results
4.1 Sparsity characterization
To verify that Sparseout is capable of controlling the sparsity of a neural network's activations, we train an autoencoder with a hidden layer of rectified linear units on the MNIST dataset. Dropout and Sparseout are applied to the hidden layer activations with the same probability $p$ and different values of $q$ for Sparseout. We measure the sparsity of the hidden layer activations during testing (when no perturbations are applied to the activations). To measure sparsity we use Hoyer's measure [28]:
$$\operatorname{hoyer}(\mathbf{x}) = \frac{\sqrt{d} - \|\mathbf{x}\|_1 / \|\mathbf{x}\|_2}{\sqrt{d} - 1}$$
where $\mathbf{x}$ is a $d$-dimensional vector, $\|\mathbf{x}\|_1$ is the $L_1$-norm and $\|\mathbf{x}\|_2$ is the $L_2$-norm of $\mathbf{x}$. A vector consisting of equal non-zero values has a sparsity of $0$, while a vector with only one non-zero element has a sparsity of $1$. Figure 1 shows Hoyer's sparsity measure on the test set as training progresses for different values of $q$. As the value of $q$ decreases below $2$, we see an increase in the sparsity of the activations, whereas for values greater than $2$ the sparsity is reduced. For $q = 2$, Sparseout results in the same sparsity as Dropout. These results confirm our theoretical analysis that Sparseout can be used to control the sparsity of the activations in neural networks.
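For readers who want to reproduce the sparsity measurement, here is a small sketch of Hoyer's measure, assuming the standard definition $(\sqrt{d} - \|x\|_1/\|x\|_2)/(\sqrt{d} - 1)$ for a $d$-dimensional vector. The function name `hoyer_sparsity` and the error handling are choices made for this example.

```python
import math


def hoyer_sparsity(x):
    """Hoyer's sparsity of a vector: 0 for equal non-zero entries, 1 for a one-hot vector."""
    d = len(x)
    l1 = sum(abs(v) for v in x)
    l2 = math.sqrt(sum(v * v for v in x))
    if d < 2 or l2 == 0.0:
        # The measure is undefined for scalars and for the all-zero vector.
        raise ValueError("need at least two entries and a non-zero vector")
    return (math.sqrt(d) - l1 / l2) / (math.sqrt(d) - 1.0)


dense = hoyer_sparsity([1.0, 1.0, 1.0, 1.0])   # all entries equal
sparse = hoyer_sparsity([0.0, 0.0, 3.0, 0.0])  # a single non-zero entry
```

In an experiment like the one described above, this function would be applied to each hidden activation vector on the test set and the results averaged.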
4.2 Computational cost
Sparseout is computationally efficient and incurs a training cost similar to that of Dropout. We train an autoencoder with two hidden layer sizes (1024 and 2048) on MNIST with a fixed batch size on an Nvidia GTX 1080 GPU. As shown in Table 1, Sparseout is only fractionally more expensive than Dropout, while Bridgeout is an order of magnitude more expensive even for this simple model. Doubling the hidden layer size also nearly doubles the execution time for Bridgeout (31.6 to 57.2), while Sparseout and Dropout have almost constant execution time due to their use of GPU parallelism.
4.3 Image classification
Image classification is one of the key areas where deep neural networks have been highly successful, achieving state-of-the-art results. The CIFAR datasets [29] are a standard benchmark for image classification. The CIFAR-10 dataset consists of 60,000 color images of size $32 \times 32$, each belonging to one of ten classes of objects. The dataset is divided into a training set of 50,000 images and a test set of 10,000 images. The CIFAR-100 dataset is similar to CIFAR-10 except that the images are divided into 100 classes of objects, making the classification task harder than CIFAR-10. We used the standard pre-processing of mean and standard deviation normalization. Random cropping and random horizontal flips were used for data augmentation.
We use the wide residual network (WRN) architecture to evaluate the effect of sparsity on classification accuracy using Sparseout. WRNs achieved state-of-the-art accuracy on several image classification tasks including CIFAR-10 and CIFAR-100. WRNs are based on deep residual networks [30], which use identity links between the input and output of each layer, known as residual connections, but they employ fewer and wider layers. Residual connections help in training very deep neural networks consisting of up to a thousand layers.
We employ a WRN with the basic building block shown in Figure 2. The stochastic regularization is applied between the convolutional layers. Each convolutional layer is preceded by batch normalization and rectified linear unit activation function. In our experiments we use the WRN architecture WRN-28-10 with depth 28 and a widening factor of 10. A Dropout probability of was used. Stochastic gradient descent with a mini-batch of 64 was used to train the networks. The learning rate was annealed from at , and epochs by a factor of as in the original WRN paper [31].
4.3.1 Image Classification Results
For image classification we found that Sparseout with $q > 2$ resulted in better performance than with $q < 2$, as shown in Figure 3. For $q < 2$ the accuracy drops as training progresses, indicating overfitting. As shown in Table 2, the error rate for $q > 2$ is lower than that of Dropout on both CIFAR-10 and CIFAR-100. Our baseline results are comparable to the baselines reported in the literature for CIFAR-100 and better for CIFAR-10 [31, 32].
4.4 Language Modelling
Another task for which deep learning has been widely used is natural language processing (NLP). The data in NLP tasks is very high-dimensional and sparse; therefore, sparsity is likely to play an important role in such tasks. Language modelling (LM) assigns a probability to a sequence of words. LM is an important component of several NLP tasks such as speech recognition, information retrieval and machine translation, among others. Since LM is a sequential task, recurrent neural networks (RNNs) are used for it. Vanilla RNNs are difficult to train due to the vanishing and exploding gradients problem. To overcome these limitations, long short-term memory (LSTM) models are used instead [33].
The LSTM model is a type of recurrent neural network with layers consisting of memory cells. The weights of the nodes in a memory cell learn the long-term information while a node with a self-connected edge retains short-term information. The input gate, forget gate and output gate help in controlling the flow of information in the LSTM. For a detailed review of the LSTM formulation see Lipton et al. [34].
We adapt the baseline LSTM architecture for word-level language modelling from Merity et al. [35]. The model consists of 3 layers of 1150 units. To train the baseline model we used the same hyper-parameters as Merity et al. (https://github.com/salesforce/awd-lstm-lm), except that we used only stochastic gradient descent for training.
We replace variants of Dropout with variants of Sparseout in the LSTM model. Variational Dropout [36] is replaced with variational Sparseout where a single random mask is used within a forward and backward pass. Embedding Dropout applied to the word embedding layer is similarly replaced with embedding Sparseout.
We evaluate the model on two standard word-level language modelling datasets where the task is to predict the next word. Performance is evaluated using perplexity, which is the exponential of the average negative log-likelihood per word. The first dataset is the Penn Treebank (PTB) dataset [37], which contains about 1 million words and has a vocabulary size of 10,000. The second dataset is the WikiText-2 dataset [38], which contains over 2 million words and has a vocabulary of size 33,278.
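To make the evaluation metric concrete, here is a short sketch of how perplexity is usually computed from per-word log-probabilities, assuming natural logarithms and averaging over all predicted words. The function name `perplexity` and the toy input are illustrative only.

```python
import math


def perplexity(word_log_probs):
    """Perplexity = exp(mean negative log-likelihood), using natural-log probabilities."""
    if not word_log_probs:
        raise ValueError("need at least one predicted word")
    mean_nll = -sum(word_log_probs) / len(word_log_probs)
    return math.exp(mean_nll)


# A model that spreads probability uniformly over a 10,000-word vocabulary
# has a perplexity equal to the vocabulary size.
uniform = perplexity([math.log(1.0 / 10000)] * 50)
```

Lower perplexity means the model assigns higher probability to the held-out text, which is why the smaller numbers in the results table indicate better models.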
4.4.1 Language modeling results
Applying Sparseout with $q > 2$ resulted in significant overfitting, as shown in Figure 4. For $q < 2$ we found that Sparseout resulted in better prediction performance than Dropout. On both the PTB and WikiText-2 datasets, Sparseout reduces perplexity relative to Dropout, as shown in Table 3.
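As a worked example of what a relative perplexity reduction means, the sketch below applies the usual formula, (baseline - value) / baseline, to the LSTM-Dropout row and the better-performing LSTM-Sparseout row of Table 3. The helper name `relative_reduction` is hypothetical, and the computed percentages follow from the table entries alone; they are not quoted from the paper's text.

```python
def relative_reduction(baseline, value):
    """Percentage by which value is lower than baseline."""
    return 100.0 * (baseline - value) / baseline


# Perplexities from Table 3: LSTM-Dropout versus the lower-perplexity LSTM-Sparseout row.
ptb_reduction = relative_reduction(62.13, 60.57)
wikitext2_reduction = relative_reduction(68.34, 67.17)

print(f"PTB: {ptb_reduction:.2f}%  WikiText-2: {wikitext2_reduction:.2f}%")
```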
5 Discussion
Existing literature is contradictory on whether sparsity is a good [14, 15, 16, 10, 17, 32] or bad [7, 39, 9, 8, 40] property for deep neural networks. No previous study has evaluated sparse vs. non-sparse networks in a controlled fashion with stochastic regularization. In this study, we propose a new bridge-regularization scheme, Sparseout, which has the flexibility to control sparsity and the efficiency to be applied to large networks.
We evaluated Sparseout with two distinct network architectures and machine learning tasks: CNNs for image classification and LSTMs for language modeling. Our empirical results show that lower sparsity improves image classification performance, whereas higher sparsity improves performance on language modeling. These results align with the fundamental differences between data types: relatively tiny densely-featured images vs. sparsely-featured high-dimensional language data.
In this study, we chose the most suitable architecture for each task: CNNs for IID image classification and RNNs for sequential language modelling. Therefore, we evaluated task-architecture in a coupled manner. For each task, image classification or language modelling, we tested two datasets (CIFAR10/CIFAR100 or PTB/WikiText-2) and obtained consistent results regarding the benefit or lack thereof of sparse activations. It is possible, however, that the inherent sparse nature of convolutional layers requires spreading of the activations over all the neurons while enforced parsimony of representation is helpful for the fully connected gates in an LSTM. Therefore, decoupling the effect of data type from that of architecture is an important consideration we plan to investigate as future work.
Acknowledgments
This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC).
References
- [1] Olshausen, B.A., Field, D.J.: Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research 37 (23) (1997) 3311–3325
- [2] Morris, G., Nevet, A., Bergman, H.: Anatomical funneling, sparse connectivity and redundancy reduction in the neural networks of the basal ganglia. Journal of Physiology-Paris 97 (4) (2003) 581–589
- [3] Thom, M., Palm, G.: Sparse activity and sparse connectivity in supervised learning. Journal of Machine Learning Research 14 (Apr) (2013) 1091–1143
- [4] Hinton, G.: A practical guide to training restricted Boltzmann machines. Momentum 9 (1) (2010) 926
- [5] Lee, H., Ekanadham, C., Ng, A.Y.: Sparse deep belief net model for visual area V2. In: Advances in Neural Information Processing Systems. (2008) 873–880
- [6] Schweighofer, N., Doya, K., Lay, F.: Unsupervised learning of granule cell sparse codes enhances cerebellar adaptive control. Neuroscience 103 (1) (2001) 35–50
- [7] Spanne, A., Jörntell, H.: Questioning the role of sparse coding in the brain. Trends in Neurosciences 38 (7) (2015) 417–427
- [8] Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning (ICML-13). (2013) 1319–1327
