Self-Balanced Dropout

Shen Li; Chenhao Su; Renfen Hu; Zhengdong Lu

arXiv:1908.01968·cs.CL·August 7, 2019


TL;DR

This paper introduces Self-Balanced Dropout, a new method that addresses residual co-adaptation issues in dropout by using a trainable variable to improve model generalization across various tasks.

Contribution

It provides a theoretical proof of co-adaptation persistence after dropout and proposes a novel, trainable dropout mechanism to mitigate this problem.

Findings

01 Effective in reducing co-adaptation
02 Significantly improves performance across tasks
03 Works with simple and complex models

Abstract

Dropout is known as an effective way to reduce overfitting by preventing co-adaptations of units. In this paper, we theoretically prove that the co-adaptation problem still exists after using dropout due to the correlations among the inputs. Based on the proof, we further propose Self-Balanced Dropout, a novel dropout method which uses a trainable variable to balance the influence of the input correlation on parameter updates. We evaluate Self-Balanced Dropout on a range of tasks with both simple and complex models. The experimental results show that the mechanism can effectively solve the co-adaptation problem to some extent and significantly improve the performance on all tasks.

Tables

Table 1: Effectiveness of Self-Balanced Dropout on the sentence classification task. $p_{1}$ and $p_{2}$ are the keep probabilities (1 - dropout rate) of the input layer and the hidden layer respectively. The first line is the result of the CNN-non-static model in Kim (2014). Results also include: MGNC-CNN (Zhang et al., 2016b), MVCNN (Yin and Schütze, 2016), DSCNN (Zhang et al., 2016a), Semantic-CNN (Li et al., 2017) and TopCNN_word (Zhao and Mao, 2017).

Model	MR	SST-1	SST-2	Subj	TREC	CR	MPQA
CNN-non-static	81.5	48.0	87.2	93.4	93.6	84.3	89.5
CNN-SB-Dropout	81.7	52.0	88.8	94.0	94.2	85.0	90.0
( $p_{1}$ , $p_{2}$ )	(0.8, 0.4)	(0.9, 0.7)	(1.0, 0.6)	(0.5, 0.6)	(0.8, 0.6)	(0.9, 0.6)	(0.9, 0.4)
MGNC-CNN	-	48.7	88.3	94.1	95.5	-	-
MVCNN	-	49.6	89.4	93.9	-	-	-
DSCNN	82.2	50.6	88.7	93.9	95.6	-	-
Semantic-CNN	82.1	50.8	89.0	93.7	94.4	86.0	89.3
TopCNN_word	81.7	-	-	93.4	92.5	84.9	89.9

Table 2: F1 scores of models on CoNLL-2003.

Model	F1
ID-CNN	90.32 $\pm$ 0.26
ID-CNN-SB-Dropout	90.73 $\pm$ 0.25

Table 3: F1 scores of models on OntoNotes 5.0.

Model	F1
ID-CNN (3 blocks)	85.27 $\pm$ 0.24
ID-CNN-SB-Dropout (3 blocks)	85.68 $\pm$ 0.20

Table 4: BLEU scores on WMT 2014 En-De dataset.

Model	BLEU (case-sensitive)
Transformer-base	27.3
Transformer-base-SB-Dropout	27.5

Equations

$$J(w)=\|y-Xw\|^{2}.$$

$$\tilde{x}_{ij}=\begin{cases}x_{ij}/p, & \text{with probability } p\\ 0, & \text{with probability } q=1-p\end{cases}$$

$$\underset{w}{\text{minimize}}\;\mathbb{E}_{R\sim Bernoulli(p)}\left[\|y-(R\circ\tilde{X})w\|^{2}\right].$$

$$\underset{w}{\text{minimize}}\;\|y-Xw\|^{2}+R(w).$$

$$w_{j}:=w_{j}-\alpha\left(\frac{\partial J(w)}{\partial w_{j}}+\frac{\partial R(w)}{\partial w_{j}}\right),$$

$$\frac{\partial R(w)}{\partial w_{j}}=2\frac{1-p}{p}\sum_{i}x_{ij}^{2}w_{j}.$$

$$\tilde{x}_{ij}=\begin{cases}x_{ij}, & \text{with probability } p\\ x_{mask}, & \text{with probability } q=1-p\end{cases}$$

$$\underset{w}{\text{minimize}}\;\mathbb{E}_{R\sim Bernoulli(p)}\left[\|y-(R\circ X)w-[(I-R)\circ X_{mask}]w\|^{2}\right].$$

$$\underset{w}{\text{minimize}}\;\|y-pXw\|^{2}+Q(w)+\hat{R}(w).$$

$$Q(w)$$

$$\hat{R}(w)$$

$$\frac{\partial\hat{R}(w)}{\partial w_{j}}=2p(1-p)\sum_{i}(x_{ij}+x_{mask})^{2}w_{j}.$$

Code & Models

Repositories

shenshen-hungry/Self-Balanced-Dropout (tf, Official)

Full text

Self-Balanced Dropout

Shen Li♠, Chenhao Su♣, Renfen Hu♡, Zhengdong Lu♠

♠ {shen, luz}@deeplycurious.ai

♠ Deeplycurious.ai

♣ SICE, Beijing University of Posts and Telecommunications

♡ Institute of Chinese Information Processing, Beijing Normal University

Abstract

Dropout is known as an effective way to reduce overfitting by preventing co-adaptations of units. In this paper, we theoretically prove that the co-adaptation problem still exists after using dropout due to the correlations among the inputs. Based on the proof, we further propose Self-Balanced Dropout, a novel dropout method which uses a trainable variable to balance the influence of the input correlation on parameter updates. We evaluate Self-Balanced Dropout on a range of tasks with both simple and complex models. The experimental results show that the mechanism can effectively solve the co-adaptation problem to some extent and significantly improve the performance on all tasks. Source code is released at https://github.com/shenshen-hungry/Self-Balanced-Dropout.

1 Introduction

Dropout (Hinton et al., 2012; Srivastava et al., 2014), an effective algorithm to reduce overfitting, has been widely used in the training of neural networks. The key idea is to randomly drop out units (input or hidden layer units) of a neural network during training. Dropout can be seen as a kind of regularization (Wager et al., 2013; Baldi and Sadowski, 2013; Srivastava et al., 2014; Helmbold and Long, 2015).

In this paper, we find that the co-adaptation problem still exists when the inputs are strongly correlated, and that it causes a certain degree of overfitting. An intuitive example comes from natural language processing (NLP), where words are the inputs of a model in many tasks. The word distribution hypothesis (Firth, 1957) states that a word is characterized by the company it keeps. Therefore, within a sentence there are natural correlations between words, such as cat and mouse, or apple and eat. This may lead to co-adaptations of units that work well on the training set but do not generalize to unseen data. Although dropout seems to prevent co-adaptations by randomly dropping units during training, the problem remains because of the accumulation of parameter updates.

Based on the theoretical analysis, we propose Self-Balanced Dropout, a simple but effective method that solves the co-adaptation problem caused by the input correlation. Different from the original dropout which randomly sets units to 0, this method randomly replaces units with trainable variables at each iteration. These trainable variables can reduce the impact of correlation among inputs on parameter update, allowing parameters to be updated properly. The experimental results show that Self-Balanced Dropout consistently improves the performance over the original dropout in various NLP tasks.

2 Motivation

Srivastava et al. (2014) propose the dropout mechanism and illustrate how dropout works as a regularization term in linear models. In this section, we briefly introduce their proof and further address the problem in parameter updates caused by correlation among the inputs.

Specifically, let $X=(x_{1},x_{2},...,x_{N})^{T}\in\mathbb{R}^{N\times D}$ be a data matrix, where each $x_{i}\in\mathbb{R}^{D}$ represents a D-dimensional data sample, and let $y\in\mathbb{R}^{N}$ be the labels of the data. Linear regression tries to find a $w\in\mathbb{R}^{D}$ that minimizes

$$J(w)=\|y-Xw\|^{2}. \tag{1}$$

The dropout algorithm randomly perturbs the features of the inputs. For every modified input sample $\tilde{x}_{i}$, $x_{ij}$ is kept with the keep probability $p$, or set to 0 with probability $1-p$. In practice, weight scaling is used to keep training and testing consistent, where each kept $x_{ij}$ is divided by $p$ during training. Then the final $\tilde{x}_{ij}$ is

$$\tilde{x}_{ij}=\begin{cases}x_{ij}/p, & \text{with probability } p\\ 0, & \text{with probability } q=1-p\end{cases} \tag{2}$$

After applying dropout, the input data matrix can be expressed as $R\circ\tilde{X}$ , where $R\in\{0,1\}^{N\times D}$ is a random matrix with $r_{ij}\sim Bernoulli(p)$ , $\tilde{X}$ is the scaled data matrix and $\circ$ denotes an element-wise product. Then the objective function becomes

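$$\underset{w}{\text{minimize}}\;\mathbb{E}_{R\sim Bernoulli(p)}\left[\|y-(R\circ\tilde{X})w\|^{2}\right]. \tag{3}$$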

This reduces to

$$\underset{w}{\text{minimize}}\;\|y-Xw\|^{2}+R(w). \tag{4}$$

where $R(w)=\frac{1-p}{p}\sum_{i}\sum_{j}(x_{ij}w_{j})^{2}$ . As we can see, the role of dropout in linear regression is equivalent to a regularization term that depends on the keep probability $p$ and the inputs $x$ .
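The equivalence between the expected dropout loss and the regularized loss can be checked numerically. The following is a minimal sketch in plain Python that enumerates every Bernoulli mask for a tiny data matrix and compares both sides. The function names and the toy values of `X`, `y`, `w` and `p` are illustrative assumptions and are not taken from the released code.

```python
import itertools


def squared_loss(X, y, w):
    """J(w) = ||y - X w||^2 for a list-of-lists data matrix X."""
    return sum(
        (yi - sum(xij * wj for xij, wj in zip(xi, w))) ** 2
        for xi, yi in zip(X, y)
    )


def expected_dropout_loss(X, y, w, p):
    """Exact E_R ||y - (R o X/p) w||^2, enumerating all Bernoulli(p) masks R."""
    n, d = len(X), len(X[0])
    total = 0.0
    for bits in itertools.product([0, 1], repeat=n * d):
        kept = sum(bits)
        prob = p ** kept * (1 - p) ** (n * d - kept)
        # Kept entries are divided by p, dropped entries become 0.
        X_drop = [
            [bits[i * d + j] * X[i][j] / p for j in range(d)]
            for i in range(n)
        ]
        total += prob * squared_loss(X_drop, y, w)
    return total


def dropout_regularizer(X, w, p):
    """R(w) = (1 - p) / p * sum_i sum_j (x_ij * w_j)^2."""
    return (1 - p) / p * sum(
        (xij * wj) ** 2 for xi in X for xij, wj in zip(xi, w)
    )


# Toy values, chosen only for illustration.
X = [[1.0, 2.0], [0.5, -1.0]]
y = [1.0, 0.0]
w = [0.3, -0.2]
p = 0.8

lhs = expected_dropout_loss(X, y, w, p)
rhs = squared_loss(X, y, w) + dropout_regularizer(X, w, p)
print(abs(lhs - rhs) < 1e-9)
```

Exhaustive enumeration is only feasible for very small matrices (here 2 x 2, so 16 masks); it is used instead of sampling so that the comparison is exact up to floating-point rounding.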

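Further, the parameter $w$ is updated by standard stochastic gradient descent with a learning rate of $\alpha$ as

$$w_{j}:=w_{j}-\alpha\left(\frac{\partial J(w)}{\partial w_{j}}+\frac{\partial R(w)}{\partial w_{j}}\right), \tag{5}$$

where

$$\frac{\partial R(w)}{\partial w_{j}}=2\frac{1-p}{p}\sum_{i}x_{ij}^{2}w_{j}. \tag{6}$$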

When $w$ is small and $x$ is large, the parameter update from the regularization term depends heavily on $x$ . However, the update direction of $w$ should be guided by the label $y$ , rather than by the input $x$ . When some features in $x$ are highly correlated, the regularization term pushes the corresponding dimensions of $w$ in similar directions. The weights therefore still co-adapt, and the model may not work well when the inputs are less correlated at test time.
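To make the dependence on $x$ concrete, the sketch below evaluates the gradient of $R(w)=\frac{1-p}{p}\sum_{i}\sum_{j}(x_{ij}w_{j})^{2}$ in closed form and checks it against a central finite difference. The toy matrix is an assumption made for illustration: its first two columns are identical (perfectly correlated) and its third column is small and unrelated. All function names are hypothetical.

```python
def dropout_regularizer(X, w, p):
    """R(w) = (1 - p) / p * sum_i sum_j (x_ij * w_j)^2."""
    return (1 - p) / p * sum(
        (xij * wj) ** 2 for xi in X for xij, wj in zip(xi, w)
    )


def regularizer_grad(X, w, p):
    """Closed form: dR/dw_j = 2 * (1 - p) / p * sum_i x_ij^2 * w_j."""
    return [
        2 * (1 - p) / p * sum(xi[j] ** 2 for xi in X) * w[j]
        for j in range(len(w))
    ]


def numeric_grad(X, w, p, eps=1e-6):
    """Central finite difference of R with respect to each w_j."""
    grads = []
    for j in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[j] += eps
        w_minus[j] -= eps
        diff = dropout_regularizer(X, w_plus, p) - dropout_regularizer(X, w_minus, p)
        grads.append(diff / (2 * eps))
    return grads


# Columns 0 and 1 are identical; column 2 is small and unrelated.
X = [[1.0, 1.0, 0.2], [2.0, 2.0, -0.1], [3.0, 3.0, 0.4]]
w = [0.1, 0.1, 0.1]
p = 0.5

analytic = regularizer_grad(X, w, p)
numeric = numeric_grad(X, w, p)
print(analytic)
```

With equal weights, the two correlated columns receive the same regularization gradient, and that gradient is much larger than the one for the third column because it scales with $\sum_{i}x_{ij}^{2}$ rather than with anything derived from $y$.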

3 Self-Balanced Dropout

To solve the problem mentioned above, we propose Self-Balanced Dropout, which randomly replaces the units of the inputs with a trainable variable $x_{mask}$ . The parameter update is then not excessively affected by highly correlated input features, which alleviates co-adaptation.

Formally, when Self-Balanced Dropout is applied, $\tilde{x}_{ij}$ is expressed as

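$$\tilde{x}_{ij}=\begin{cases}x_{ij}, & \text{with probability } p\\ x_{mask}, & \text{with probability } q=1-p\end{cases} \tag{7}$$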

Since we have not zeroed any unit, all the units can emit information to the next layer and thus scaling is not needed. Then the objective function becomes

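$$\underset{w}{\text{minimize}}\;\mathbb{E}_{R\sim Bernoulli(p)}\left[\|y-(R\circ X)w-[(I-R)\circ X_{mask}]w\|^{2}\right]. \tag{8}$$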

This reduces to

$$\underset{w}{\text{minimize}}\;\|y-pXw\|^{2}+Q(w)+\hat{R}(w). \tag{9}$$

where

[TABLE]

Here $X_{mask}\in\mathbb{R}^{N\times D}$ is the mask matrix in which each value is the trainable variable $x_{mask}$ . Comparing Equation 4 with Equation 9, Self-Balanced Dropout brings two changes. First, $Q(w)$ forces $w$ and $x_{mask}$ into an inverse relationship, i.e. when $w$ is small, $x_{mask}$ will be large. Second, it should be noted that

$$\frac{\partial\hat{R}(w)}{\partial w_{j}}=2p(1-p)\sum_{i}(x_{ij}+x_{mask})^{2}w_{j}.$$

The update direction of $w$ is now determined by both $x_{ij}$ and $x_{mask}$ . It is not hard to see that adding a large $x_{mask}$ to $x$ can balance the effect of $x$ on $w$ , which diminishes the influence of co-adaptation.
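The replacement rule itself is simple to express in code. Below is a minimal sketch in plain Python of a forward and backward pass for scalar units that share one trainable scalar `x_mask`. It is an illustration under stated assumptions, not the released implementation: the function names, the toy inputs and the way the gradient is accumulated are all hypothetical choices.

```python
import random


def sb_dropout_forward(x, x_mask, keep_prob, rng):
    """Keep each unit with probability keep_prob, otherwise emit x_mask.

    No 1/p rescaling is applied, because no unit is set to zero.
    Returns the perturbed units and the 0/1 keep mask that was drawn.
    """
    keep = [1 if rng.random() < keep_prob else 0 for _ in x]
    out = [xi if k else x_mask for xi, k in zip(x, keep)]
    return out, keep


def sb_dropout_backward(grad_out, keep):
    """Gradients w.r.t. the inputs and w.r.t. the shared scalar x_mask."""
    # Kept positions pass the gradient through to the input.
    grad_x = [g * k for g, k in zip(grad_out, keep)]
    # Replaced positions all route their gradient to the single x_mask.
    grad_x_mask = sum(g * (1 - k) for g, k in zip(grad_out, keep))
    return grad_x, grad_x_mask


rng = random.Random(0)
x = [0.5, -1.2, 3.0, 0.7]
out, keep = sb_dropout_forward(x, x_mask=0.1, keep_prob=0.6, rng=rng)
grad_x, grad_x_mask = sb_dropout_backward([1.0, 1.0, 1.0, 1.0], keep)
print(out, keep, grad_x_mask)
```

Because `x_mask` receives gradient from every replaced position, it is updated like any other parameter, which is what lets it grow or shrink relative to $w$ during training.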

It should be noted that some methods (Vincent et al., 2008; Devlin et al., 2018) randomly mask or replace a few of the tokens at the input layer to force the model to keep a contextual representation of every input token. This has so far been an empirical technique without a theoretical account of why it works well. From the viewpoint of the above proof, the replacing method is actually an improved dropout at the input layer, and it can further reduce the co-adaptation that the original dropout cannot prevent. The co-adaptation problem exists not only in the input layer but also in the hidden layers. Self-Balanced Dropout can therefore be applied to any layer of a model, including both input layers and hidden layers.

4 Experiment

We evaluate Self-Balanced Dropout on three tasks: text classification, named entity recognition (NER) and machine translation. For each input $x_{i}$ , the mask variable is a trainable scalar if $x_{i}$ is a scalar (Figure 1), and a trainable vector if $x_{i}$ is a vector, e.g. a word representation (Figure 2).
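For the vector case, a whole word representation is swapped for a shared mask vector. The sketch below shows one way this could look in plain Python; the function name, the three-dimensional toy embeddings and the zero initial value of the mask vector are assumptions made for illustration only.

```python
import random


def sb_dropout_tokens(embeddings, mask_vector, keep_prob, rng):
    """Replace whole word vectors with the shared mask vector.

    Each word is kept with probability keep_prob; otherwise its entire
    embedding is replaced by mask_vector. No rescaling is applied.
    """
    out, keep = [], []
    for emb in embeddings:
        k = rng.random() < keep_prob
        keep.append(k)
        out.append(list(emb) if k else list(mask_vector))
    return out, keep


# A three-word "sentence" with 3-dimensional embeddings (toy values).
sentence = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
mask_vector = [0.0, 0.0, 0.0]  # hypothetical initial value of the trainable vector

out, keep = sb_dropout_tokens(sentence, mask_vector, keep_prob=0.9, rng=random.Random(1))
print(keep)
```

The decision is drawn once per word rather than once per embedding dimension, so a replaced word contributes only the mask vector to the next layer.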

4.1 Datasets and Experiment Settings

4.1.1 Sentence Classification

We employ the same seven datasets as Kim (2014), covering both sentiment analysis and topic classification tasks. MR: Movie review sentiment dataset (Pang and Lee, 2005). SST-1: Stanford Sentiment Treebank with 5 sentiment labels (Socher et al., 2013). Following Kim (2014), we train the model on both phrases and sentences but test only on sentences. SST-2: SST-1 data with binary labels. Subj: Subjective or objective classification dataset (Pang and Lee, 2004). TREC: 6-class question classification dataset (Li and Roth, 2002). CR: Customer product review dataset (Hu and Liu, 2004). MPQA: Opinion polarity dataset (Wiebe et al., 2005).

CNN-non-static proposed by Kim (2014) is used as our baseline. We replace the original dropout before the fully connected layer with Self-Balanced Dropout, as in Figure 1. In addition, Zhang and Wallace (2015) mention that dropout at the input layer helps little. Therefore, in the baseline, the input layer does not use dropout, which is equivalent to setting the keep probability to 1. Because high correlation among the inputs may exist under the word distribution hypothesis, we assume that Self-Balanced Dropout is suitable at the first layer. In the input layer, every word embedding can be replaced with the mask variable according to its Self-Balanced Dropout probability, as in Figure 2. For a fair comparison, we use the same hyper-parameter settings as Kim (2014).

4.1.2 Named Entity Recognition

We test Self-Balanced Dropout on CoNLL-2003 (Tjong Kim Sang and De Meulder, 2003) and OntoNotes 5.0 (Hovy et al., 2006; Pradhan et al., 2013). ID-CNN (Strubell et al., 2017) is chosen as a strong baseline. Similarly, we replace the original dropout between the convolutional layers of ID-CNN with Self-Balanced Dropout, as in Figure 2. Except for the dropout module, the rest of the model is identical to the baseline.

4.1.3 Machine Translation

We also replace the original dropout with Self-Balanced Dropout in Transformer, an influential deep model in NLP. The Transformer-base model with the same settings as Vaswani et al. (2017) is used, except that the Self-Balanced Dropout rate is 0.05 smaller than the original dropout rate. The translation experiments are conducted in Tensor2Tensor (Vaswani et al., 2018). We test the model on the WMT 2014 English-German dataset.

4.2 Experimental Results and Analysis

For the sentence classification task, Table 1 shows the classification accuracies on seven datasets. The results of Self-Balanced Dropout are listed in the “CNN-SB-Dropout” row. The revised dropout improves accuracy over the CNN-non-static baseline on all seven datasets. We also list the results of other models that either use multiple pre-trained embeddings as inputs or use more complex deep models. By adding $x_{mask}$ to the original dropout, Self-Balanced Dropout achieves competitive performance against these more sophisticated methods.

For the NER task, Table 2 lists F1 scores of models on CoNLL-2003 and Table 3 lists F1 scores of models on OntoNotes 5.0. The results of Self-Balanced Dropout are listed in the “ID-CNN-SB-Dropout” rows. The revised dropout consistently improves the performance over the baseline, from 90.32 to 90.73 F1 on CoNLL-2003 and from 85.27 to 85.68 F1 on OntoNotes 5.0.

For the translation task, Table 4 lists BLEU scores of Transformer models. Although Transformer has a very deep structure and a large number of parameters, Self-Balanced Dropout still yields an improvement, from 27.3 to 27.5 BLEU.

In order to understand where the improvement comes from and to verify that our calculation is correct, we further analyze how the norms of $X$ , $X_{mask}$ and $w$ change during training. Figure 3 shows the changes of $norm(w)/norm(X)$ and $X_{mask}$ in two experiments. After several epochs, $norm(w)/norm(X)$ continues to decrease, which means the regularization term $R(w)$ is increasingly determined by $X$ . Meanwhile, $X_{mask}$ keeps increasing as expected, and thus diminishes the influence of correlation in $X$ on parameter updates.

It is worth noting that in the above experiments, Self-Balanced Dropout is applied in a single dimension. We have also tried to apply the method simultaneously in different dimensions, i.e. units in any dimension are randomly replaced with the same trainable variable. This variant behaves similarly to the original dropout, and its improvement does not seem statistically significant. The reason could be that sharing the same variable across different dimensions is not meaningful and overburdens the trainable variable. Therefore, replacing inputs in a single dimension is better than replacing them in different dimensions.

5 Conclusion

In this work, we identify an inherent problem of the original dropout: from the perspective of regularization, it can still cause co-adaptation. This motivates us to propose Self-Balanced Dropout, a method that aims to diminish the influence of correlation in inputs on parameter updates. Since our approach improves on the original dropout, Self-Balanced Dropout can replace the original dropout wherever it is used. The experimental results show improvements on all tasks. In the future, we will extend the proposed method to other fields.

Bibliography

1. Pierre Baldi and Peter J Sadowski. 2013. Understanding dropout. In Advances in neural information processing systems, pages 2814–2822.
2. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
3. John R Firth. 1957. A synopsis of linguistic theory, 1930-1955. Studies in linguistic analysis.
4. David P Helmbold and Philip M Long. 2015. On the inductive bias of dropout. The Journal of Machine Learning Research, 16(1):3403–3454.
5. Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
6. Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. Ontonotes: the 90% solution. In Proceedings of the human language technology conference of the NAACL, Companion Volume: Short Papers, pages 57–60. Association for Computational Linguistics.
7. Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 168–177. ACM.
8. Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.