BAM! Born-Again Multi-Task Networks for Natural Language Understanding

Kevin Clark; Minh-Thang Luong; Urvashi Khandelwal; Christopher D. Manning; Quoc V. Le

arXiv:1907.04829 · cs.CL · July 11, 2019

TL;DR

This paper introduces BAM!, a multi-task learning approach for NLP that uses knowledge distillation and teacher annealing to improve over standard single-task and multi-task training, demonstrated by multi-task fine-tuning BERT on the GLUE benchmark.

Contribution

It presents a novel teacher annealing technique that enhances multi-task training by gradually shifting from distillation to supervised learning.

Findings

1. Consistent performance improvements on the GLUE benchmark
2. Multi-task models outperform single-task models
3. Teacher annealing helps the student surpass its teacher models

Abstract

It can be challenging to train multi-task neural networks that outperform or even match their single-task counterparts. To help address this, we propose using knowledge distillation where single-task models teach a multi-task model. We enhance this training with teacher annealing, a novel method that gradually transitions the model from distillation to supervised learning, helping the multi-task model surpass its single-task teachers. We evaluate our approach by multi-task fine-tuning BERT on the GLUE benchmark. Our method consistently improves over standard single-task and multi-task training.

Tables

Table 1: Comparison of methods on the GLUE dev set. *, **, and *** indicate statistically significant ($p<.05$, $p<.01$, and $p<.001$) improvements over both Single and Multi according to bootstrap hypothesis tests.

Model	Avg.	CoLA	SST-2	MRPC	STS-B	QQP	MNLI	QNLI	RTE
$|\mathcal{D}|$	-	8.5k	67k	3.7k	5.8k	364k	393k	108k	2.5k
Single	84.0	60.6	93.2	88.0	90.0	91.3	86.6	92.3	70.4
Multi	85.5	60.3	93.3	88.0	89.8	91.4	86.5	92.2	82.1
Single $\to$ Single	84.3	61.7**	93.2	88.7*	90.0	91.4	86.8**	92.5***	70.0
Multi $\to$ Multi	85.6	60.9	93.5	88.1	89.8	91.5*	86.7	92.3	82.0
Single $\to$ Multi	86.0***	61.8**	93.6*	89.3**	89.7	91.6*	87.0***	92.5***	82.8*

Table 2: Comparison of test set results. *MT-DNN$_{KD}$ is distilled from a diverse ensemble of models.

Model	GLUE score
BERT-Base (Devlin et al., 2019)	78.5
BERT-Large (Devlin et al., 2019)	80.5
BERT on STILTs (Phang et al., 2018)	82.0
MT-DNN (Liu et al., 2019b)	82.2
Span-Extractive BERT on STILTs (Keskar et al., 2019)	82.3
Snorkel MeTaL ensemble (Hancock et al., 2019)	83.2
MT-DNN$_{KD}$* (Liu et al., 2019a)	83.7
BERT-Large + BAM (ours)	82.3

Table 3: Combining multi-task training with single-task fine-tuning. Improvements are statistically significant ($p<.01$) according to Mann-Whitney U tests.

Model	Avg. Score
Multi	85.5
+Single-Task Fine-Tuning	+0.3
Single $\to$ Multi	86.0
+Single-Task Fine-Tuning	+0.1

Table 4: Ablation Study. Differences from Single $\to$ Multi are statistically significant ($p<.001$) according to Mann-Whitney U tests.

Model	Avg. Score
Single $\to$ Multi	86.0
No layer-wise LRs	-0.3
No task sampling	-0.4
No teacher annealing: $\lambda = 0$	-0.5
No teacher annealing: $\lambda = 0.5$	-0.3

Table 5: Which tasks help RTE? Pairwise differences are statistically significant ($p<.01$) according to Mann-Whitney U tests.

Trained Tasks	RTE score
RTE	70.0
RTE + MNLI	83.4
RTE + QQP + CoLA + SST	75.1
All GLUE	82.8

Equations

$$\mathcal{L}(\theta)=\sum_{x_{\tau}^{i},y_{\tau}^{i}\in\mathcal{D}_{\tau}}\ell(y_{\tau}^{i},f_{\tau}(x_{\tau}^{i},\theta))$$

$$\mathcal{L}(\theta)=\sum_{x_{\tau}^{i},y_{\tau}^{i}\in\mathcal{D}_{\tau}}\ell(f_{\tau}(x_{\tau}^{i},\theta^{\prime}),f_{\tau}(x_{\tau}^{i},\theta))$$

$$\mathcal{L}(\theta)=\sum_{\tau\in\mathcal{T}}\sum_{x_{\tau}^{i},y_{\tau}^{i}\in\mathcal{D}_{\tau}}\ell(f_{\tau}(x_{\tau}^{i},\theta_{\tau}),f_{\tau}(x_{\tau}^{i},\theta))$$

$$\ell(\lambda y_{\tau}^{i}+(1-\lambda)f_{\tau}(x_{\tau}^{i},\theta_{\tau}),f_{\tau}(x_{\tau}^{i},\theta))$$

Code & Models

Repositories

google-research/google-research/tree/master/bam (official)

Taxonomy

Methods: Linear Layer · Knowledge Distillation · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Dense Connections · Adam · WordPiece

Full text

BAM! Born-Again Multi-Task Networks for Natural Language Understanding

Kevin Clark† Minh-Thang Luong‡ Urvashi Khandelwal†

Christopher D. Manning† Quoc V. Le‡

†Computer Science Department, Stanford University

‡Google Brain

{kevclark,urvashik,manning}@cs.stanford.edu

{thangluong,qvl}@google.com

1 Introduction

Building a single model that jointly learns to perform many tasks effectively has been a long-standing challenge in Natural Language Processing (NLP). However, applying multi-task NLP remains difficult for many applications, with multi-task models often performing worse than their single-task counterparts (Plank and Alonso, 2017; Bingel and Søgaard, 2017; McCann et al., 2018). Motivated by these results, we propose a way of applying knowledge distillation (Buciluǎ et al., 2006; Ba and Caruana, 2014; Hinton et al., 2015) so that single-task models effectively teach a multi-task model.

Knowledge distillation transfers knowledge from a “teacher” model to a “student” model by training the student to imitate the teacher’s outputs. In “born-again networks” (Furlanello et al., 2018), the teacher and student have the same neural architecture and model size, but surprisingly the student is able to surpass the teacher’s accuracy. Intuitively, distillation is effective because the teacher’s output distribution over classes provides more training signal than a one-hot label; Hinton et al. (2015) suggest that teacher outputs contain “dark knowledge” capturing additional information about training examples.

Our work extends born-again networks to the multi-task setting. We compare Single $\to$ Multi born-again distillation (we use Single $\to$ Multi to indicate distilling single-task “teacher” models into a multi-task “student” model) with several other variants (Single $\to$ Single and Multi $\to$ Multi), and also explore performing multiple rounds of distillation (Single $\to$ Multi $\to$ Single $\to$ Multi). Furthermore, we propose a simple teacher annealing method that helps the student model outperform its teachers. Teacher annealing gradually transitions the student from learning from the teacher to learning from the gold labels. This method ensures the student gets a rich training signal early in training but is not limited to only imitating the teacher.

Our experiments build upon recent success in self-supervised pre-training (Dai and Le, 2015; Peters et al., 2018) and multi-task fine-tune BERT (Devlin et al., 2019) to perform the tasks from the GLUE natural language understanding benchmark (Wang et al., 2019). Our training method, which we call Born-Again Multi-tasking (BAM; code is available at https://github.com/google-research/google-research/tree/master/bam), consistently outperforms standard single-task and multi-task training. Further analysis shows the multi-task models benefit from both better regularization and transfer between related tasks.

2 Related Work

Multi-task learning for neural networks in general (Caruana, 1997) and within NLP specifically (Collobert and Weston, 2008; Luong et al., 2016) has been widely studied. Much of the recent work for NLP has centered on neural architecture design: e.g., ensuring only beneficial information is shared across tasks (Liu et al., 2017; Ruder et al., 2019) or arranging tasks in linguistically-motivated hierarchies (Søgaard and Goldberg, 2016; Hashimoto et al., 2017; Sanh et al., 2019). These contributions are orthogonal to ours because we instead focus on the multi-task training algorithm.

Distilling large models into small models (Kim and Rush, 2016; Mou et al., 2016) or ensembles of models into single models (Kuncoro et al., 2016; Liu et al., 2019a) has been shown to improve results for many NLP tasks. There has also been some work on using knowledge distillation to aid in multi-task learning. In reinforcement learning, knowledge distillation has been used to regularize multi-task agents (Parisotto et al., 2016; Teh et al., 2017). In NLP, Tan et al. (2019) distill single-language-pair machine translation systems into a many-language system. However, they focus on multilingual rather than multi-task learning, use a more complex training procedure, and only experiment with Single $\to$ Multi distillation.

Concurrently with our work, several other recent works also explore fine-tuning BERT using multiple tasks (Phang et al., 2018; Liu et al., 2019b; Keskar et al., 2019). However, they use only standard transfer or multi-task learning, instead focusing on finding beneficial task pairs or designing improved task-specific components on top of BERT.

3 Methods

3.1 Multi-Task Setup

Model. All of our models are built on top of BERT (Devlin et al., 2019). This model passes byte-pair-tokenized (Sennrich et al., 2016) input sentences through a Transformer network (Vaswani et al., 2017), producing a contextualized representation for each token. The vector $c$ corresponding to the first input token (for BERT, a special token [CLS] that is prepended to each input sequence) is passed into a task-specific classifier. For classification tasks, we use a standard softmax layer: $\text{softmax}(Wc)$. For regression tasks, we normalize the labels so they are between 0 and 1 and then use a size-1 NN layer with a sigmoid activation: $\text{sigmoid}(w^{T}c)$. In our multi-task models, all of the model parameters are shared across tasks except for these classifiers on top of BERT, which means less than 0.01% of the parameters are task-specific. Following BERT, the token embeddings and Transformer are initialized with weights from a self-supervised pre-training phase.
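As a rough illustration of the two task-specific heads described above, the following plain-Python sketch applies $\text{softmax}(Wc)$ and $\text{sigmoid}(w^{T}c)$ to a toy vector. It is not the authors' implementation: the function names, the 4-dimensional stand-in for the [CLS] vector, and all weight values are assumptions chosen only for readability.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classification_head(W, c):
    """softmax(Wc): W holds one weight row per class, c is the [CLS] vector."""
    logits = [sum(w_j * c_j for w_j, c_j in zip(row, c)) for row in W]
    return softmax(logits)

def regression_head(w, c):
    """sigmoid(w^T c): one score in (0, 1), matching labels normalized to [0, 1]."""
    return sigmoid(sum(w_j * c_j for w_j, c_j in zip(w, c)))

# Hypothetical 4-dimensional [CLS] vector and head weights (illustration only).
c = [0.5, -1.0, 0.25, 2.0]
W = [[0.1, 0.2, 0.3, 0.4],
     [0.4, 0.3, 0.2, 0.1],
     [0.0, 0.0, 0.0, 0.0]]
w = [0.1, -0.2, 0.3, 0.0]

class_probs = classification_head(W, c)  # distribution over 3 classes
score = regression_head(w, c)            # single value in (0, 1)
```

In a multi-task model of this kind, only `W` and `w` would differ per task; everything that produces `c` would be shared.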

Training. Single-task training is performed as in Devlin et al. (2019). For multi-task training, examples of different tasks are shuffled together, even within minibatches. The summed loss across all tasks is minimized.

3.2 Knowledge Distillation

We use $\mathcal{D}_{\tau}=\{(x_{\tau}^{1},y_{\tau}^{1}),...,(x_{\tau}^{N},y_{\tau}^{N})\}$ to denote the training set for a task $\tau$ and $f_{\tau}(x,\theta)$ to denote the outputs for task $\tau$ produced by a neural network with parameters $\theta$ on the input $x$ (for classification tasks this is a distribution over classes). Standard supervised learning trains $\theta$ to minimize the loss on the training set:

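$$\mathcal{L}(\theta)=\sum_{x_{\tau}^{i},y_{\tau}^{i}\in\mathcal{D}_{\tau}}\ell(y_{\tau}^{i},f_{\tau}(x_{\tau}^{i},\theta))$$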

where for classification tasks $\ell$ is usually cross-entropy. Knowledge distillation trains the model to instead match the predictions of a teacher model with parameters $\theta^{\prime}$ :

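$$\mathcal{L}(\theta)=\sum_{x_{\tau}^{i},y_{\tau}^{i}\in\mathcal{D}_{\tau}}\ell(f_{\tau}(x_{\tau}^{i},\theta^{\prime}),f_{\tau}(x_{\tau}^{i},\theta))$$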

Note that our distilled networks are “born-again” in that the student has the same model architecture as the teacher, i.e., all of our models have the same prediction function $f_{\tau}$ for each task. For regression tasks, we train the student to minimize the L2 distance between its prediction and the teacher’s instead of using cross-entropy loss. Intuitively, knowledge distillation improves training because the full distribution over labels provided by the teacher provides a richer training signal than a one-hot label. See Furlanello et al. (2018) for a more thorough discussion.
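To make the contrast between the supervised and distillation objectives concrete, here is a minimal plain-Python sketch for a single example. It is not the authors' code: the helper names `cross_entropy` and `squared_error` and every numeric value are assumptions made for illustration.

```python
import math

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def squared_error(target, predicted):
    """Squared L2 distance between two scalar predictions (regression tasks)."""
    return (target - predicted) ** 2

# Hypothetical 3-class example.
gold = [0.0, 1.0, 0.0]             # one-hot gold label y
teacher = [0.10, 0.85, 0.05]       # f(x, theta'): teacher distribution
student = [0.20, 0.70, 0.10]       # f(x, theta): student distribution

# Supervised learning compares the student with the one-hot label.
supervised_loss = cross_entropy(gold, student)

# Distillation compares the student with the teacher's full distribution,
# so every class contributes to the training signal.
distill_loss = cross_entropy(teacher, student)

# For regression tasks the student matches the teacher's scalar output.
regression_distill_loss = squared_error(0.62, 0.55)
```

With a one-hot target only the gold class contributes to the loss, whereas the teacher target also penalizes how the student spreads probability over the other classes.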

Multi-Task Distillation. Given a set of tasks $\mathcal{T}$ , we train a single-task model with parameters $\theta_{\tau}$ on each task $\tau$ . For most experiments, we use the single-task models to teach a multi-task model with parameters $\theta$ :

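$$\mathcal{L}(\theta)=\sum_{\tau\in\mathcal{T}}\sum_{x_{\tau}^{i},y_{\tau}^{i}\in\mathcal{D}_{\tau}}\ell(f_{\tau}(x_{\tau}^{i},\theta_{\tau}),f_{\tau}(x_{\tau}^{i},\theta))$$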

However, we experiment with other distillation strategies as well.

Teacher Annealing. In knowledge distillation, the student is trained to imitate the teacher. This raises the concern that the student may be limited by the teacher’s performance and not be able to substantially outperform the teacher. To address this, we propose teacher annealing, which mixes the teacher prediction with the gold label during training. Specifically, the term in the summation becomes

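$$\ell(\lambda y_{\tau}^{i}+(1-\lambda)f_{\tau}(x_{\tau}^{i},\theta_{\tau}),f_{\tau}(x_{\tau}^{i},\theta))$$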

where $\lambda$ is linearly increased from 0 to 1 throughout training. Early in training, the model is mostly distilling to get as useful of a training signal as possible. Towards the end of training, the model is mostly relying on the gold-standard labels so it can learn to surpass its teachers.
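The annealed target can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the released implementation: the function name `teacher_annealed_target`, the schedule `lam = step / total_steps`, and the example distributions are assumptions consistent with the description of $\lambda$ rising linearly from 0 to 1.

```python
import math

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def teacher_annealed_target(gold, teacher_probs, step, total_steps):
    """Return lam * gold + (1 - lam) * teacher for the current step.

    lam rises linearly from 0 (pure distillation) to 1 (pure supervision).
    """
    lam = step / total_steps
    return [lam * g + (1.0 - lam) * t for g, t in zip(gold, teacher_probs)]

# Hypothetical 3-class example and training length.
gold = [0.0, 1.0, 0.0]
teacher_probs = [0.10, 0.85, 0.05]
student_probs = [0.20, 0.70, 0.10]
total_steps = 1000

start_target = teacher_annealed_target(gold, teacher_probs, 0, total_steps)
mid_target = teacher_annealed_target(gold, teacher_probs, 500, total_steps)
end_target = teacher_annealed_target(gold, teacher_probs, total_steps, total_steps)

# Loss for the student halfway through training.
mid_loss = cross_entropy(mid_target, student_probs)
```

At step 0 the target equals the teacher's distribution and at the final step it equals the gold label. Fixing `lam` at 0 or 0.5 instead corresponds to the "No teacher annealing" rows of the ablation in Table 4.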

4 Experiments

Data. We use the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019), which consists of 9 natural language understanding tasks on English data. Tasks cover textual entailment (RTE and MNLI), question-answer entailment (QNLI), paraphrase (MRPC), question paraphrase (QQP), textual similarity (STS), sentiment (SST-2), linguistic acceptability (CoLA), and Winograd Schema (WNLI).

Training Details. Rather than simply shuffling the datasets for our multi-task models, we follow the task sampling procedure from Bowman et al. (2018), where the probability of training on an example for a particular task $\tau$ is proportional to $|\mathcal{D}_{\tau}|^{0.75}$ . This ensures that tasks with very large datasets don’t overly dominate the training.
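The effect of the $|\mathcal{D}_{\tau}|^{0.75}$ rule can be checked with a short plain-Python sketch. The dataset sizes below are the rounded figures from the $|\mathcal{D}|$ row of Table 1; the function name `task_sampling_probs` and the use of `random.choices` to draw tasks are illustrative assumptions, not details taken from the paper or its code.

```python
import random

# Rounded training-set sizes from Table 1 (in examples).
DATASET_SIZES = {
    "CoLA": 8_500, "SST-2": 67_000, "MRPC": 3_700, "STS-B": 5_800,
    "QQP": 364_000, "MNLI": 393_000, "QNLI": 108_000, "RTE": 2_500,
}

def task_sampling_probs(sizes, exponent=0.75):
    """Probability of each task, proportional to |D_task| ** exponent."""
    weights = {task: n ** exponent for task, n in sizes.items()}
    total = sum(weights.values())
    return {task: w / total for task, w in weights.items()}

probs = task_sampling_probs(DATASET_SIZES)
# exponent=1.0 recovers plain shuffling of all examples together.
proportional = task_sampling_probs(DATASET_SIZES, exponent=1.0)

# Draw the task for each of 10 training examples (seeded for repeatability).
rng = random.Random(0)
tasks = list(probs)
sampled_tasks = rng.choices(tasks, weights=[probs[t] for t in tasks], k=10)
```

Compared with size-proportional sampling, the exponent lowers the share of the largest datasets (MNLI, QQP) and raises the share of the smallest ones (RTE, MRPC).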

We also use the layerwise-learning-rate trick from Howard and Ruder (2018). If layer 0 is the NN layer closest to the output, the learning rate for a particular layer $d$ is set to $\textsc{base\_lr}\cdot\alpha^{d}$ (i.e., layers closest to the input get lower learning rates). The intuition is that pre-trained layers closer to the input learn more general features, so they shouldn’t be altered much during training.
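A small plain-Python sketch shows how such a schedule could be computed. The function name is hypothetical, and the choice of 24 layers (the number of Transformer layers in BERT-Large) is an assumption; how the embeddings and the task classifiers are indexed is not specified here. The values `base_lr = 1e-4` and `alpha = 0.9` are the multi-task settings reported under Hyperparameters.

```python
def layerwise_learning_rates(base_lr, alpha, num_layers):
    """Learning rate per layer, where layer 0 is closest to the output.

    Layer d receives base_lr * alpha ** d, so layers nearer the input
    (larger d) are updated more conservatively when alpha < 1.
    """
    return [base_lr * alpha ** d for d in range(num_layers)]

# Assumed setup: 24 Transformer layers, multi-task hyperparameters.
rates = layerwise_learning_rates(base_lr=1e-4, alpha=0.9, num_layers=24)

# alpha = 1.0 disables the trick: every layer uses base_lr.
uniform_rates = layerwise_learning_rates(base_lr=1e-4, alpha=1.0, num_layers=24)
```

Setting `alpha` to 1.0 gives every layer the same rate, which is one of the two values considered for the single-task models.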

Hyperparameters. For single-task models, we use the same hyperparameters as in the original BERT experiments except we pick a layerwise-learning-rate decay $\alpha$ of 1.0 or 0.9 on the dev set for each task. For multi-task models, we train the model for longer (6 epochs instead of 3) and with a larger batch size (128 instead of 32), using $\alpha=0.9$ and a learning rate of 1e-4. All models use the BERT-Large pre-trained weights.

Reporting Results. Dev set results report the average score (Spearman correlation for STS, Matthews correlation for CoLA, and accuracy for the other tasks) on all GLUE tasks except WNLI, for which methods can’t outperform a majority baseline. Results show the median score of at least 20 trials with different random seeds. We find using a large number of trials is essential because results can vary significantly for different runs. For example, standard deviations in score are over $\pm 1$ for CoLA, RTE, and MRPC for multi-task models. Single-task standard deviations are even larger.

5 Results

Main Results. We compare models trained with single-task learning, multi-task learning, and several varieties of distillation in Table 1. While standard multi-task training improves over single-task training for RTE (likely because it is closely related to MNLI), there is no improvement on the other tasks. In contrast, Single $\to$ Multi knowledge distillation improves or matches the performance of the other methods on all tasks except STS, the only regression task in GLUE. We believe distillation does not work well for regression tasks because there is no distribution over classes passed on by the teacher to aid learning.

The gain for Single $\to$ Multi over Multi is larger than the gain for Single $\to$ Single over Single, suggesting that distillation works particularly well in combination with multi-task learning. Interestingly, Single $\to$ Multi works substantially better than Multi $\to$ Multi distillation. We speculate it may help that the student is exposed to a diverse set of teachers in the same way ensembles benefit from a diverse set of models, but future work is required to fully understand this phenomenon. In addition to the models reported in the table, we also trained Single $\to$ Multi $\to$ Single $\to$ Multi models. However, the difference with Single $\to$ Multi was not statistically significant, suggesting there is little value in multiple rounds of distillation.

Overall, a key benefit of our method is robustness: while standard multi-task learning produces mixed results, Single $\to$ Multi distillation consistently outperforms standard single-task and multi-task training. We also note that in some trials single-task training resulted in models that score quite poorly (e.g., less than 91 for QQP or less than 70 for MRPC), while the multi-task models have more dependable performance.

Note: For all statistical tests we use the Holm-Bonferroni method (Holm, 1979) to correct for multiple comparisons.
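For readers unfamiliar with the Holm-Bonferroni correction, the following is a minimal plain-Python sketch of the standard step-down procedure. The function name and the p-values are hypothetical and are not results from the paper.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected.

    P-values are visited from smallest to largest. The k-th smallest
    (k = 0, 1, ...) is compared with alpha / (m - k), and the procedure
    stops at the first p-value that fails its threshold.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # all larger p-values are also not rejected
    return reject

# Hypothetical p-values for four comparisons.
p_values = [0.01, 0.04, 0.03, 0.005]
decisions = holm_bonferroni(p_values, alpha=0.05)
```

In this toy example all four p-values are below 0.05 on their own, but only the two smallest survive the correction, which illustrates why correcting for multiple comparisons matters when many task-level tests are run.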

Test Set Results. We compare against recent work by submitting to the GLUE leaderboard. We use Single $\to$ Multi distillation. Following the procedure used by BERT, we train multiple models and submit the one with the highest average dev set score to the test set. BERT trained 10 models for each task (80 total); we trained 20 multi-task models. Results are shown in Table 2.

Our work outperforms or matches existing published results that do not rely on ensembling. However, due to the variance between trials discussed under “Reporting Results,” we think these test set numbers should be taken with a grain of salt, as they only show the performance of individual training runs (which is further complicated by the use of tricks such as dev set model selection). We believe significance testing over multiple trials would be needed to have a definitive comparison.

Single-Task Fine-Tuning. A crucial difference distinguishing our work from the STILTs, Snorkel MeTaL, and MT-DNN$_{KD}$ methods in Table 2 is that we do not single-task fine-tune our model. That is, we do not further train the model on individual tasks after the multi-task training finishes. While single-task fine-tuning improves results, we think to some extent it defeats the purpose of multi-task learning: the result of training is one model for each task instead of a model that can perform all of the tasks. Compared to having many single-task models, a multi-task model is simpler to deploy, faster to run, and arguably more scientifically interesting from the perspective of building general language-processing systems.

We evaluate the benefits of single-task fine-tuning in Table 3. Single-task fine-tuning initializes models with multi-task-learned weights and then performs single-task training. Hyperparameters are the same as for our single-task models except we use a smaller learning rate of 1e-5. While single-task fine-tuning unsurprisingly improves results, the gain on top of Single $\to$ Multi distillation is small, reinforcing the claim that distillation provides many of the benefits of single-task training while producing a single unified model instead of many task-specific models.

Ablation Study. We show the importance of teacher annealing and the other training tricks in Table 4. We found them all to significantly improve scores. Using pure distillation without teacher annealing (i.e., fixing $\lambda=0$ ) performs no better than standard multi-task learning, demonstrating the importance of the proposed teacher annealing method.

Comparing Combinations of Tasks. Training on a large number of tasks is known to help regularize multi-task models (Ruder, 2017). A related benefit of multi-task learning is the transfer of learned “knowledge” between closely related tasks. We investigate these two benefits by comparing several models on the RTE task, including one trained with a very closely related task (MNLI, a much larger textual entailment dataset) and one trained with fairly unrelated tasks (QQP, CoLA, and SST). We use Single $\to$ Multi distillation (Single $\to$ Single in the case of the RTE-only model). Results are shown in Table 5. We find both sets of auxiliary tasks improve RTE performance, suggesting that both benefits are playing a role in improving multi-task models. Interestingly, RTE + MNLI alone slightly outperforms the model performing all tasks, perhaps because training on MNLI, which has a very large dataset, is already enough to sufficiently regularize the model.

6 Discussion and Conclusion

We have shown that Single $\to$ Multi distillation combined with teacher annealing produces results consistently better than standard single-task or multi-task training. Achieving robust multi-task gains across many tasks has remained elusive in previous research, so we hope our work will make multi-task learning more broadly useful within NLP. However, with the exception of closely related tasks with small datasets (e.g., MNLI helping RTE), the overall size of the gains from our multi-task method is small compared to the gains provided by transfer learning from self-supervised tasks (i.e., BERT). It remains to be fully understood to what extent “self-supervised pre-training is all you need” and where transfer/multi-task learning from supervised tasks can provide the most value.

Acknowledgements

We thank Robin Jia, John Hewitt, and the anonymous reviewers for their thoughtful comments and suggestions. Kevin is supported by a Google PhD Fellowship.

Bibliography

1. Jimmy Ba and Rich Caruana. 2014. Do deep nets really need to be deep? In NIPS.
2. Joachim Bingel and Anders Søgaard. 2017. Identifying beneficial task relations for multi-task learning in deep neural networks. In EACL.
3. Samuel R Bowman, Ellie Pavlick, Edouard Grave, Benjamin Van Durme, Alex Wang, Jan Hula, Patrick Xia, Raghavendra Pappagari, R Thomas McCoy, Roma Patel, et al. 2018. Looking for ELMo’s friends: Sentence-level pretraining beyond language modeling. arXiv preprint arXiv:1812.10860.
4. Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. 2006. Model compression. In SIGKDD.
5. Rich Caruana. 1997. Multitask learning. Machine Learning.
6. Daniel M. Cer, Mona T. Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. Semeval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In SemEval@ACL.
7. Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: deep neural networks with multitask learning. In ICML.
8. Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In NIPS.