# BAM! Born-Again Multi-Task Networks for Natural Language Understanding

**Authors:** Kevin Clark, Minh-Thang Luong, Urvashi Khandelwal, Christopher D. Manning, Quoc V. Le

arXiv: 1907.04829 · 2019-07-11

## TL;DR

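This paper introduces BAM!, a multi-task learning approach for NLP in which single-task models teach a multi-task model through knowledge distillation, enhanced with teacher annealing. When used to multi-task fine-tune BERT on the GLUE benchmark, it consistently improves over standard single-task and multi-task training.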

## Contribution

The paper introduces teacher annealing, a technique that gradually transitions the multi-task model from distillation (imitating its single-task teachers) to supervised learning, which helps the multi-task model surpass those teachers.
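The gradual shift from distillation to supervised learning can be illustrated with a small sketch. This is not the authors' code: it assumes the training target is a convex mix of the gold label and the teacher's predicted distribution, weighted by a coefficient `lam` that rises linearly from 0 to 1 over training. The function names, the linear schedule, and the toy probabilities are all illustrative assumptions.

```python
import math


def annealed_target(gold, teacher_probs, step, total_steps):
    """Mix a one-hot gold label with the teacher's predicted distribution.

    Assumption: lam grows linearly from 0 (pure distillation) to 1
    (pure supervised learning) as training progresses.
    """
    lam = min(max(step / total_steps, 0.0), 1.0)
    return [lam * g + (1.0 - lam) * t for g, t in zip(gold, teacher_probs)]


def cross_entropy(target, student_probs, eps=1e-12):
    """Cross-entropy between a (soft) target and the student's distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, student_probs))


# Toy 3-class example; all numbers are made up for illustration.
gold = [0.0, 1.0, 0.0]        # one-hot gold label
teacher = [0.2, 0.7, 0.1]     # single-task teacher's prediction
student = [0.3, 0.5, 0.2]     # multi-task student's prediction

for step in (0, 500, 1000):
    target = annealed_target(gold, teacher, step, total_steps=1000)
    loss = cross_entropy(target, student)
    print(step, [round(v, 3) for v in target], round(loss, 4))
```

Early in training the target equals the teacher's distribution, so the student only imitates the teacher. By the final step the target is the gold label alone, so the student is no longer constrained by the teacher's mistakes.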

## Key findings

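- BAM! consistently improves over standard single-task and multi-task training on the GLUE benchmark
- Knowledge distillation from single-task teachers helps multi-task models match or outperform their single-task counterparts
- Teacher annealing helps the multi-task model surpass its single-task teachers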

## Abstract

It can be challenging to train multi-task neural networks that outperform or even match their single-task counterparts. To help address this, we propose using knowledge distillation where single-task models teach a multi-task model. We enhance this training with teacher annealing, a novel method that gradually transitions the model from distillation to supervised learning, helping the multi-task model surpass its single-task teachers. We evaluate our approach by multi-task fine-tuning BERT on the GLUE benchmark. Our method consistently improves over standard single-task and multi-task training.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1907.04829/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/1907.04829/full.md

## References

44 references — full list in the complete paper: https://tomesphere.com/paper/1907.04829/full.md

---
Source: https://tomesphere.com/paper/1907.04829