Scheduled DropHead: A Regularization Method for Transformer Models

Wangchunshu Zhou; Tao Ge; Ke Xu; Furu Wei; Ming Zhou

arXiv:2004.13342·cs.CL·November 3, 2020·6 cites

Open Access · 1 Repo

TL;DR

Scheduled DropHead is a structured dropout technique for transformer models that drops entire attention heads during training, which regularizes multi-head attention and reduces overfitting on NLP tasks.

Contribution

It introduces DropHead, a novel structured dropout method for multi-head attention, with an adaptive dropout rate schedule to enhance regularization in transformer models.

Findings

01 Improves model regularization and reduces the risk of overfitting the training data.

02 Makes more effective use of multi-head attention by preventing a small portion of heads from dominating the model.

03 Reported effective on machine translation and text classification datasets.

Abstract

In this paper, we introduce DropHead, a structured dropout method specifically designed for regularizing the multi-head attention mechanism, which is a key component of the transformer, a state-of-the-art model for various NLP tasks. In contrast to conventional dropout mechanisms, which randomly drop units or connections, the proposed DropHead is a structured dropout method. It drops entire attention heads during training, which prevents the multi-head attention model from being dominated by a small portion of attention heads and reduces the risk of overfitting the training data, thus making more efficient use of the multi-head attention mechanism. Motivated by recent studies of the learning dynamics of the multi-head attention mechanism, we propose a specific dropout rate schedule to adaptively adjust the dropout rate of DropHead and achieve a better regularization effect.…
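
The head-level dropping described in the abstract can be illustrated with a short NumPy sketch. This is not the authors' implementation: the function name `drop_heads`, the tensor layout, and the `1 / (1 - p)` rescaling (borrowed from standard inverted dropout) are assumptions made for illustration. The rate `p` is fixed here, since the abstract as shown does not specify the schedule.

```python
import numpy as np

def drop_heads(head_outputs, p, rng, training=True):
    """Zero out whole attention heads with probability p (illustrative sketch).

    head_outputs: array of shape (batch, num_heads, seq_len, head_dim),
    an assumed layout for the per-head attention outputs.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return head_outputs
    batch, num_heads = head_outputs.shape[:2]
    # One keep/drop decision per (example, head), not per unit:
    # this is what makes the dropout "structured".
    keep = rng.random((batch, num_heads)) >= p
    # Assumption: rescale kept heads by 1 / (1 - p), as in inverted dropout,
    # so the expected magnitude matches what the model sees at inference.
    mask = keep / (1.0 - p)
    # Broadcast the per-head mask over sequence and feature dimensions.
    return head_outputs * mask[:, :, None, None]

rng = np.random.default_rng(0)
x = np.ones((4, 8, 5, 16))          # 4 examples, 8 heads
y = drop_heads(x, p=0.5, rng=rng)   # each head is either all zeros or all 2.0
```

A scheduled variant would pass a different `p` at each training step; the shape of that schedule is left out here because it is not given in the text above.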

Code & Models

Repositories

seunghwan1228/Transfomer-MachineTranslation · tf

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Softmax · Linear Layer · Multi-Head Attention · Dropout