DropAttention: A Regularization Method for Fully-Connected Self-Attention Networks

Lin Zehui; Pengfei Liu; Luyao Huang; Junkun Chen; Xipeng Qiu; Xuanjing Huang

arXiv:1907.11065 · cs.CL · July 29, 2019 · 34 cites

TL;DR

DropAttention introduces a novel regularization technique for fully-connected self-attention layers in Transformers, aiming to prevent overfitting by regularizing attention weights, and demonstrates improved performance across various tasks.

Contribution

The paper proposes DropAttention, the first dropout method specifically designed for fully-connected self-attention layers in Transformers, enhancing generalization.

Findings

01 Improves model performance on multiple tasks

02 Reduces overfitting in Transformer models

03 Effective regularization for attention weights

Abstract

Various dropout methods have been designed for the fully-connected, convolutional, and recurrent layers of neural networks and have been shown to be effective at avoiding overfitting. As an appealing alternative to recurrent and convolutional layers, the fully-connected self-attention layer surprisingly lacks a specific dropout method. This paper explores the possibility of regularizing the attention weights in Transformers to prevent different contextualized feature vectors from co-adapting. Experiments on a wide range of tasks show that DropAttention can improve performance and reduce overfitting.
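The abstract does not spell out how the attention weights are regularized. As a rough illustration only, the sketch below assumes one plausible form: randomly zeroing individual post-softmax attention weights and then renormalizing each row so it still sums to 1. The function names `softmax` and `drop_attention`, the drop rate `p`, and the fallback for fully dropped rows are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def drop_attention(weights, p, rng, eps=1e-12):
    """Hypothetical sketch: zero each attention weight with probability p,
    then renormalize every row so it sums to 1 again.

    This is an assumed illustration, not the authors' exact procedure.
    """
    keep = rng.random(weights.shape) >= p          # True where a weight survives
    dropped = weights * keep
    row_sums = dropped.sum(axis=-1, keepdims=True)
    renormalized = dropped / np.maximum(row_sums, eps)
    # Assumption: if every weight in a row was dropped, keep the original row.
    return np.where(row_sums > eps, renormalized, weights)


rng = np.random.default_rng(0)
scores = rng.standard_normal((4, 4))               # toy query-key scores
attn = softmax(scores)                             # rows sum to 1
regularized = drop_attention(attn, p=0.3, rng=rng)
```

Renormalizing the surviving weights, instead of scaling them by 1/(1-p) as standard dropout does, keeps each row a valid probability distribution; whether that matches the paper's choice should be checked against the full text. As with ordinary dropout, a mask like this would be applied only during training.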

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Machine Learning and ELM

Methods: Dropout