DropAttention: A Regularization Method for Fully-Connected Self-Attention Networks
Lin Zehui, Pengfei Liu, Luyao Huang, Junkun Chen, Xipeng Qiu, Xuanjing Huang

TL;DR
DropAttention is a dropout method for the fully-connected self-attention layers in Transformers. It regularizes the attention weights to reduce overfitting and improves performance across a range of tasks.
Contribution
The paper proposes DropAttention, the first dropout method specifically designed for fully-connected self-attention layers in Transformers, enhancing generalization.
Findings
Improves model performance on multiple tasks
Reduces overfitting in Transformer models
Effective regularization for attention weights
Abstract
Variant dropout methods have been designed for the fully-connected, convolutional, and recurrent layers of neural networks and have been shown to be effective at avoiding overfitting. As an appealing alternative to recurrent and convolutional layers, the fully-connected self-attention layer surprisingly lacks a specific dropout method. This paper explores the possibility of regularizing the attention weights in Transformers to prevent different contextualized feature vectors from co-adaptation. Experiments on a wide range of tasks show that DropAttention can improve performance and reduce overfitting.
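The abstract does not spell out the mechanics, so the following is only a minimal sketch of what dropout on attention weights might look like, not the authors' implementation. It assumes that individual weights are zeroed at random and that each row is then renormalized to sum to 1. The function names, the renormalization step, and the fallback for a fully dropped row are all assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over one row of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def drop_attention_weights(weights, p, rng):
    """Zero each attention weight with probability p, then renormalize rows.

    Assumption: rows are rescaled to sum to 1 after dropping, so each
    row remains a probability distribution over the kept positions.
    """
    out = []
    for row in weights:
        kept = [w if rng.random() >= p else 0.0 for w in row]
        total = sum(kept)
        if total == 0.0:
            # Assumption: if every weight in a row was dropped,
            # fall back to the original, undropped row.
            kept, total = list(row), sum(row)
        out.append([w / total for w in kept])
    return out


rng = random.Random(0)
scores = [[1.0, 2.0, 0.5], [0.1, 0.3, 2.0]]
weights = [softmax(row) for row in scores]
dropped = drop_attention_weights(weights, p=0.3, rng=rng)
```

In this sketch the dropping would be applied only during training. At evaluation time, the attention weights would be used unchanged.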
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Machine Learning and ELM
Methods: Dropout
