UniDrop: A Simple yet Effective Technique to Improve Transformer without Extra Cost
Zhen Wu, Lijun Wu, Qi Meng, Yingce Xia, Shufang Xie, Tao Qin, Xinyu Dai, Tie-Yan Liu

TL;DR
This paper introduces UniDrop, a unified dropout technique that combines feature, structure, and data dropout to enhance Transformer performance without additional costs, demonstrated through NLP tasks.
Contribution
The paper proposes UniDrop, a novel integrated dropout method for Transformers, uniting three dropout types to improve performance efficiently.
Findings
Achieves around 1.5 BLEU improvement on IWSLT14 translation
Improves accuracy in text classification tasks
Demonstrates effectiveness with pre-trained models like RoBERTa
Abstract
The Transformer architecture achieves great success in abundant natural language processing tasks. The over-parameterization of the Transformer model has motivated plenty of works to alleviate its overfitting for superior performance. With some exploration, we find that simple techniques such as dropout can greatly boost model performance with a careful design. Therefore, in this paper, we integrate different dropout techniques into the training of Transformer models. Specifically, we propose an approach named UniDrop to unite three different dropout techniques from fine-grained to coarse-grained, i.e., feature dropout, structure dropout, and data dropout. Theoretically, we demonstrate that these three dropouts play different roles from regularization perspectives. Empirically, we conduct experiments on both neural machine translation and text classification benchmark datasets. Extensive results…
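To make the three granularities named in the abstract concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, structure dropout is assumed to mean skipping whole layers, and data dropout is assumed to mean removing whole input tokens. The abstract itself does not specify these details.

```python
import random


def feature_dropout(vec, p, rng):
    """Fine-grained: zero individual features with probability p.

    Survivors are rescaled by 1 / (1 - p) (standard inverted dropout).
    """
    if p <= 0.0:
        return list(vec)
    if p >= 1.0:
        return [0.0 for _ in vec]
    keep = 1.0 - p
    return [x / keep if rng.random() >= p else 0.0 for x in vec]


def structure_dropout(layers, p, rng):
    """Coarser: skip whole layers with probability p.

    Assumption: "structure" means entire layers; the abstract does not say.
    """
    return [layer for layer in layers if rng.random() >= p]


def data_dropout(tokens, p, rng):
    """Coarsest: drop whole input tokens with probability p.

    Assumption: applied at token level; at least one token is always kept.
    """
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    print(data_dropout(["the", "cat", "sat", "down"], 0.2, rng))
    print(structure_dropout(["layer0", "layer1", "layer2"], 0.2, rng))
    print(feature_dropout([0.5, -1.0, 2.0, 0.25], 0.2, rng))
```

All three would be active only during training; at inference time each function would be called with p = 0, which returns its input unchanged.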
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Weight Decay · Layer Normalization · Linear Warmup With Linear Decay · Attention Dropout · WordPiece
