UniDrop: A Simple yet Effective Technique to Improve Transformer without Extra Cost

Zhen Wu; Lijun Wu; Qi Meng; Yingce Xia; Shufang Xie; Tao Qin; Xinyu Dai; Tie-Yan Liu

arXiv:2104.04946·cs.CL·April 13, 2021·1 cites

Open Access

TL;DR

This paper introduces UniDrop, a unified technique that combines feature, structure, and data dropout to improve Transformer performance without extra cost, evaluated on neural machine translation and text classification.

Contribution

The paper proposes UniDrop, an integrated dropout method for Transformer training that unites three dropout techniques, from fine-grained to coarse-grained: feature dropout, structure dropout, and data dropout.

Findings

01 Achieves around 1.5 BLEU improvement on IWSLT14 translation

02 Improves accuracy in text classification tasks

03 Demonstrates effectiveness with pre-trained models like RoBERTa

Abstract

The Transformer architecture has achieved great success in a wide range of natural language processing tasks. The over-parameterization of the Transformer model has motivated many works to alleviate its overfitting for better performance. With some exploration, we find that simple techniques such as dropout can greatly boost model performance with careful design. Therefore, in this paper, we integrate different dropout techniques into the training of Transformer models. Specifically, we propose an approach named UniDrop to unite three different dropout techniques from fine-grained to coarse-grained, i.e., feature dropout, structure dropout, and data dropout. Theoretically, we demonstrate that these three dropouts play different roles from regularization perspectives. Empirically, we conduct experiments on both neural machine translation and text classification benchmark datasets. Extensive results…
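To make the three granularities named in the abstract concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation. It assumes that feature dropout zeroes individual hidden features, that structure dropout skips whole layers, and that data dropout removes input tokens; the abstract does not spell out these details. All function names, the `rates` dictionary, and the toy `embed` and `layers` arguments are hypothetical.

```python
import random


def feature_dropout(features, p, rng):
    """Fine-grained: zero each feature with probability p, rescale the rest."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if p == 0.0:
        return list(features)
    scale = 1.0 / (1.0 - p)  # inverted dropout keeps the expected value
    return [0.0 if rng.random() < p else x * scale for x in features]


def structure_dropout(layers, p, rng):
    """Mid-grained (assumption): skip each whole layer with probability p."""
    return [layer for layer in layers if rng.random() >= p]


def data_dropout(tokens, p, rng):
    """Coarse-grained (assumption): drop each input token with probability p."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    # Keep at least one token so the toy model always has an input.
    return kept if kept else [rng.choice(tokens)]


def unified_forward(tokens, embed, layers, rates, rng, training=True):
    """Toy forward pass applying all three dropouts during training only."""
    if not training:
        rates = {"feature": 0.0, "structure": 0.0, "data": 0.0}
    tokens = data_dropout(tokens, rates["data"], rng)
    hidden = [embed(t) for t in tokens]
    for layer in structure_dropout(layers, rates["structure"], rng):
        hidden = [feature_dropout(layer(h), rates["feature"], rng) for h in hidden]
    return hidden


if __name__ == "__main__":
    rng = random.Random(0)
    embed = lambda t: [float(t), 1.0]            # hypothetical toy embedding
    layers = [lambda h: [2.0 * x for x in h]] * 2  # hypothetical toy layers
    rates = {"feature": 0.3, "structure": 0.5, "data": 0.2}
    print(unified_forward([1, 2, 3], embed, layers, rates, rng))
    print(unified_forward([1, 2, 3], embed, layers, rates, rng, training=False))
```

In this sketch all three dropouts are switched off at inference time, so the evaluated model is the full, deterministic network. That is one plausible reading of "without extra cost", but it is an assumption here and not something the abstract states.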

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Weight Decay · Layer Normalization · Linear Warmup With Linear Decay · Attention Dropout · WordPiece