R-Drop: Regularized Dropout for Neural Networks
Xiaobo Liang, Lijun Wu, Juntao Li, Yue Wang, Qi Meng, Tao Qin, Wei Chen, Min Zhang, Tie-Yan Liu

TL;DR
R-Drop introduces a regularization method that enhances dropout by encouraging consistent output distributions from sub-models, leading to improved performance across diverse NLP and vision tasks, including state-of-the-art results in machine translation.
Contribution
It proposes R-Drop, a novel regularization strategy that enforces output consistency between dropout sub-models, improving neural network training and performance.
Findings
R-Drop improves results on 5 deep learning tasks across 18 datasets.
It achieves state-of-the-art BLEU scores on WMT14 translation tasks.
R-Drop enhances fine-tuning of large pre-trained models like BART and RoBERTa.
Abstract
Dropout is a powerful and widely used technique to regularize the training of deep neural networks. In this paper, we introduce a simple regularization strategy upon dropout in model training, namely R-Drop, which forces the output distributions of different sub models generated by dropout to be consistent with each other. Specifically, for each training sample, R-Drop minimizes the bidirectional KL-divergence between the output distributions of two sub models sampled by dropout. Theoretical analysis reveals that R-Drop reduces the freedom of the model parameters and complements dropout. Experiments on 5 widely used deep learning tasks (18 datasets in total), including neural machine translation, abstractive summarization, language understanding, language modeling, and image classification, show that R-Drop is universally effective. In particular, it yields substantial…
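The per-sample objective described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the toy linear classifier, the helper names (`forward`, `r_drop_loss`), the dropout rate, and the weighting coefficient `alpha` on the KL term are all assumptions made for the example. It runs the same input through the model twice with independent dropout masks, then adds the cross-entropy of both passes to the averaged two-way KL divergence between the two output distributions.

```python
import math
import random


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def kl_divergence(log_p, log_q):
    """KL(p || q) for two distributions given as log-probabilities."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def forward(features, weights, rate, rng):
    """Toy linear classifier with dropout on its inputs (hypothetical model)."""
    dropped = dropout(features, rate, rng)
    return [sum(w * x for w, x in zip(row, dropped)) for row in weights]


def r_drop_loss(logits_a, logits_b, target, alpha=1.0):
    """Cross-entropy on both passes plus `alpha` times the symmetric KL term.

    `alpha` is an assumed weighting hyperparameter for this sketch.
    """
    log_p, log_q = log_softmax(logits_a), log_softmax(logits_b)
    nll = -(log_p[target] + log_q[target])
    sym_kl = 0.5 * (kl_divergence(log_p, log_q) + kl_divergence(log_q, log_p))
    return nll + alpha * sym_kl


rng = random.Random(0)
features = [0.5, -1.2, 0.3, 2.0]
weights = [[0.1, 0.4, -0.3, 0.2], [-0.5, 0.2, 0.1, 0.3], [0.3, -0.1, 0.6, -0.2]]
# Same input, two passes: each call draws a fresh dropout mask,
# so the two passes correspond to two different sub models.
logits_a = forward(features, weights, rate=0.3, rng=rng)
logits_b = forward(features, weights, rate=0.3, rng=rng)
loss = r_drop_loss(logits_a, logits_b, target=2, alpha=1.0)
print(f"R-Drop loss for one sample: {loss:.4f}")
```

When the two passes produce identical logits, the KL term vanishes and the loss reduces to twice the ordinary cross-entropy, so the extra term only penalizes disagreement between the sub models.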
Taxonomy
Topics: Advanced Neural Network Applications · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Layer Normalization · Label Smoothing
