Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models

Cheolhyoung Lee; Kyunghyun Cho; Wanmo Kang

arXiv:1909.11299 · cs.LG · January 24, 2020 · 100 cites

TL;DR

Mixout is a novel regularization method inspired by dropout that improves the stability and accuracy of fine-tuning large pretrained language models, especially with limited training data.

Contribution

The paper introduces mixout, a new regularization technique that adaptively stabilizes fine-tuning of large language models, enhancing performance on downstream tasks.

Findings

01. Mixout improves fine-tuning stability.
02. Mixout increases average accuracy on GLUE tasks.
03. Mixout adapts regularization strength during training.

Abstract

In natural language processing, it has been observed recently that generalization could be greatly improved by finetuning a large-scale language model pretrained on a large unlabeled corpus. Despite its recent success and wide adoption, finetuning a large pretrained language model on a downstream task is prone to degenerate performance when there are only a small number of training instances available. In this paper, we introduce a new regularization technique, to which we refer as "mixout", motivated by dropout. Mixout stochastically mixes the parameters of two models. We show that our mixout technique regularizes learning to minimize the deviation from one of the two models and that the strength of regularization adapts along the optimization trajectory. We empirically evaluate the proposed mixout and its variants on finetuning a pretrained language model on downstream tasks.
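
The phrase "stochastically mixes the parameters of two models" can be illustrated with a minimal sketch. It assumes the two models are the current finetuned parameters `w` and the pretrained parameters `u`, that each entry of `w` is swapped for the matching entry of `u` with probability `p`, and that the result is shifted and rescaled so its expectation equals `w`. The function name `mixout`, the flat-list representation, and the `rng` argument are illustrative choices, not taken from the authors' code.

```python
import random


def mixout(w, u, p, rng=None):
    """Return a mixout-perturbed copy of the parameter list w.

    Assumed formulation: each entry of w is replaced by the matching
    entry of u with probability p, then the result is shifted and
    rescaled so that its expected value equals the original w.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    if len(w) != len(u):
        raise ValueError("w and u must have the same length")
    rng = rng or random.Random()
    mixed = []
    for w_i, u_i in zip(w, u):
        # Keep the current value, or fall back to the target value u_i.
        picked = u_i if rng.random() < p else w_i
        # Subtract p * u_i and divide by (1 - p) to keep the mean at w_i.
        mixed.append((picked - p * u_i) / (1.0 - p))
    return mixed


# Usage: with u set to zeros, the sketch reduces to inverted dropout.
current = [2.0, 2.0, 2.0, 2.0]
zeros = [0.0, 0.0, 0.0, 0.0]
print(mixout(current, zeros, p=0.5, rng=random.Random(0)))
```

With `u` fixed at zero, every entry is either dropped or scaled by `1 / (1 - p)`, which matches the dropout motivation mentioned in the abstract. With `u` set to pretrained weights, the perturbation pulls entries toward the pretrained model instead of toward zero.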

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Linear Layer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Dense Connections · Adam · WordPiece · Softmax