Learning to Sample Replacements for ELECTRA Pre-Training

Yaru Hao; Li Dong; Hangbo Bao; Ke Xu; Furu Wei

arXiv:2106.13715 · cs.CL · June 28, 2021

TL;DR

This paper enhances ELECTRA pre-training by introducing a hardness prediction mechanism and focal loss for the generator, leading to more effective replacement sampling and improved downstream task performance.

Contribution

The paper proposes a hardness-aware replacement sampling mechanism and a focal loss for the generator, addressing the inefficient sampling and over-confident predictions in ELECTRA's generator-discriminator training.

Findings

01 Improved downstream task performance with the new sampling methods.

02 Reduced training variance of the discriminator.

03 Mitigated over-confidence bias in generator predictions.

Abstract

ELECTRA pretrains a discriminator to detect replaced tokens, where the replacements are sampled from a generator trained with masked language modeling. Despite the compelling performance, ELECTRA suffers from the following two issues. First, there is no direct feedback loop from discriminator to generator, which renders replacement sampling inefficient. Second, the generator's prediction tends to be over-confident along with training, making replacements biased to correct tokens. In this paper, we propose two methods to improve replacement sampling for ELECTRA pre-training. Specifically, we augment sampling with a hardness prediction mechanism, so that the generator can encourage the discriminator to learn what it has not acquired. We also prove that efficient sampling reduces the training variance of the discriminator. Moreover, we propose to use a focal loss for the generator in order…
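The focal loss mentioned in the abstract can be illustrated with a minimal sketch. It uses the standard focal loss form, -(1 - p_t)^gamma * log(p_t), applied to a single masked position; the paper's exact formulation and hyperparameters are not given on this page, so the function names, the default `gamma=2.0`, and the toy logits below are assumptions made for illustration only.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def focal_loss(logits, target, gamma=2.0):
    """Standard focal loss for one masked position (hypothetical helper).

    p_t is the generator's probability for the original token. The factor
    (1 - p_t) ** gamma shrinks the loss on tokens the generator already
    predicts confidently. gamma=2.0 is an assumed value, not the paper's.
    """
    log_p_t = log_softmax(logits)[target]
    p_t = math.exp(log_p_t)
    # Clamp so floating-point rounding can never make the base negative.
    weight = max(0.0, 1.0 - p_t) ** gamma
    return -weight * log_p_t


def cross_entropy(logits, target):
    """Ordinary masked language modeling loss: focal loss with gamma = 0."""
    return focal_loss(logits, target, gamma=0.0)


if __name__ == "__main__":
    # Toy 3-token vocabulary; index 0 is the original (correct) token.
    confident = [6.0, 0.0, 0.0]
    uncertain = [0.5, 0.0, 0.0]
    for name, logits in [("confident", confident), ("uncertain", uncertain)]:
        print(name, cross_entropy(logits, 0), focal_loss(logits, 0))
```

In this sketch the focal term down-weights the confident example far more strongly than the uncertain one, which is the general mechanism by which a focal loss discourages over-confident predictions.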

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · WordPiece · Dropout · Layer Normalization · Adam · Linear Warmup With Linear Decay · Focal Loss