ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Kevin Clark; Minh-Thang Luong; Quoc V. Le; Christopher D. Manning

arXiv:2003.10555 · cs.CL · March 25, 2020 · 541 cites

TL;DR

ELECTRA introduces a more sample-efficient pre-training method for text encoders by training discriminators to detect replaced tokens, outperforming traditional masked language models like BERT in both efficiency and performance.

Contribution

The paper proposes replaced token detection as a novel pre-training task, significantly improving efficiency and effectiveness over masked language modeling.

Findings

01. Outperforms BERT with the same compute and data.

02. Achieves strong results on GLUE with less training time.

03. Comparable or better performance than RoBERTa and XLNet with less compute.

Abstract

Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the…
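
The corruption step described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `toy_generator` is a hypothetical stand-in for the small generator network (it samples uniformly from a made-up vocabulary), and the function names, the `replace_prob` default of 0.15, and the choice to label a sampled token that happens to equal the original as "not replaced" are all assumptions made for the example.

```python
import random


def toy_generator(tokens, position, rng,
                  vocab=("the", "a", "chef", "cooked", "ate", "meal")):
    """Hypothetical stand-in for the small generator network.

    A real generator would propose a plausible token given the context;
    this one just samples uniformly from a toy vocabulary.
    """
    return rng.choice(vocab)


def corrupt(tokens, sample_fn, replace_prob=0.15, rng=None):
    """Build one replaced-token-detection training example.

    Each position is selected with probability `replace_prob` (an assumed
    value) and its token is swapped for a sample from `sample_fn`.
    Returns the corrupted sequence and one binary label per position:
    1 if the token now differs from the original, 0 otherwise.
    """
    if rng is None:
        rng = random.Random(0)
    corrupted, labels = [], []
    for position, original in enumerate(tokens):
        if rng.random() < replace_prob:
            candidate = sample_fn(tokens, position, rng)
        else:
            candidate = original
        corrupted.append(candidate)
        # Assumption: a sample identical to the original counts as "original".
        labels.append(1 if candidate != original else 0)
    return corrupted, labels


if __name__ == "__main__":
    sentence = ["the", "chef", "cooked", "the", "meal"]
    corrupted, labels = corrupt(sentence, toy_generator, replace_prob=0.4)
    print(corrupted)  # input the discriminator would see
    print(labels)     # per-token targets the discriminator would predict
```

The discriminator receives `corrupted` as input and is trained to predict `labels`, so every position in the sequence carries a training target, whereas an MLM objective asks the model to reconstruct the original tokens at the corrupted positions.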

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Linear Layer · Cosine Annealing · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Linear Warmup With Cosine Annealing · ELECTRA · RoBERTa · SentencePiece · Byte Pair Encoding