ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning

TL;DR
ELECTRA introduces a more sample-efficient pre-training method for text encoders: a discriminator is trained to detect which input tokens were replaced by samples from a small generator. It outperforms masked language models like BERT in both compute efficiency and downstream performance.
Contribution
The paper proposes replaced token detection as a novel pre-training task, significantly improving efficiency and effectiveness over masked language modeling.
Findings
Outperforms BERT with the same compute and data.
Achieves strong results on GLUE with less training time.
Comparable or better performance than RoBERTa and XLNet with less compute.
Abstract
Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the…
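The corruption and labeling step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function name `corrupt_tokens`, the `replace_prob` default, and the use of a uniform random draw from a vocabulary in place of a trained generator network are all assumptions made for illustration. It only shows how a corrupted input and the per-token "replaced or not" labels for the discriminator could be constructed.

```python
import random


def corrupt_tokens(tokens, vocab, replace_prob=0.15, seed=0):
    """Build a toy replaced-token-detection example.

    Returns (corrupted, labels), where labels[i] is 1 if corrupted[i]
    differs from tokens[i] and 0 otherwise.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            # Assumption: a uniform draw from the vocabulary stands in for
            # a sample from the small generator network.
            new_tok = rng.choice(vocab)
        else:
            new_tok = tok
        corrupted.append(new_tok)
        # A sampled token that happens to equal the original is labeled 0.
        labels.append(int(new_tok != tok))
    return corrupted, labels


tokens = ["the", "chef", "cooked", "the", "meal"]
vocab = ["the", "chef", "cooked", "ate", "meal", "dog"]
corrupted, labels = corrupt_tokens(tokens, vocab, replace_prob=0.3)
for original, seen, label in zip(tokens, corrupted, labels):
    print(original, seen, label)
```

In this sketch every position receives a label, so a discriminator trained on `(corrupted, labels)` would get a training signal from all input tokens rather than only from the corrupted ones.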
Code & Models
- 🤗 AILabTUL/BiELECTRA-czech-slovak · model · 3 dl · ♡ 1
- 🤗 AILabTUL/mELECTRA · model · 97 dl · ♡ 1
- 🤗 Maltehb/aelaectra-danish-electra-small-cased-ner-dane · model · 99 dl · ♡ 2
- 🤗 Maltehb/aelaectra-danish-electra-small-cased · model · 418 dl · ♡ 2
- 🤗 Maltehb/aelaectra-danish-electra-small-uncased-ner-dane · model · 2 dl
- 🤗 Maltehb/aelaectra-danish-electra-small-uncased · model · 4 dl
- 🤗 Seznam/small-e-czech · model · 761 dl · ♡ 17
- 🤗 izumi-lab/bert-small-japanese-fin · model · 41 dl · ♡ 2
- 🤗 izumi-lab/bert-small-japanese · model · 109 dl · ♡ 5
- 🤗 izumi-lab/electra-base-japanese-discriminator · model · 19 dl · ♡ 2
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Linear Layer · Cosine Annealing · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Linear Warmup With Cosine Annealing · ELECTRA · RoBERTa · SentencePiece · Byte Pair Encoding
