DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing
Pengcheng He, Jianfeng Gao, Weizhu Chen

TL;DR
DeBERTaV3 improves DeBERTa by replacing masked language modeling with a replaced token detection task and by introducing gradient-disentangled embedding sharing, which raises both pre-training efficiency and downstream performance and yields state-of-the-art results on multiple benchmarks.
Contribution
The paper proposes a novel gradient-disentangled embedding sharing method that improves training efficiency and model quality in DeBERTaV3, along with replacing MLM with RTD for better pre-training.
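To make the replaced token detection (RTD) objective concrete, here is a minimal sketch of how one RTD training example could be built. It assumes the usual ELECTRA-style recipe: mask some positions, let a generator fill them in, and label every token as original (0) or replaced (1). The names `build_rtd_example`, `toy_generator`, and `MASK` are hypothetical and not taken from the paper's code.

```python
from typing import Callable, List, Sequence, Tuple

MASK = "[MASK]"  # placeholder mask symbol (assumption, not the real tokenizer's)


def build_rtd_example(
    tokens: Sequence[str],
    mask_positions: Sequence[int],
    generator_sample: Callable[[List[str], int], str],
) -> Tuple[List[str], List[int]]:
    """Return (corrupted_tokens, labels) for replaced token detection.

    labels[i] is 1 if the token at position i differs from the original,
    else 0. A generator guess that matches the original counts as 0.
    """
    # The generator sees the input with all chosen positions masked at once.
    masked = list(tokens)
    for pos in mask_positions:
        masked[pos] = MASK

    # Fill each masked position with the generator's sample.
    corrupted = list(tokens)
    for pos in mask_positions:
        corrupted[pos] = generator_sample(masked, pos)

    # The discriminator's target: which tokens were actually changed?
    labels = [int(c != o) for c, o in zip(corrupted, tokens)]
    return corrupted, labels


def toy_generator(masked_tokens: List[str], pos: int) -> str:
    """Stand-in for a masked language model: a fixed lookup by position."""
    guesses = {1: "dog", 5: "rug"}
    return guesses.get(pos, "the")


sentence = ["the", "dog", "sat", "on", "the", "mat"]
corrupted, labels = build_rtd_example(sentence, [1, 5], toy_generator)
print(corrupted)
print(labels)
```

Because the discriminator receives a label for every position rather than only the masked ones, each sequence yields more training signal, which is the sense in which RTD is described as more sample-efficient than MLM.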
Findings
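DeBERTaV3 Large achieves a 91.37% average score on the GLUE benchmark, surpassing both DeBERTa and ELECTRA.
The multilingual mDeBERTaV3 Base model reaches 79.8% zero-shot cross-lingual accuracy on XNLI, outperforming XLM-R Base.
Gradient-disentangled embedding sharing improves both training efficiency and pre-trained model quality over the vanilla embedding sharing used in ELECTRA.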
Abstract
This paper presents a new pre-trained language model, DeBERTaV3, which improves the original DeBERTa model by replacing mask language modeling (MLM) with replaced token detection (RTD), a more sample-efficient pre-training task. Our analysis shows that vanilla embedding sharing in ELECTRA hurts training efficiency and model performance. This is because the training losses of the discriminator and the generator pull token embeddings in different directions, creating the "tug-of-war" dynamics. We thus propose a new gradient-disentangled embedding sharing method that avoids the tug-of-war dynamics, improving both training efficiency and the quality of the pre-trained model. We have pre-trained DeBERTaV3 using the same settings as DeBERTa to demonstrate its exceptional performance on a wide range of downstream natural language understanding (NLU) tasks. Taking the GLUE benchmark with eight…
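The tug-of-war described in the abstract, and the way gradient-disentangled sharing sidesteps it, can be written compactly. This is a sketch in notation assumed here rather than taken from the text above: E_G and E_D are the generator and discriminator token embeddings, E_Δ is a residual embedding table initialized to zero, sg(·) is a stop-gradient operator, and λ is a loss weight.

```latex
% Vanilla embedding sharing (ELECTRA-style): a single embedding table
% receives gradients from both losses, which can point in different directions.
E_D = E_G, \qquad
\mathcal{L} = \mathcal{L}_{\mathrm{MLM}} + \lambda\,\mathcal{L}_{\mathrm{RTD}}

% Gradient-disentangled embedding sharing: the discriminator reuses the
% generator embeddings but cannot push gradients back into them.
E_D = \operatorname{sg}(E_G) + E_\Delta

% Consequence: the MLM loss updates E_G, the RTD loss updates only E_\Delta.
\frac{\partial \mathcal{L}_{\mathrm{RTD}}}{\partial E_G} = 0
```

Under this reading, the discriminator still benefits from what the generator's embeddings have learned, while the RTD loss shapes only the residual E_Δ, so the two objectives no longer compete over the same parameters.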
Code & Models
- 🤗 MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7 · model · 193k dl · ♡ 355
- 🤗 microsoft/deberta-v3-base · model · 2.3M dl · ♡ 413
- 🤗 microsoft/deberta-v3-small · model · 1.1M dl · ♡ 75
- 🤗 microsoft/deberta-v3-large · model · 977k dl · ♡ 274
- 🤗 MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli · model · 38k dl · ♡ 120
- 🤗 MoritzLaurer/xlm-v-base-mnli-xnli · model · 171 dl · ♡ 23
- 🤗 Morton-Li/QiDeBERTa-large · model · 7 dl · ♡ 1
- 🤗 MoritzLaurer/DeBERTa-v3-base-mnli-fever-docnli-ling-2c · model · 744 dl · ♡ 12
- 🤗 MoritzLaurer/DeBERTa-v3-small-mnli-fever-docnli-ling-2c · model · 87 dl
- 🤗 MoritzLaurer/DeBERTa-v3-xsmall-mnli-fever-anli-ling-binary · model · 60k dl · ♡ 6
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · XLM-R · Linear Layer · Dropout · Residual Connection · Dense Connections · Softmax · Weight Decay
