DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing

Pengcheng He; Jianfeng Gao; Weizhu Chen

arXiv:2111.09543 · cs.CL · March 27, 2023 · 394 cites

TL;DR

DeBERTaV3 pre-trains DeBERTa with a replaced token detection task and introduces gradient-disentangled embedding sharing, which improves pre-training efficiency and downstream performance and yields state-of-the-art results on multiple benchmarks.

Contribution

The paper replaces DeBERTa's MLM pre-training objective with RTD and proposes a gradient-disentangled embedding sharing method that improves both training efficiency and model quality.
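The gradient-disentangled embedding sharing idea can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation: it assumes the discriminator embedding is formed as a stop-gradient copy of the generator embedding plus a zero-initialized residual (E_D = sg(E_G) + E_delta), and it uses hypothetical quadratic losses as stand-ins for the generator (MLM) and discriminator (RTD) losses. All function and variable names are illustrative.

```python
# Toy sketch of gradient-disentangled embedding sharing (GDES).
# Assumption: E_D = stop_gradient(E_G) + E_delta, with E_delta starting at zero.
# The quadratic losses are hypothetical stand-ins for the MLM and RTD losses.

def grad_quadratic(x, target):
    """Gradient of 0.5 * ||x - target||^2 with respect to x."""
    return [xi - ti for xi, ti in zip(x, target)]

def vanilla_sharing_step(e, target_g, target_d, lr=0.1):
    """Vanilla sharing: a single embedding receives both gradients."""
    g_gen = grad_quadratic(e, target_g)
    g_disc = grad_quadratic(e, target_d)
    return [ei - lr * (a + b) for ei, a, b in zip(e, g_gen, g_disc)]

def gdes_step(e_g, e_delta, target_g, target_d, lr=0.1):
    """GDES: the discriminator gradient updates only the residual e_delta."""
    g_gen = grad_quadratic(e_g, target_g)
    # e_g is treated as a constant here, which mimics the stop-gradient.
    e_d = [g + d for g, d in zip(e_g, e_delta)]
    g_disc = grad_quadratic(e_d, target_d)
    new_e_g = [g - lr * gg for g, gg in zip(e_g, g_gen)]
    new_e_delta = [d - lr * gd for d, gd in zip(e_delta, g_disc)]
    return new_e_g, new_e_delta

# The generator is already at its optimum; the discriminator wants something else.
e_start = [1.0, 0.0]
target_gen, target_disc = [1.0, 0.0], [0.0, 1.0]

shared = vanilla_sharing_step(e_start, target_gen, target_disc)
e_g, e_delta = gdes_step(e_start, [0.0, 0.0], target_gen, target_disc)
print("vanilla shared embedding:", shared)
print("GDES generator embedding:", e_g, "residual:", e_delta)
```

In this toy setup the vanilla update drags the shared embedding away from the generator's optimum, while the GDES update leaves the generator embedding untouched and routes the discriminator's pull into the residual.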

Findings

1. DeBERTaV3 Large achieves a 91.37% score on GLUE, surpassing DeBERTa and ELECTRA.

2. The multilingual mDeBERTa outperforms XLM-R on XNLI with 79.8% accuracy.

3. The new method sets state-of-the-art results on NLP benchmarks while improving pre-training efficiency.

Abstract

This paper presents a new pre-trained language model, DeBERTaV3, which improves the original DeBERTa model by replacing masked language modeling (MLM) with replaced token detection (RTD), a more sample-efficient pre-training task. Our analysis shows that vanilla embedding sharing in ELECTRA hurts training efficiency and model performance. This is because the training losses of the discriminator and the generator pull token embeddings in different directions, creating "tug-of-war" dynamics. We thus propose a new gradient-disentangled embedding sharing method that avoids the tug-of-war dynamics, improving both training efficiency and the quality of the pre-trained model. We have pre-trained DeBERTaV3 using the same settings as DeBERTa to demonstrate its exceptional performance on a wide range of downstream natural language understanding (NLU) tasks. Taking the GLUE benchmark with eight…
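For readers unfamiliar with RTD, the objective named in the abstract can be sketched as follows. This is a hedged illustration of ELECTRA-style data construction, not the paper's code: the "generator" here is a hypothetical stand-in that samples uniformly from a vocabulary, whereas a real generator is a small MLM that predicts the masked positions. The function name, `mask_prob` default, and `seed` parameter are assumptions made for the example.

```python
import random

def make_rtd_example(tokens, vocab, mask_prob=0.15, seed=0):
    """Build (corrupted_tokens, labels) for replaced token detection.

    Hypothetical stand-in: replacements are sampled uniformly from `vocab`
    instead of being predicted by a trained generator.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            new_tok = rng.choice(vocab)  # the "generator" proposes a token
        else:
            new_tok = tok                # position left untouched
        corrupted.append(new_tok)
        # Label 1 means "replaced". A proposal that happens to equal the
        # original token counts as original (label 0).
        labels.append(int(new_tok != tok))
    return corrupted, labels

sentence = ["the", "chef", "cooked", "the", "meal"]
vocabulary = ["the", "chef", "cooked", "ate", "meal", "a"]
corrupted, labels = make_rtd_example(sentence, vocabulary, mask_prob=0.3)
print(corrupted)
print(labels)
```

The discriminator is trained to predict these per-token labels, so every position contributes to the loss rather than only the masked subset, which is the usual explanation for RTD's sample efficiency.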

Code & Models

Videos

DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing · slideslive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · XLM-R · Linear Layer · Dropout · Residual Connection · Dense Connections · Softmax · Weight Decay