LRC-BERT: Latent-representation Contrastive Knowledge Distillation for Natural Language Understanding

Hao Fu; Shaojun Zhou; Qihong Yang; Junjie Tang; Guiquan Liu; Kaikui Liu; Xiaolong Li

arXiv:2012.07335·cs.CL·December 15, 2020·5 cites

TL;DR

LRC-BERT introduces a contrastive knowledge distillation approach with a gradient perturbation training architecture to create a compact, robust BERT model suitable for edge deployment, outperforming existing methods on GLUE benchmarks.

Contribution

The paper presents a novel contrastive distillation method and a gradient perturbation training architecture, enhancing model robustness and efficiency for natural language understanding.

Findings

01 LRC-BERT outperforms state-of-the-art distillation methods on GLUE datasets.

02 The contrastive distillation effectively captures intermediate layer distributions.

03 Gradient perturbation improves model robustness against adversarial attacks.

Abstract

Pre-trained models such as BERT have achieved strong results on a wide range of natural language processing problems. However, their large number of parameters demands significant memory and inference time, which makes them difficult to deploy on edge devices. In this work, we propose LRC-BERT, a knowledge distillation method based on contrastive learning that fits the output of the intermediate layer from the angular distance aspect, which existing distillation methods do not consider. Furthermore, we introduce a gradient perturbation-based training architecture in the training phase to increase the robustness of LRC-BERT, which is the first such attempt in knowledge distillation. Additionally, to better capture the distribution characteristics of the intermediate layer, we design a two-stage training method for the total distillation loss. Finally, by…
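
To make the idea of contrastive distillation over angular distance more concrete, the sketch below shows one generic way such a loss can be written. It is not the loss defined in the paper: it is a standard InfoNCE-style objective over cosine similarity, swapped in for illustration. The function names, the `temperature` parameter, and the use of in-batch teacher outputs as negatives are all assumptions. Each student vector is pulled toward the teacher vector for the same input and pushed away from the teacher vectors of the other inputs in the batch, and only the angle between vectors matters, not their length.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_distillation_loss(student, teacher, temperature=0.1):
    """Illustrative InfoNCE-style loss (an assumption, not the paper's formula).

    student, teacher: lists of vectors, one per input in the batch, in the
    same order. For student vector i, teacher vector i is the positive and
    every other teacher vector in the batch is a negative.
    """
    total = 0.0
    for i, s in enumerate(student):
        # Similarity of this student vector to every teacher vector.
        logits = [cosine_similarity(s, t) / temperature for t in teacher]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability assigned to the matching teacher vector.
        total += log_denominator - logits[i]
    return total / len(student)


# Hypothetical toy batch: two inputs with 2-dimensional hidden states.
teacher_hidden = [[1.0, 0.0], [0.0, 1.0]]
aligned_student = [[2.0, 0.0], [0.0, 3.0]]   # same directions, different norms
swapped_student = [[0.0, 1.0], [1.0, 0.0]]   # points at the wrong teacher

loss_aligned = contrastive_distillation_loss(aligned_student, teacher_hidden)
loss_swapped = contrastive_distillation_loss(swapped_student, teacher_hidden)
print(loss_aligned < loss_swapped)
```

Because the similarity is a cosine, rescaling a student vector leaves the loss unchanged, which is the sense in which such an objective constrains direction rather than magnitude. In practice the student and teacher hidden sizes usually differ, so a learned projection would be needed before comparing them; that step is omitted here.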

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques

Methods: Linear Layer · Contrastive Learning · Knowledge Distillation · Linear Warmup With Linear Decay · Attention Is All You Need · Layer Normalization · Dropout · Weight Decay · Dense Connections