LRC-BERT: Latent-representation Contrastive Knowledge Distillation for Natural Language Understanding
Hao Fu, Shaojun Zhou, Qihong Yang, Junjie Tang, Guiquan Liu, Kaikui Liu, Xiaolong Li

TL;DR
LRC-BERT introduces a contrastive knowledge distillation approach with a gradient perturbation training architecture to create a compact, robust BERT model suitable for edge deployment, outperforming existing methods on GLUE benchmarks.
Contribution
The paper presents a novel contrastive distillation method and a gradient perturbation training architecture, enhancing model robustness and efficiency for natural language understanding.
Findings
LRC-BERT outperforms state-of-the-art distillation methods on GLUE datasets.
The contrastive distillation effectively captures intermediate layer distributions.
Gradient perturbation improves model robustness against adversarial attacks.
Abstract
Pre-trained models such as BERT have achieved strong results on a wide range of natural language processing problems. However, their large number of parameters demands significant memory and inference time, which makes them difficult to deploy on edge devices. In this work, we propose LRC-BERT, a knowledge distillation method based on contrastive learning that fits the output of the intermediate layer in terms of angular distance, an aspect not considered by existing distillation methods. Furthermore, we introduce a gradient perturbation-based training architecture in the training phase to increase the robustness of LRC-BERT, which is the first such attempt in knowledge distillation. Additionally, to better capture the distribution characteristics of the intermediate layer, we design a two-stage training method for the total distillation loss. Finally, by…
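To make the "angular distance" idea concrete, here is a minimal sketch of a contrastive distillation objective on intermediate-layer vectors. It is an illustration under stated assumptions, not the loss defined in the paper: the InfoNCE-style form, the `temperature` parameter, and the function names `cosine_distance` and `contrastive_distill_loss` are all hypothetical, and the student and teacher vectors are assumed to have already been projected to the same dimension.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(u, v): 0 when the vectors align, 2 when they are opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def contrastive_distill_loss(student, teacher_pos, teacher_negs, temperature=0.1):
    """InfoNCE-style contrastive loss (an assumption, not the paper's formula).

    Pulls the student's vector toward the teacher's vector for the same input
    (the positive) and away from teacher vectors for other inputs (the
    negatives). Only angles matter; vector norms are ignored.
    """
    # Cosine similarity is 1 - cosine distance; scale it by the temperature.
    pos_logit = (1.0 - cosine_distance(student, teacher_pos)) / temperature
    neg_logits = [
        (1.0 - cosine_distance(student, neg)) / temperature for neg in teacher_negs
    ]
    logits = [pos_logit] + neg_logits
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    # Negative log-probability of picking the positive among all candidates.
    return log_sum - pos_logit


# Toy 2-D vectors: one positive and two negatives from the teacher.
teacher_pos = [2.0, 0.0]
teacher_negs = [[0.0, 1.0], [-1.0, 0.0]]
aligned_loss = contrastive_distill_loss([1.0, 0.0], teacher_pos, teacher_negs)
misaligned_loss = contrastive_distill_loss([0.0, 1.0], teacher_pos, teacher_negs)
```

In this toy setup the loss is small when the student points in the same direction as the teacher's positive vector and large when it points toward a negative. Rescaling the student vector leaves the loss unchanged, which is the sense in which the objective depends on angle rather than magnitude.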
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques
Methods: Linear Layer · Contrastive Learning · Knowledge Distillation · Linear Warmup With Linear Decay · Attention Is All You Need · Layer Normalization · Dropout · Weight Decay · Dense Connections
