Towards Non-task-specific Distillation of BERT via Sentence Representation Approximation

Bowen Wu; Huan Zhang; Mengyuan Li; Zongsheng Wang; Qihang Feng; Junhong Huang; Baoxun Wang

arXiv:2004.03097 · cs.CL · April 8, 2020 · 1 citation

TL;DR

This paper introduces a universal sentence representation distillation method that compresses BERT into a simple LSTM model, maintaining versatility across tasks and outperforming task-specific distillation approaches.

Contribution

The proposed framework enables non-task-specific distillation of BERT into a lightweight model, preserving universal semantic knowledge for diverse NLP tasks.

Findings

01 Outperforms task-specific distillation methods on the GLUE benchmark

02 Is more efficient than larger models such as ELMo

03 Retains transfer-learning capability via fine-tuning

Abstract

Recently, BERT has become an essential ingredient of various NLP deep models due to its effectiveness and universal usability. However, the online deployment of BERT is often blocked by its large number of parameters and high computational cost. Many studies have shown that knowledge distillation is efficient at transferring knowledge from BERT into a model with fewer parameters. Nevertheless, current BERT distillation approaches mainly focus on task-specific distillation, and such methodologies lose the general semantic knowledge that gives BERT its universal usability. In this paper, we propose a distillation framework oriented toward sentence representation approximation that can distill the pre-trained BERT into a simple LSTM-based model without specifying tasks. Consistent with BERT, our distilled model is able to perform transfer learning via fine-tuning…
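The idea of approximating a teacher's sentence representations can be illustrated with a toy objective. The sketch below is not the paper's implementation. It assumes a mean-squared-error loss between student and teacher sentence vectors, which this excerpt does not confirm. It also treats the student vectors as free parameters, whereas a real student would be an LSTM encoder updated through backpropagation. All function names (`mse_loss`, `mse_grad`, `distill_step`) and the toy vectors are hypothetical.

```python
from typing import List

Vector = List[float]


def mse_loss(student: List[Vector], teacher: List[Vector]) -> float:
    """Mean squared error between student and teacher sentence vectors."""
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student, teacher):
        if len(s_vec) != len(t_vec):
            raise ValueError("student and teacher vectors must have the same size")
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count


def mse_grad(student: List[Vector], teacher: List[Vector]) -> List[Vector]:
    """Gradient of mse_loss with respect to every student vector component."""
    count = sum(len(vec) for vec in student)
    return [
        [2.0 * (s - t) / count for s, t in zip(s_vec, t_vec)]
        for s_vec, t_vec in zip(student, teacher)
    ]


def distill_step(student: List[Vector], teacher: List[Vector], lr: float) -> List[Vector]:
    """One gradient-descent step moving the student vectors toward the teacher's."""
    grad = mse_grad(student, teacher)
    return [
        [s - lr * g for s, g in zip(s_vec, g_vec)]
        for s_vec, g_vec in zip(student, grad)
    ]


# Toy stand-ins: in practice the teacher vectors would come from BERT and the
# student vectors from an LSTM encoder (both are assumptions of this sketch).
teacher_vecs = [[1.0, 0.0], [0.0, 1.0]]
student_vecs = [[0.0, 0.0], [0.0, 0.0]]

loss_before = mse_loss(student_vecs, teacher_vecs)
student_vecs = distill_step(student_vecs, teacher_vecs, lr=1.0)
loss_after = mse_loss(student_vecs, teacher_vecs)
```

Because the target is the teacher's sentence vector and not a task label, an objective of this kind needs no downstream task, which matches the "non-task-specific" framing in the abstract.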

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Sentiment Analysis and Opinion Mining

Methods: Linear Layer · Knowledge Distillation · Sigmoid Activation · Tanh Activation · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Dense Connections