AdaBERT: Task-Adaptive BERT Compression with Differentiable Neural Architecture Search

Daoyuan Chen; Yaliang Li; Minghui Qiu; Zhen Wang; Bofang Li; Bolin Ding; Hongbo Deng; Jun Huang; Wei Lin; Jingren Zhou

arXiv:2001.04246·cs.CL·January 25, 2021·24 cites

TL;DR

AdaBERT introduces a task-adaptive BERT compression method using differentiable neural architecture search, significantly reducing model size and inference time while maintaining performance across NLP tasks.

Contribution

It proposes a novel task-specific BERT compression approach with neural architecture search and knowledge distillation, improving efficiency without sacrificing accuracy.

Findings

1. Achieves 12.7x to 29.3x faster inference than BERT.

2. Reduces parameter size by 11.5x to 17.0x relative to BERT.

3. Maintains task performance comparable to BERT.

Abstract

Large pre-trained language models such as BERT have shown their effectiveness in various natural language processing tasks. However, the huge parameter size makes them difficult to be deployed in real-time applications that require quick inference with limited resources. Existing methods compress BERT into small models while such compression is task-independent, i.e., the same compressed BERT for all different downstream tasks. Motivated by the necessity and benefits of task-oriented BERT compression, we propose a novel compression method, AdaBERT, that leverages differentiable Neural Architecture Search to automatically compress BERT into task-adaptive small models for specific tasks. We incorporate a task-oriented knowledge distillation loss to provide search hints and an efficiency-aware loss as search constraints, which enables a good trade-off between efficiency and effectiveness…
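The way a distillation loss and an efficiency-aware loss can jointly steer a differentiable architecture search can be illustrated with a minimal sketch. It is not the authors' implementation: the weighting scheme, the names `gamma`, `beta`, `arch_logits`, and `op_costs`, and the use of a softmax-weighted expected cost as the efficiency term are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Hard-label cross-entropy for a single example."""
    return -math.log(probs[label])


def soft_cross_entropy(teacher_probs, student_probs):
    """Cross-entropy between the teacher's and the student's distributions."""
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


def expected_cost(arch_logits, op_costs):
    """Differentiable proxy for model cost (assumed form): the softmax over
    architecture logits weights each candidate operation's cost."""
    weights = softmax(arch_logits)
    return sum(w * c for w, c in zip(weights, op_costs))


def search_loss(student_logits, teacher_logits, label,
                arch_logits, op_costs, gamma=0.5, beta=0.1):
    """Combined search objective (hypothetical weighting):
    (1 - gamma) * task loss + gamma * distillation loss + beta * efficiency."""
    student = softmax(student_logits)
    teacher = softmax(teacher_logits)
    task = cross_entropy(student, label)
    distill = soft_cross_entropy(teacher, student)
    efficiency = expected_cost(arch_logits, op_costs)
    return (1 - gamma) * task + gamma * distill + beta * efficiency


# Two candidate operations: a cheap one (cost 1.0) and an expensive one (cost 3.0).
cheap_biased = expected_cost([2.0, 0.0], [1.0, 3.0])
uniform = expected_cost([0.0, 0.0], [1.0, 3.0])
```

In this sketch, shifting the architecture logits toward the cheaper operation lowers the efficiency term, so gradient descent on the combined loss trades predictive quality against cost, with `beta` controlling how strongly small architectures are preferred.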

Code & Models

Repositories

alibaba/EasyTransfer/tree/master/scripts/knowledge_distillation/adabert (TensorFlow)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Linear Layer · Knowledge Distillation · Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay