AdaBERT: Task-Adaptive BERT Compression with Differentiable Neural Architecture Search
Daoyuan Chen, Yaliang Li, Minghui Qiu, Zhen Wang, Bofang Li, Bolin Ding, Hongbo Deng, Jun Huang, Wei Lin, Jingren Zhou

TL;DR
AdaBERT introduces a task-adaptive BERT compression method using differentiable neural architecture search, significantly reducing model size and inference time while maintaining performance across NLP tasks.
Contribution
It proposes a novel task-specific BERT compression approach with neural architecture search and knowledge distillation, improving efficiency without sacrificing accuracy.
Findings
Compressed models run inference 12.7x to 29.3x faster than BERT.
Compressed models have 11.5x to 17.0x fewer parameters than BERT.
Task performance remains comparable to BERT.
Abstract
Large pre-trained language models such as BERT have shown their effectiveness in various natural language processing tasks. However, the huge parameter size makes them difficult to be deployed in real-time applications that require quick inference with limited resources. Existing methods compress BERT into small models while such compression is task-independent, i.e., the same compressed BERT for all different downstream tasks. Motivated by the necessity and benefits of task-oriented BERT compression, we propose a novel compression method, AdaBERT, that leverages differentiable Neural Architecture Search to automatically compress BERT into task-adaptive small models for specific tasks. We incorporate a task-oriented knowledge distillation loss to provide search hints and an efficiency-aware loss as search constraints, which enables a good trade-off between efficiency and effectiveness…
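The abstract describes a search objective that combines a task loss with a distillation loss (search hints) and an efficiency-aware loss (search constraints). The snippet below is a minimal sketch of that idea, not the authors' implementation: the function names, the weights `kd_weight` and `eff_weight`, the per-operation costs, and the use of a plain softmax over architecture parameters are all assumptions made for illustration.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    # Standard task loss on the student's prediction.
    return -math.log(softmax(logits)[label])

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    # Soft cross-entropy between teacher and student distributions
    # (a common distillation form; assumed here, not taken from the paper).
    t = softmax([z / temperature for z in teacher_logits])
    s = softmax([z / temperature for z in student_logits])
    return -sum(ti * math.log(si) for ti, si in zip(t, s))

def efficiency_loss(arch_params, op_costs):
    # Expected cost of one searchable edge: the discrete choice among
    # candidate operations is relaxed to a softmax over arch_params,
    # so the cost is differentiable with respect to those parameters.
    probs = softmax(arch_params)
    return sum(p * c for p, c in zip(probs, op_costs))

def search_loss(student_logits, teacher_logits, label,
                arch_params, op_costs, kd_weight=0.5, eff_weight=0.1):
    # Hypothetical weighting of the three terms.
    ce = cross_entropy(student_logits, label)
    kd = distillation_loss(student_logits, teacher_logits)
    eff = efficiency_loss(arch_params, op_costs)
    return ce + kd_weight * kd + eff_weight * eff
```

In this sketch, raising `eff_weight` pushes the architecture parameters toward cheaper operations, while `kd_weight` controls how strongly the teacher's outputs guide the search; that is the efficiency versus effectiveness trade-off the abstract refers to.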
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Linear Layer · Knowledge Distillation · Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay
