AutoDistill: an End-to-End Framework to Explore and Distill Hardware-Efficient Language Models
Xiaofan Zhang, Zongwei Zhou, Deming Chen, Yu Emma Wang

TL;DR
AutoDistill is an end-to-end framework that combines neural architecture search and multi-objective optimization to produce hardware-efficient, high-performing NLP models with reduced latency and size.
Contribution
It introduces a novel framework integrating architecture exploration and multi-objective optimization for effective model distillation tailored to hardware constraints.
Findings
AutoDistill finds models with up to 3.2% higher pre-trained accuracy and up to 1.44x faster inference than MobileBERT.
Distilled models outperform BERT_BASE and other compact models on GLUE and SQuAD.
The framework reduces model size significantly while maintaining or improving performance.
Abstract
Recently, large pre-trained models have significantly improved the performance of various Natural Language Processing (NLP) tasks, but they are expensive to serve due to long serving latency and large memory usage. To compress these models, knowledge distillation has attracted an increasing amount of interest as one of the most effective methods for model compression. However, existing distillation methods have not yet addressed the unique challenges of model serving in datacenters, such as handling fast-evolving models, considering serving performance, and optimizing for multiple objectives. To solve these problems, we propose AutoDistill, an end-to-end model distillation framework integrating model architecture exploration and multi-objective optimization for building hardware-efficient NLP pre-trained models. We use Bayesian Optimization to conduct multi-objective Neural Architecture…
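To make the idea of multi-objective selection concrete, here is a minimal sketch of Pareto filtering over candidate architectures scored on accuracy and latency. It is not AutoDistill's search procedure (the abstract names Bayesian Optimization, which is not implemented here); the `Candidate` type, the architecture labels, and all numbers are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str          # hypothetical architecture label
    accuracy: float    # higher is better
    latency_ms: float  # lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.latency_ms <= b.latency_ms
    strictly_better = a.accuracy > b.accuracy or a.latency_ms < b.latency_ms
    return no_worse and strictly_better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Made-up numbers for illustration only; these are not results from the paper.
candidates = [
    Candidate("arch_a", accuracy=0.70, latency_ms=1.0),
    Candidate("arch_b", accuracy=0.72, latency_ms=0.8),
    Candidate("arch_c", accuracy=0.68, latency_ms=0.6),
    Candidate("arch_d", accuracy=0.69, latency_ms=0.9),
]
front = pareto_front(candidates)
print([c.name for c in front])
```

In this toy set, `arch_b` is both more accurate and faster than `arch_a` and `arch_d`, so those two drop out, while `arch_c` survives because nothing else matches its latency. A search loop would typically propose new candidates and keep refining such a front rather than optimizing a single score.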
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Dense Connections · Softmax · Multi-Head Attention · Knowledge Distillation · Residual Connection · MobileBERT
