QuaLA-MiniLM: a Quantized Length Adaptive MiniLM
Shira Guskin, Moshe Wasserblat, Chang Wang, Haihao Shen

TL;DR
QuaLA-MiniLM is a highly efficient, quantized, length-adaptive transformer model that dynamically adjusts to various inference scenarios, achieving significant speedups with minimal accuracy loss on NLP tasks.
Contribution
This work introduces QuaLA-MiniLM, which combines length adaptation, quantization, and knowledge distillation to produce a single model that is trained once and can serve multiple inference budgets.
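To make the quantization ingredient concrete, here is a minimal sketch of symmetric 8-bit quantization of a handful of weights. It is a generic illustration rather than the authors' pipeline: the function names `quantize_int8` and `dequantize_int8`, the per-tensor scale, and the sample weights are all assumptions chosen for the example.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127].

    Hypothetical helper for illustration; not taken from the paper.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.003, 0.9, -0.55]  # made-up example weights
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The rounding error per weight is bounded by half a quantization step (`scale / 2`), which is the intuition behind low-bit quantization costing little accuracy while shrinking storage and arithmetic.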
Findings
Achieves up to 8.8x speedup with less than 1% accuracy loss.
Outperforms other efficient models across various computational budgets.
Dynamically fits any inference scenario with a single training process.
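The idea of one model serving several budgets can be sketched as token dropping driven by a per-layer length configuration. This is a simplified toy under stated assumptions: the importance scores are fixed numbers (in a real length-adaptive model they would come from the network and change per layer), the names `drop_tokens` and `run_with_length_config` are hypothetical, and cost is approximated as the number of tokens each layer processes.

```python
def drop_tokens(scores, keep):
    """Return positions of the `keep` highest-scoring tokens, in original order."""
    keep = min(keep, len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


def run_with_length_config(num_tokens, length_config, score_fn):
    """Apply a per-layer length configuration.

    Returns the number of tokens each layer processes and the token ids
    that survive after the last layer.
    """
    active = list(range(num_tokens))
    processed = []
    for keep in length_config:
        processed.append(len(active))            # this layer sees the current tokens
        scores = [score_fn(i) for i in active]   # toy importance per active token
        kept = drop_tokens(scores, keep)
        active = [active[j] for j in kept]       # shorter sequence for the next layer
    return processed, active


# Made-up importance scores for an 8-token input.
importance = [0.9, 0.1, 0.5, 0.7, 0.2, 0.8, 0.3, 0.6]
length_config = (6, 4, 2)  # tokens kept after layers 1, 2, 3 (illustrative budget)

processed, active = run_with_length_config(
    len(importance), length_config, lambda i: importance[i]
)
relative_cost = sum(processed) / (len(importance) * len(length_config))
```

Choosing a different `length_config` at inference time changes `relative_cost` without retraining, which is the sense in which a single model can fit different budgets.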
Abstract
Limited computational budgets often prevent transformers from being used in production, leaving their high accuracy unexploited. Knowledge distillation addresses computational efficiency by self-distilling BERT into a smaller transformer with fewer layers and a smaller internal embedding. However, the performance of these models drops as the number of layers is reduced, notably on advanced NLP tasks such as span question answering. In addition, a separate model must be trained for each inference scenario with its distinct computational budget. Dynamic-TinyBERT tackles both limitations by partially applying the Length Adaptive Transformer (LAT) technique to TinyBERT, achieving a 3x speedup over BERT-base with minimal accuracy loss. In this work, we expand the Dynamic-TinyBERT approach to generate a much more efficient model. We use MiniLM…
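As a rough illustration of the knowledge distillation idea mentioned in the abstract, the sketch below computes a generic soft-label distillation loss: the KL divergence between temperature-softened teacher and student output distributions. This is an assumption-laden textbook form, not the specific objective used by MiniLM or TinyBERT; the function names, logits, and temperature are made up for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Generic soft-label distillation loss (hypothetical helper)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return kl_divergence(p, q)


teacher = [2.0, 0.5, -1.0]                                  # made-up teacher logits
loss_far = distillation_loss(teacher, [0.0, 0.0, 0.0])      # uninformed student
loss_near = distillation_loss(teacher, [1.9, 0.6, -0.9])    # student close to teacher
loss_same = distillation_loss(teacher, teacher)             # identical outputs
```

The loss shrinks as the student's outputs approach the teacher's and reaches zero when they match, which is what training the smaller model against the larger one aims for.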
Taxonomy
Topics: Geophysical Methods and Applications · Speech Recognition and Synthesis · Soil Moisture and Remote Sensing
Methods: Multi-Head Attention · Attention Is All You Need · Linear Warmup With Linear Decay · Attention Dropout · Weight Decay · Position-Wise Feed-Forward Layer · Dense Connections · Label Smoothing · Linear Layer · Absolute Position Encodings
