QuaLA-MiniLM: a Quantized Length Adaptive MiniLM

Shira Guskin; Moshe Wasserblat; Chang Wang; Haihao Shen

arXiv:2210.17114 · cs.CL · May 11, 2023 · 1 citation

TL;DR

QuaLA-MiniLM is a highly efficient, quantized, length-adaptive transformer model that dynamically adjusts to various inference scenarios, achieving significant speedups with minimal accuracy loss on NLP tasks.

Contribution

This work introduces QuaLA-MiniLM, combining length adaptation, quantization, and knowledge distillation to produce a versatile, single-trained model for multiple inference budgets.
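As a rough illustration of the quantization ingredient named above, the following is a minimal sketch of symmetric per-tensor int8 quantization. It is a generic textbook scheme, not the procedure used in the paper, which this page does not describe; the function names `quantize_int8` and `dequantize_int8` and the sample weights are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization of floats to int8 codes in [-127, 127]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Guard against an all-zero tensor, where any positive scale works.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the representable range.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized: List[int], scale: float) -> List[float]:
    """Map int8 codes back to approximate float values."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only to make the rounding easy to follow.
weights = [0.50, -1.27, 0.03, 1.00]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the usual source of the small accuracy loss that quantized models trade for speed.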

Findings

1. Achieves up to 8.8x speedup with less than 1% accuracy loss.
2. Outperforms other efficient models across various computational budgets.
3. Dynamically fits any inference scenario with a single training process.

Abstract

Limited computational budgets often prevent transformers from being used in production and from having their high accuracy utilized. A knowledge distillation approach addresses computational efficiency by self-distilling BERT into a smaller transformer representation with fewer layers and a smaller internal embedding. However, the performance of these models drops as the number of layers is reduced, notably in advanced NLP tasks such as span question answering. In addition, a separate model must be trained for each inference scenario with its distinct computational budget. Dynamic-TinyBERT tackles both limitations by partially applying the Length Adaptive Transformer (LAT) technique to TinyBERT, achieving a 3x speedup over BERT-base with minimal accuracy loss. In this work, we expand the Dynamic-TinyBERT approach to generate a much more efficient model. We use MiniLM…
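The length-adaptive idea mentioned in the abstract can be sketched as progressively shrinking the token sequence from layer to layer. The example below is a simplified illustration under stated assumptions, not the paper's implementation: it assumes each layer has a per-token significance score and a target length, and that the first position (a classification-style token) is always kept. The names `keep_top_positions` and `apply_length_config`, the scores, and the length configuration `[4, 2]` are all hypothetical.

```python
from typing import List, Sequence


def keep_top_positions(
    positions: Sequence[int], scores: Sequence[float], keep: int
) -> List[int]:
    """Keep `keep` positions: the first one plus the highest-scoring of the rest.

    `scores` is indexed by original token position. The surviving positions
    are returned in their original order.
    """
    if keep >= len(positions):
        return list(positions)
    keep = max(keep, 1)
    first, rest = positions[0], positions[1:]
    # Rank the remaining positions by significance, highest first.
    ranked = sorted(rest, key=lambda p: scores[p], reverse=True)
    return sorted([first] + ranked[: keep - 1])


def apply_length_config(
    scores_per_layer: Sequence[Sequence[float]],
    seq_len: int,
    length_config: Sequence[int],
) -> List[List[int]]:
    """Return the token positions that survive after each layer."""
    positions = list(range(seq_len))
    history = []
    for layer_scores, keep in zip(scores_per_layer, length_config):
        positions = keep_top_positions(positions, layer_scores, keep)
        history.append(positions)
    return history


# Hypothetical significance scores for a 6-token input and a 2-layer model.
scores_per_layer = [
    [0.1, 0.9, 0.2, 0.8, 0.3, 0.7],
    [0.1, 0.2, 0.0, 0.9, 0.0, 0.5],
]
history = apply_length_config(scores_per_layer, seq_len=6, length_config=[4, 2])
```

Because the length configuration is an input rather than a property of the weights, one set of weights can be run with different configurations to meet different computational budgets, which is the property the abstract contrasts with training a separate model per scenario.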

Code & Models

Models

Intel/dynamic-minilmv2-L6-H384-squad1.1-int8-static (Hugging Face model, 7 downloads)

Taxonomy

Topics: Geophysical Methods and Applications · Speech Recognition and Synthesis · Soil Moisture and Remote Sensing

Methods: Multi-Head Attention · Attention Is All You Need · Linear Warmup With Linear Decay · Attention Dropout · Weight Decay · Position-Wise Feed-Forward Layer · Dense Connections · Label Smoothing · Linear Layer · Absolute Position Encodings