Compression of Deep Learning Models for Text: A Survey
Manish Gupta, Puneet Agrawal

TL;DR
This survey reviews various methods for compressing large deep learning models in NLP, such as pruning and quantization, to facilitate their deployment in real-world applications with limited resources.
Contribution
It systematically categorizes and summarizes recent advances in NLP model compression techniques, providing a coherent overview for researchers and practitioners.
Findings
Six main compression methods identified and explained.
Comprehensive organization of recent NLP model compression research.
Highlights the importance of model efficiency for industry deployment.
Abstract
In recent years, the fields of natural language processing (NLP) and information retrieval (IR) have made tremendous progress thanks to deep learning models like Recurrent Neural Networks (RNNs), Gated Recurrent Units (GRUs) and Long Short-Term Memory (LSTMs) networks, and Transformer [120] based models like Bidirectional Encoder Representations from Transformers (BERT) [24], Generative Pre-training Transformer (GPT-2) [94], Multi-task Deep Neural Network (MT-DNN) [73], Extra-Long Network (XLNet) [134], Text-to-text transfer transformer (T5) [95], T-NLG [98] and GShard [63]. But these models are humongous in size. On the other hand, real world applications demand small model size, low response times and low computational power wattage. In this survey, we discuss six different types of methods (Pruning, Quantization, Knowledge Distillation, Parameter Sharing, Tensor Decomposition,…
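As a rough illustration of two of the methods named in the abstract (pruning and quantization), the following is a minimal NumPy sketch. It is not taken from the survey: the function names, the magnitude-based pruning criterion, and the symmetric 8-bit scheme are assumptions chosen for illustration only.

```python
import numpy as np


def magnitude_prune(weights, sparsity):
    """Zero out roughly the `sparsity` fraction of weights with the smallest magnitude.

    Hypothetical helper: ties at the threshold are all pruned, so slightly
    more than the requested fraction can end up as zero.
    """
    k = int(round(sparsity * weights.size))
    if k == 0:
        return weights.copy()
    # The threshold is the k-th smallest absolute value in the tensor.
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned


def quantize_int8(weights):
    """Symmetric uniform quantization to int8 (assumed scheme); returns (q, scale)."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Map int8 codes back to approximate float32 weights."""
    return q.astype(np.float32) * scale


if __name__ == "__main__":
    w = np.array([[0.1, -2.0], [0.5, 3.0]])
    print(magnitude_prune(w, 0.5))   # the two smallest-magnitude entries become 0
    q, scale = quantize_int8(w)
    print(q, scale)
    print(dequantize(q, scale))      # close to w, within scale / 2 per entry
```

In this sketch the pruned matrix keeps its dense shape, so any memory or latency benefit would depend on a sparse storage format or sparse kernels, while the int8 codes take a quarter of the space of float32 weights at the cost of a bounded rounding error.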
Taxonomy
Methods: Linear Layer · Knowledge Distillation · Absolute Position Encodings · Position-Wise Feed-Forward Layer · GShard · Layer Normalization · Adam · Attention Is All You Need · Multi-Head Attention · Byte Pair Encoding
