Kronecker Decomposition for GPT Compression
Ali Edalati, Marzieh Tahaei, Ahmad Rashid, Vahid Partovi Nia, James J. Clark, Mehdi Rezagholizadeh

TL;DR
This paper introduces Kronecker decomposition to compress GPT-2 models, enabling efficient deployment on limited hardware while maintaining high performance through minimal pre-training and knowledge distillation.
Contribution
It proposes a novel Kronecker decomposition-based method for compressing GPT-2, combined with light pre-training and knowledge distillation, to outperform existing smaller models.
Findings
KnGPT2 outperforms DistilGPT2 on benchmark tasks.
The method achieves significant compression with minimal retraining.
Efficient pre-training maintains high performance.
Abstract
GPT is an auto-regressive Transformer-based pre-trained language model which has attracted a lot of attention in the natural language processing (NLP) domain due to its state-of-the-art performance in several downstream tasks. The success of GPT is mostly attributed to its pre-training on a huge amount of data and its large number of parameters (from ~100M to billions of parameters). Despite the superior performance of GPT (especially in few-shot or zero-shot setups), its overparameterized nature can be prohibitive for deploying the model on devices with limited computational power or memory. This problem can be mitigated using model compression techniques; however, compressing GPT models has not been investigated much in the literature. In this work, we use Kronecker decomposition to compress the linear mappings of the GPT-2 model. Our Kronecker GPT-2 model (KnGPT2) is…
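To make the idea of compressing a linear mapping with a Kronecker decomposition concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names nearest_kronecker and kron_matvec, the factor shapes, and the use of a rank-1 SVD of the rearranged weight matrix (the standard nearest-Kronecker-product construction) are all choices made for this example. The abstract does not say how KnGPT2 chooses or initializes its factors.

```python
import numpy as np


def nearest_kronecker(W, shape_a, shape_b):
    """Return A, B minimizing ||W - kron(A, B)||_F (hypothetical helper).

    W has shape (m1*m2, n1*n2); A is (m1, n1) and B is (m2, n2).
    """
    m1, n1 = shape_a
    m2, n2 = shape_b
    assert W.shape == (m1 * m2, n1 * n2), "factor shapes must tile W"
    # Rearrange W so that kron(A, B) becomes the outer product vec(A) vec(B)^T.
    R = W.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)
    # The best rank-1 approximation of R gives the two factors.
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    scale = np.sqrt(s[0])
    A = (scale * U[:, 0]).reshape(m1, n1)
    B = (scale * Vt[0]).reshape(m2, n2)
    return A, B


def kron_matvec(A, B, x):
    """Compute kron(A, B) @ x without materializing the large matrix."""
    n1, n2 = A.shape[1], B.shape[1]
    X = x.reshape(n1, n2)              # row-major unflattening of x
    return (A @ X @ B.T).reshape(-1)   # row-major flattening of the result


# Toy usage: a 6x8 weight matrix stored as a 2x4 and a 3x2 factor.
rng = np.random.default_rng(0)
W = np.kron(rng.standard_normal((2, 4)), rng.standard_normal((3, 2)))
A, B = nearest_kronecker(W, (2, 4), (3, 2))
x = rng.standard_normal(8)
print(np.allclose(kron_matvec(A, B, x), W @ x))
print(W.size, A.size + B.size)  # parameters before and after
```

In this toy case W is an exact Kronecker product, so the two factors reproduce it while storing 14 numbers instead of 48. A pre-trained weight matrix would generally only be approximated, which is consistent with the retraining and distillation described above.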
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Healthcare
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Knowledge Distillation · GPT-2 · Cosine Annealing · Softmax · Weight Decay · Residual Connection · Linear Warmup With Cosine Annealing
