A Fast Post-Training Pruning Framework for Transformers
Woosuk Kwon, Sehoon Kim, Michael W. Mahoney, Joseph Hassoun, Kurt Keutzer, Amir Gholami

TL;DR
This paper introduces a fast post-training pruning framework for Transformers that cuts inference cost without any retraining. It preserves accuracy through a Fisher-information-based mask search, mask rearrangement, and mask tuning, and it prunes a model in minutes.
Contribution
The authors propose a novel, retraining-free pruning framework for Transformers that employs a lightweight mask search, mask rearrangement, and mask tuning to efficiently prune models.
Findings
Achieves up to 2.0x FLOPs reduction and 1.56x speedup with <1% accuracy loss.
Prunes models in less than 3 minutes on a single GPU, much faster than existing methods.
Effective on BERT-base and DistilBERT, evaluated on GLUE and SQuAD.
Abstract
Pruning is an effective way to reduce the huge inference cost of Transformer models. However, prior work on pruning Transformers requires retraining the models. This can add high training cost and high complexity to model deployment, making it difficult to use in many practical situations. To address this, we propose a fast post-training pruning framework for Transformers that does not require any retraining. Given a resource constraint and a sample dataset, our framework automatically prunes the Transformer model using structured sparsity methods. To retain high accuracy without retraining, we introduce three novel techniques: (i) a lightweight mask search algorithm that finds which heads and filters to prune based on the Fisher information; (ii) mask rearrangement that complements the search algorithm; and (iii) mask tuning that reconstructs the output activations for each layer. We…
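The Fisher-information-based mask search in (i) can be illustrated with a toy example. The sketch below is not the paper's algorithm: it assumes a hypothetical scalar model whose output is the sum of per-unit contributions (each unit standing in for a head or filter), a squared-error loss, and a simple keep-the-top-k rule in place of a search under a FLOPs constraint. The names `fisher_mask_importance` and `select_mask` are made up for illustration. With every mask variable fixed at 1, the importance of a unit is estimated as the mean squared gradient of the loss with respect to that unit's mask, which is a diagonal empirical Fisher estimate.

```python
import random


def fisher_mask_importance(features, targets):
    """Diagonal empirical Fisher estimate for mask variables fixed at 1.

    features[n][i] is the contribution of prunable unit i to the scalar
    output for sample n. The toy loss is 0.5 * (sum(contribs) - target) ** 2.
    """
    num_units = len(features[0])
    scores = [0.0] * num_units
    for contribs, target in zip(features, targets):
        residual = sum(contribs) - target  # dL/d(output) with all masks = 1
        for i, c in enumerate(contribs):
            grad = residual * c  # dL/dm_i by the chain rule
            scores[i] += grad * grad
    return [s / len(features) for s in scores]


def select_mask(scores, keep):
    """Keep the `keep` units with the highest importance; prune the rest."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = set(ranked[:keep])
    return [1 if i in kept else 0 for i in range(len(scores))]


random.seed(0)
# Four hypothetical units; units 0 and 2 contribute with much larger magnitude.
scales = [1.0, 0.01, 2.0, 0.05]
features = [[s * random.gauss(0, 1) for s in scales] for _ in range(256)]
targets = [random.gauss(0, 1) for _ in range(256)]

scores = fisher_mask_importance(features, targets)
mask = select_mask(scores, keep=2)
print(mask)
```

In this synthetic setup the two large-magnitude units (0 and 2) receive the highest scores and are the ones kept. A real implementation would take the gradients from backpropagation over the sample dataset and would choose the mask subject to the given resource constraint rather than a fixed count.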
Taxonomy
Topics: Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Pruning · Linear Layer · Attention Dropout · Linear Warmup With Linear Decay · WordPiece · Weight Decay · Label Smoothing
