TL;DR
This paper introduces greedy-layer pruning, a method that shrinks transformer models by removing layers after pre-training, so the model size can be adjusted per task. It achieves a better speedup-performance tradeoff than existing layer-wise pruning methods and comes close to the performance of knowledge distillation.
Contribution
The paper proposes greedy-layer pruning, a layer-wise pruning technique that outperforms existing layer-wise pruning methods, comes close to knowledge distillation in performance, and lets the model size be adjusted dynamically.
Findings
Outperforms current state-of-the-art layer-wise pruning methods.
Closes the performance gap with knowledge distillation.
Allows dynamic adjustment of model size for desired performance-speedup tradeoff.
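The summary above does not spell out the pruning procedure, so the following is only a minimal sketch of what a greedy layer-selection loop could look like, assuming that at each step the layer whose removal hurts a validation score the least is dropped. The function `greedy_layer_prune`, the `IMPORTANCE` table, and `toy_score` are hypothetical names introduced for illustration and are not taken from the paper; in practice the score would presumably come from evaluating the pruned model on a downstream task.

```python
from typing import Callable, List, Sequence, Tuple


def greedy_layer_prune(
    layers: Sequence[int],
    score: Callable[[Tuple[int, ...]], float],
    n_remove: int,
) -> Tuple[List[int], List[int]]:
    """Greedily drop `n_remove` layers; return (kept, removed_in_order).

    Assumption: `score` maps a tuple of kept layer indices to a
    validation metric where higher is better.
    """
    kept = list(layers)
    removed: List[int] = []
    for _ in range(min(n_remove, len(kept))):
        # Try removing each remaining layer and pick the removal that
        # leaves the highest-scoring model.
        best = max(
            kept,
            key=lambda layer: score(tuple(x for x in kept if x != layer)),
        )
        kept.remove(best)
        removed.append(best)
    return kept, removed


# Hypothetical per-layer importance values, standing in for a real
# validation run on a pruned transformer.
IMPORTANCE = {0: 0.30, 1: 0.05, 2: 0.20, 3: 0.02, 4: 0.25, 5: 0.18}


def toy_score(kept_layers: Tuple[int, ...]) -> float:
    # Toy metric: the sum of the importances of the layers still present.
    return sum(IMPORTANCE[layer] for layer in kept_layers)


kept, removed = greedy_layer_prune(range(6), toy_score, n_remove=2)
print("kept layers:", kept)
print("removed (in order):", removed)
```

Because the loop removes one layer at a time, stopping it earlier or later yields models of different depth, which is one way to read the "dynamic adjustment of model size" finding.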
Abstract
Fine-tuning transformer models after unsupervised pre-training reaches very high performance on many different natural language processing tasks. Unfortunately, transformers suffer from long inference times, which greatly increase costs in production. One possible solution is knowledge distillation, which addresses this problem by transferring information from large teacher models to smaller student models. Knowledge distillation maintains high performance and reaches high compression rates. Nevertheless, the size of the student model is fixed after pre-training and cannot be changed individually for a given downstream task and use case to reach a desired performance/speedup ratio. Another solution, which reduces model size in a much more fine-grained and computationally cheaper fashion, is to prune layers after pre-training. The price to pay is that the performance of…
Taxonomy
Methods: Pruning · Knowledge Distillation
