Greedy-layer Pruning: Speeding up Transformer Models for Natural Language Processing

David Peer; Sebastian Stabinger; Stefan Engl; Antonio Rodriguez-Sanchez

arXiv:2105.14839 · cs.CL · March 30, 2022

TL;DR

This paper introduces greedy-layer pruning, a method that prunes layers from a transformer after pre-training so that model size can be adjusted per task. It achieves a better speedup-performance tradeoff than existing layer-wise pruning methods and approaches the performance of knowledge distillation.

Contribution

The paper proposes a novel greedy-layer pruning technique that outperforms existing layer-wise pruning methods and approaches knowledge distillation performance, enabling dynamic model size adjustment.
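The greedy idea named above can be illustrated with a minimal sketch. This page does not spell out the algorithm, so the code below is a generic greedy search under stated assumptions, not the authors' implementation: `greedy_layer_prune`, `evaluate`, `toy_score`, and the importance values are all hypothetical names and numbers introduced for illustration. In a real setting, `evaluate` would presumably fine-tune and validate a model containing only the given layers.

```python
from typing import Callable, List, Sequence, Tuple


def greedy_layer_prune(
    layer_ids: Sequence[int],
    evaluate: Callable[[Tuple[int, ...]], float],
    n_remove: int,
) -> List[int]:
    """Greedily drop `n_remove` layers, one per step.

    At each step, try removing every remaining layer and keep the removal
    that leaves the highest score. `evaluate` is a hypothetical callback
    that scores a model built from the given layer ids (higher is better).
    """
    kept = list(layer_ids)
    if not 0 <= n_remove < len(kept):
        raise ValueError("n_remove must be in [0, number of layers)")
    for _ in range(n_remove):
        # Build every candidate that has exactly one fewer layer.
        candidates = [kept[:i] + kept[i + 1:] for i in range(len(kept))]
        # Keep the candidate whose score is highest after the removal.
        kept = max(candidates, key=lambda c: evaluate(tuple(c)))
    return kept


# Toy stand-in for a validation metric: made-up per-layer importance values.
TOY_IMPORTANCE = {0: 30, 1: 5, 2: 20, 3: 10, 4: 25, 5: 15}


def toy_score(kept: Tuple[int, ...]) -> float:
    """Assumed scoring rule for the demo: sum of the kept layers' importance."""
    return float(sum(TOY_IMPORTANCE[i] for i in kept))


pruned = greedy_layer_prune(range(6), toy_score, n_remove=2)
print(pruned)  # [0, 2, 4, 5]
```

With the toy scores, the search first drops layer 1 and then layer 3, the two least important layers. Because each step only commits to one removal, the number of removed layers can be chosen freely, which is one way to read the "dynamic adjustment" claim.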

Findings

01 Outperforms current state-of-the-art layer-wise pruning methods.

02 Closes the performance gap with knowledge distillation.

03 Allows dynamic adjustment of model size for desired performance-speedup tradeoff.

Abstract

Fine-tuning transformer models after unsupervised pre-training reaches very high performance on many different natural language processing tasks. Unfortunately, transformers suffer from long inference times, which greatly increase costs in production. One possible solution is knowledge distillation, which transfers information from large teacher models to smaller student models. Knowledge distillation maintains high performance and reaches high compression rates; nevertheless, the size of the student model is fixed after pre-training and cannot be changed individually for a given downstream task and use case to reach a desired performance/speedup ratio. Another solution, which reduces model size in a much more fine-grained and computationally cheaper fashion, is to prune layers after pre-training. The price to pay is that the performance of…

Code & Models

Repositories

deepopinion/greedy-layer-pruning
PyTorch · Official

Taxonomy

Methods: Pruning · Knowledge Distillation