Structured Pruning of Large Language Models

Ziheng Wang; Jeremy Wohlwend; Tao Lei

arXiv:1910.04732·cs.CL·March 30, 2021

TL;DR

This paper introduces a structured pruning method for large language models that reduces model size and latency while maintaining performance. Each weight matrix is parameterized by a low-rank factorization, and rank-1 components are adaptively removed during training.

Contribution

A novel structured pruning technique using low-rank factorization and adaptive rank-1 component removal, improving efficiency and performance over existing methods.
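To make the idea of removing rank-1 components concrete, here is a minimal sketch in plain Python. It assumes a weight matrix stored as a product P·Q with one gate value per rank-1 component; the names `P`, `Q`, `z`, `masked_product`, and `prune` are illustrative and not taken from the paper, and the gate values are fixed by hand rather than learned. How the paper decides which components to drop during training is not covered by this summary.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def masked_product(p, z, q):
    """Compute P @ diag(z) @ Q, i.e. the sum over k of z[k] * outer(P[:, k], Q[k, :])."""
    scaled_q = [[z_k * v for v in row] for z_k, row in zip(z, q)]
    return matmul(p, scaled_q)


def prune(p, z, q):
    """Drop every rank-1 component whose gate is zero; fold surviving gates into Q."""
    keep = [k for k, z_k in enumerate(z) if z_k != 0]
    p_small = [[row[k] for k in keep] for row in p]
    q_small = [[z[k] * v for v in q[k]] for k in keep]
    return p_small, q_small


def count_params(*mats):
    """Total number of entries across the given matrices."""
    return sum(len(m) * len(m[0]) for m in mats)


# A 4x4 weight stored as P (4x3) times Q (3x4): three rank-1 components.
P = [[1, 0, 2], [0, 1, 1], [3, 1, 0], [1, 2, 1]]
Q = [[1, 2, 0, 1], [0, 1, 1, 0], [2, 0, 1, 3]]
z = [1, 0, 1]  # hypothetical gates: the middle component is switched off

W_masked = masked_product(P, z, Q)   # gated weight, full-size factors
P_small, Q_small = prune(P, z, Q)    # physically smaller factors
W_pruned = matmul(P_small, Q_small)  # same weight, fewer parameters
```

Because a zero gate removes a whole column of `P` and a whole row of `Q`, the pruned factors are ordinary smaller dense matrices. That is what makes this kind of pruning structured: the savings show up as smaller matrix multiplications rather than as scattered zeros.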

Findings

01 Outperforms unstructured and block-structured pruning baselines on language modeling tasks at various compression levels

02 Achieves significant speedups during both training and inference

03 Effective in pruning adaptive word embeddings and BERT models

Abstract

Large language models have recently achieved state-of-the-art performance across a wide variety of natural language tasks. Meanwhile, the size and latency of these models have increased significantly, which makes them costly to use and raises an interesting question: do language models need to be large? We study this question through the lens of model compression. We present a generic, structured pruning approach that parameterizes each weight matrix using its low-rank factorization and adaptively removes rank-1 components during training. On language modeling tasks, our structured approach outperforms other unstructured and block-structured pruning baselines at various compression levels, while achieving significant speedups during both training and inference. We also demonstrate that our method can be applied to pruning adaptive word embeddings in large language models, and to…

Taxonomy

Methods: Pruning · Linear Layer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Dense Connections · Adam · WordPiece