2SSP: A Two-Stage Framework for Structured Pruning of LLMs
Fabrizio Sandri, Elia Cunegatti, Giovanni Iacca

TL;DR
This paper introduces 2SSP, a two-stage structured pruning framework for large language models that combines width and depth pruning to efficiently reduce model size while maintaining performance.
Contribution
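The 2SSP framework combines width pruning (removing Feed-Forward Network neurons) with depth pruning (removing Attention submodules) and adds a mechanism that balances the sparsity rate of the two stages, outperforming existing methods in both pruning time and accuracy.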
Findings
Outperforms five state-of-the-art pruning methods.
Achieves up to 50% sparsity with minimal perplexity increase.
Reduces pruning time by up to two orders of magnitude.
Abstract
We propose a novel Two-Stage framework for Structured Pruning (2SSP) for pruning Large Language Models (LLMs), which combines two different pruning strategies, namely Width and Depth Pruning. The first stage (Width Pruning) removes entire neurons, and hence their corresponding rows and columns, aiming to preserve the connectivity among the pruned structures in the intermediate state of the Feed-Forward Networks in each Transformer block. This is done based on an importance score measuring the impact of each neuron on the output magnitude. The second stage (Depth Pruning), instead, removes entire Attention submodules. This is done by applying an iterative process that removes the Attention submodule with the minimum impact on a given metric of interest (in our case, perplexity). We also propose a novel mechanism to balance the sparsity rate of the two stages with respect to the desired global…
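The two stages described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The excerpt does not give the exact importance score, so the sketch assumes the mean absolute pre-activation of each intermediate neuron over a calibration set. The names `neuron_scores`, `width_prune`, `depth_prune`, and the `metric` callback are hypothetical, and the sparsity-balancing mechanism is not modeled.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

Matrix = List[List[float]]


def neuron_scores(w_in: Matrix, calib: Sequence[Sequence[float]]) -> List[float]:
    """Assumed score: mean |pre-activation| of each intermediate neuron."""
    scores = [0.0] * len(w_in)
    for x in calib:
        for j, row in enumerate(w_in):
            scores[j] += abs(sum(w * xi for w, xi in zip(row, x)))
    return [s / len(calib) for s in scores]


def width_prune(
    w_in: Matrix, w_out: Matrix, calib: Sequence[Sequence[float]], keep: int
) -> Tuple[Matrix, Matrix, List[int]]:
    """Keep the `keep` highest-scoring neurons of one feed-forward block.

    Neuron j is row j of w_in (hidden x model) and column j of w_out
    (model x hidden); both are dropped together so the shapes stay consistent.
    """
    scores = neuron_scores(w_in, calib)
    ranked = sorted(range(len(w_in)), key=lambda j: scores[j], reverse=True)
    kept = sorted(ranked[:keep])
    new_in = [w_in[j] for j in kept]
    new_out = [[row[j] for j in kept] for row in w_out]
    return new_in, new_out, kept


def depth_prune(
    n_blocks: int, n_remove: int, metric: Callable[[FrozenSet[int]], float]
) -> List[int]:
    """Greedily remove attention submodules, one per iteration.

    `metric(removed)` is a caller-supplied function (for example, perplexity
    on calibration data) evaluated with the given submodules removed. Each
    iteration removes the candidate that yields the lowest metric value.
    """
    removed: FrozenSet[int] = frozenset()
    for _ in range(n_remove):
        candidates = [b for b in range(n_blocks) if b not in removed]
        best = min(candidates, key=lambda b: metric(removed | {b}))
        removed = removed | {best}
    return sorted(removed)
```

In practice `metric` would run the partially pruned model on calibration data, so each greedy iteration costs one evaluation per remaining Attention submodule.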
Taxonomy
Topics: Semantic Web and Ontologies · Digital Rights Management and Security
Methods: Attention Is All You Need · Softmax · Adam · Residual Connection · Dropout · Absolute Position Encodings · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer
