TL;DR
Putri is a structured pruning method for large language models that updates the un-pruned FFN weights, prunes the FFN layers sequentially, and removes individual attention heads, achieving state-of-the-art performance, especially at high sparsity levels.
Contribution
The paper introduces Putri, a simple yet effective post-training pruning technique that outperforms existing methods on large language models across a range of sparsity levels.
Findings
Putri achieves state-of-the-art performance in structured pruning of LLMs.
It effectively prunes models at extreme sparsity ratios.
The method generalizes well across different models and datasets.
Abstract
Large Language Models (LLMs) have experienced significant growth and development in recent years. However, performing inference on LLMs remains costly, especially for long-context inference or on resource-constrained devices. This motivates the development of new post-training pruning (PTP) methods. These methods reduce LLMs' requirements by removing a substantial part of the model's parameters. The discarded weights are selected depending on their impact on the model's performance. Current PTP methods prune the models by removing the least informative hidden nodes from the FFN layers and the least important attention layers. We propose Putri, a PTP method that introduces three changes to the state of the art. First, we update the un-pruned weights of the FFN to compensate for the introduced pruning error. Second, the FFN layers are pruned sequentially, taking into account the updates…
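The first change described in the abstract (updating the un-pruned FFN weights to compensate for the pruning error) can be illustrated with a small NumPy sketch. This is not the paper's algorithm: the importance score (activation norm times output-weight norm), the ReLU activation, the least-squares update on calibration data, and all names such as `prune_ffn_with_update` and `keep_ratio` are assumptions chosen only to show the general idea on a toy two-layer FFN.

```python
import numpy as np

def prune_ffn_with_update(X, W1, W2, keep_ratio=0.5):
    """Prune hidden nodes of a toy FFN y = relu(X @ W1) @ W2.

    Hypothetical sketch: the scoring rule and the least-squares
    update are illustrative assumptions, not the paper's method.
    """
    H = np.maximum(X @ W1, 0.0)   # hidden activations on calibration data
    Y = H @ W2                    # dense FFN output to be preserved

    # Assumed importance score: activation norm times output-row norm.
    scores = np.linalg.norm(H, axis=0) * np.linalg.norm(W2, axis=1)
    n_keep = max(1, int(round(keep_ratio * W1.shape[1])))
    keep = np.sort(np.argsort(scores)[-n_keep:])

    # Compensate: refit the kept output rows so the pruned FFN matches
    # the dense output as closely as possible in the least-squares sense.
    W2_updated, *_ = np.linalg.lstsq(H[:, keep], Y, rcond=None)
    return W1[:, keep], W2[keep], W2_updated, keep

def reconstruction_error(X, W1, W2, W1_p, W2_p):
    """Frobenius norm between dense and pruned FFN outputs."""
    dense = np.maximum(X @ W1, 0.0) @ W2
    pruned = np.maximum(X @ W1_p, 0.0) @ W2_p
    return float(np.linalg.norm(dense - pruned))

# Toy calibration data and random weights (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 16))
W1 = rng.normal(size=(16, 64))
W2 = rng.normal(size=(64, 16))

W1_p, W2_plain, W2_upd, keep = prune_ffn_with_update(X, W1, W2, 0.5)
err_plain = reconstruction_error(X, W1, W2, W1_p, W2_plain)
err_updated = reconstruction_error(X, W1, W2, W1_p, W2_upd)
print(f"error without update: {err_plain:.3f}")
print(f"error with update:    {err_updated:.3f}")
```

Because the original kept rows of `W2` are one feasible solution of the least-squares problem, the updated weights can never reconstruct the calibration output worse than plain pruning does. The sequential, layer-by-layer pruning and the attention-head removal mentioned in the abstract are not covered by this sketch.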
