TL;DR
This paper compares importance metrics for pruning large language models, with a focus on self-attention layers, and explores simple methods for recovering performance after pruning, with the aim of reducing resource costs without significant accuracy loss.
Contribution
It introduces adaptive importance metrics like Shapley value for pruning LLMs, analyzes layer-specific pruning effects, and proposes lightweight performance recovery techniques.
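The idea of scoring blocks by Shapley value can be sketched as follows. This is a minimal illustration, not the paper's estimator: the permutation-sampling scheme, the function name `shapley_block_importance`, and the additive `toy_utility` are assumptions made for the example. In practice the utility would be a downstream metric measured on the model with only the kept blocks active.

```python
import random
from typing import Callable, FrozenSet, List


def shapley_block_importance(
    num_blocks: int,
    utility: Callable[[FrozenSet[int]], float],
    num_permutations: int = 200,
    seed: int = 0,
) -> List[float]:
    """Monte Carlo Shapley estimate: average marginal gain of adding each block."""
    rng = random.Random(seed)
    totals = [0.0] * num_blocks
    for _ in range(num_permutations):
        order = list(range(num_blocks))
        rng.shuffle(order)
        kept = set()
        prev = utility(frozenset(kept))
        for block in order:
            kept.add(block)
            cur = utility(frozenset(kept))
            totals[block] += cur - prev  # marginal contribution of this block
            prev = cur
    return [t / num_permutations for t in totals]


def toy_utility(kept: FrozenSet[int]) -> float:
    """Hypothetical stand-in for a task metric: each block adds a fixed amount."""
    weights = [0.4, 0.1, 0.3, 0.2]
    return sum(weights[b] for b in kept)


scores = shapley_block_importance(4, toy_utility)
# Least important blocks come first, so they would be pruned first.
prune_order = sorted(range(4), key=lambda b: scores[b])
```

Because `toy_utility` is additive, every marginal contribution equals the block's weight and the estimate is exact. A real task metric is not additive, so the estimate would depend on the number of sampled permutations and on which task defines the utility.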
Findings
Self-attention layers are more amenable to pruning, allowing up to 33% removal without performance loss.
Adaptive importance metrics exhibit a trade-off across tasks: improving performance on one task may degrade it on another.
Simple additive bias or low-rank adapters can effectively recover pruned model performance.
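The two recovery options in the last finding can be illustrated on synthetic data. This is a sketch under stated assumptions, not the paper's procedure: this summary does not say how the bias or adapter is fitted, so the example uses closed-form fits (a mean for the bias, reduced-rank least squares for the adapter) on a made-up layer. The names `fit_recovery` and `mse`, the toy layer, and all sizes are hypothetical.

```python
import numpy as np


def mse(a, b):
    return float(np.mean((a - b) ** 2))


def fit_recovery(X, F, rank):
    """Fit a constant bias and a rank-`rank` linear adapter approximating F from X.

    X: (n, d) inputs to the pruned layer on calibration data.
    F: (n, d) update the layer used to add to the residual stream.
    """
    bias = F.mean(axis=0)                        # best constant replacement
    x_mean = X.mean(axis=0)
    Xc, R = X - x_mean, F - bias                 # centre inputs and targets
    A, *_ = np.linalg.lstsq(Xc, R, rcond=None)   # full-rank linear fit, (d, d)
    # Keep only the top-`rank` output directions of the fitted values.
    _, _, Vt = np.linalg.svd(Xc @ A, full_matrices=False)
    up = Vt[:rank]                               # (rank, d)
    down = A @ up.T                              # (d, rank)
    return bias, x_mean, down, up


# Synthetic stand-in for a layer's residual update (assumption, not a real LLM).
rng = np.random.default_rng(0)
n, d, rank = 512, 16, 4
X = rng.normal(size=(n, d))
W = 0.1 * rng.normal(size=(d, d))
c = rng.normal(size=d)
F = np.tanh(X @ W) + c

bias, x_mean, down, up = fit_recovery(X, F, rank)

err_none = mse(F, 0.0)                                  # layer removed, nothing added
err_bias = mse(F, bias)                                 # layer replaced by a bias
err_lowrank = mse(F, bias + (X - x_mean) @ down @ up)   # bias plus low-rank adapter
```

On this toy problem the error shrinks from `err_none` to `err_bias` to `err_lowrank`, which mirrors the intuition that a cheap replacement can stand in for part of what a removed layer contributed. Whether this carries over to a real model is an empirical question that the paper addresses.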
Abstract
Large Language Models (LLMs) are not only resource-intensive to train but even more costly to deploy in production. Therefore, recent work has attempted to prune blocks of LLMs based on cheap proxies for estimating block importance, effectively removing 10% of blocks in well-trained LLaMa-2 and Mistral 7b models without any significant degradation of downstream metrics. In this paper, we explore different block importance metrics by considering adaptive metrics such as Shapley value in addition to the static ones explored in prior work. We show that adaptive metrics exhibit a trade-off in performance between tasks, i.e., improvement on one task may degrade performance on another due to differences in the computed block influences. Furthermore, we extend this analysis from a complete block to individual self-attention and feed-forward layers, highlighting the propensity of the…
