A deeper look at depth pruning of LLMs

Shoaib Ahmed Siddiqui; Xin Dong; Greg Heinrich; Thomas Breuel; Jan Kautz; David Krueger; Pavlo Molchanov

arXiv:2407.16286·cs.LG·July 24, 2024

TL;DR

This paper compares block importance metrics for depth pruning of large language models, extends the analysis from whole blocks to individual self-attention and feed-forward layers, and explores simple methods to recover performance after pruning, with the aim of reducing deployment cost without significant accuracy loss.

Contribution

The paper evaluates adaptive block importance metrics such as Shapley value alongside the static metrics used in prior work, analyzes how pruning affects self-attention and feed-forward layers separately, and proposes lightweight performance recovery techniques.

Findings

01

Self-attention layers are more amenable to pruning, allowing up to 33% removal without performance loss.

02

Adaptive importance metrics exhibit a trade-off between tasks: improving performance on one task may degrade it on another.

03

Simple additive bias or low-rank adapters can effectively recover pruned model performance.
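
To make the additive-bias idea concrete, here is a minimal sketch in plain Python. It assumes a residual block of the form x + f(x) and replaces the pruned update f(x) with a constant vector. The summary above does not say how the bias is obtained, so this sketch uses the mean update over a small calibration set, which is the least-squares optimal constant; `block_update`, `calibration`, and the toy numbers are all hypothetical.

```python
import math


def block_update(x):
    """Hypothetical residual update f(x) of the block that will be pruned."""
    return [0.5 + 0.1 * math.tanh(v) for v in x]


def mean_update(inputs):
    """Average the block's update over calibration inputs (assumed recipe)."""
    dim = len(inputs[0])
    sums = [0.0] * dim
    for x in inputs:
        for i, u in enumerate(block_update(x)):
            sums[i] += u
    return [s / len(inputs) for s in sums]


def reconstruction_error(inputs, bias):
    """Mean squared error between the full block output and the pruned one."""
    total, count = 0.0, 0
    for x in inputs:
        full = [xi + ui for xi, ui in zip(x, block_update(x))]
        pruned = [xi + bi for xi, bi in zip(x, bias)]
        total += sum((f - p) ** 2 for f, p in zip(full, pruned))
        count += len(x)
    return total / count


# Hypothetical calibration activations (three tokens, hidden size three).
calibration = [[-1.0, 0.0, 1.0], [0.5, -0.5, 2.0], [1.5, 0.2, -2.0]]

bias = mean_update(calibration)
error_dropped = reconstruction_error(calibration, [0.0] * 3)  # block removed
error_with_bias = reconstruction_error(calibration, bias)     # block emulated
print(error_dropped, error_with_bias)
```

In this toy setting the bias absorbs the constant part of the update, so the emulated block lands much closer to the original output than simply deleting it. A low-rank adapter would extend the same idea by adding an input-dependent term.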

Abstract

Large Language Models (LLMs) are not only resource-intensive to train but even more costly to deploy in production. Therefore, recent work has attempted to prune blocks of LLMs based on cheap proxies for estimating block importance, effectively removing 10% of blocks in well-trained LLaMa-2 and Mistral 7b models without any significant degradation of downstream metrics. In this paper, we explore different block importance metrics by considering adaptive metrics such as Shapley value in addition to static ones explored in prior work. We show that adaptive metrics exhibit a trade-off in performance between tasks, i.e., improvement on one task may degrade performance on the other due to differences in the computed block influences. Furthermore, we extend this analysis from a complete block to individual self-attention and feed-forward layers, highlighting the propensity of the…
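
As an illustration of what a Shapley-value block importance metric computes, the sketch below estimates Shapley values by Monte Carlo permutation sampling. It is not the paper's implementation: `toy_value` is a hypothetical stand-in for "evaluate the model on a task with only these blocks enabled", and the function names and sample count are assumptions.

```python
import random


def shapley_block_importance(num_blocks, value_fn, num_permutations=200, seed=0):
    """Monte Carlo estimate of each block's Shapley value.

    value_fn(active) receives a frozenset of kept block indices and returns
    a score where higher is better (for example, task accuracy).
    """
    rng = random.Random(seed)
    totals = [0.0] * num_blocks
    order = list(range(num_blocks))
    for _ in range(num_permutations):
        rng.shuffle(order)
        active = set()
        previous = value_fn(frozenset(active))
        for block in order:
            # Marginal gain from enabling this block given the ones before it.
            active.add(block)
            current = value_fn(frozenset(active))
            totals[block] += current - previous
            previous = current
    return [t / num_permutations for t in totals]


def toy_value(active):
    """Hypothetical task score: blocks 1 and 2 are redundant with each other."""
    score = 0.0
    if 0 in active:
        score += 3.0
    if 3 in active:
        score += 2.0
    if 1 in active or 2 in active:
        score += 1.0
    return score


scores = shapley_block_importance(4, toy_value)
# Blocks with the lowest estimated contribution are pruned first.
prune_order = sorted(range(4), key=lambda b: scores[b])
print(scores, prune_order)
```

Because the score depends on the task used inside `value_fn`, swapping in a different task can change the ranking. That is one way to read the trade-off between tasks that the abstract attributes to differences in the computed block influences.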

Code & Models

Repositories

shoaibahmed/llm_depth_pruning
PyTorch · Official
