FinerCut: Finer-grained Interpretable Layer Pruning for Large Language Models

Yang Zhang; Yawei Li; Xinpeng Wang; Qianli Shen; Barbara Plank; Bernd Bischl; Mina Rezaei; Kenji Kawaguchi

arXiv:2405.18218 · cs.LG · October 22, 2024

Open Access

TL;DR

FinerCut introduces a fine-grained, interpretable layer pruning method for large language models that retains high performance while significantly reducing model size without fine-tuning.

Contribution

FinerCut treats each self-attention layer and each FFN layer as an individual pruning candidate, rather than pruning whole transformer blocks, which enables effective model compression and makes the pruning decisions interpretable.

Findings

01. Retains 90% of Llama3-8B performance with 25% layers removed

02. Removes 42% of self-attention layers in Llama3-70B while preserving 99% performance

03. Provides insights into layer pruning patterns and behaviors

Abstract

Overparametrized transformer networks are the state-of-the-art architecture for Large Language Models (LLMs). However, such models contain billions of parameters, making large compute a necessity while raising environmental concerns. To address these issues, we propose FinerCut, a new form of fine-grained layer pruning, which, in contrast to prior work at the transformer block level, considers all self-attention and feed-forward network (FFN) layers within blocks as individual pruning candidates. FinerCut prunes layers whose removal causes minimal alteration to the model's output -- contributing to a new, lean, interpretable, and task-agnostic pruning method. Tested across 9 benchmarks, our approach retains 90% performance of Llama3-8B with 25% layers removed, and 95% performance of Llama3-70B with 30% layers removed, all without fine-tuning or post-pruning reconstruction. Strikingly,…
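The selection rule described in the abstract (remove the layer whose removal changes the model's output least) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy residual "model", the greedy one-layer-at-a-time loop, the Euclidean output distance, and the names `forward`, `output_distance`, `greedy_prune`, and `make_scale_layer` are all hypothetical. Each list entry stands in for one self-attention or FFN layer, and a pruned layer is simply skipped so that the residual connection passes the hidden state through unchanged.

```python
import math
from typing import Callable, List, Sequence, Set

Vector = List[float]
Layer = Callable[[Vector], Vector]  # maps a hidden state to a residual update


def forward(layers: Sequence[Layer], x: Vector, skipped: Set[int]) -> Vector:
    """Run a toy residual stack, treating skipped layers as identity."""
    h = list(x)
    for i, layer in enumerate(layers):
        if i in skipped:
            continue  # the residual connection carries h through unchanged
        update = layer(h)
        h = [a + b for a, b in zip(h, update)]
    return h


def output_distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two outputs (an assumed metric)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def greedy_prune(layers: Sequence[Layer],
                 calib_inputs: Sequence[Vector],
                 num_to_remove: int) -> List[int]:
    """Greedily pick layers whose removal least alters the original output."""
    # Outputs of the unpruned model serve as the reference.
    reference = [forward(layers, x, set()) for x in calib_inputs]
    removed: Set[int] = set()
    order: List[int] = []
    for _ in range(min(num_to_remove, len(layers))):
        best_idx, best_cost = -1, math.inf
        for i in range(len(layers)):
            if i in removed:
                continue
            trial = removed | {i}
            # Mean distance to the reference over the calibration inputs.
            cost = sum(
                output_distance(forward(layers, x, trial), ref)
                for x, ref in zip(calib_inputs, reference)
            ) / len(calib_inputs)
            if cost < best_cost:
                best_idx, best_cost = i, cost
        removed.add(best_idx)
        order.append(best_idx)
    return order


def make_scale_layer(scale: float) -> Layer:
    """Hypothetical toy layer whose residual update is scale * h."""
    return lambda h: [scale * v for v in h]


# Toy stack: think of indices as attn_0, ffn_0, attn_1, ffn_1.
toy_layers = [make_scale_layer(s) for s in (0.5, 0.001, 0.3, 0.0)]
calibration = [[1.0, -2.0, 0.5], [0.3, 0.3, 0.3]]
pruned_order = greedy_prune(toy_layers, calibration, num_to_remove=2)
print("layers removed, in order:", pruned_order)
```

In this toy setting the layers with near-zero updates are selected first, because skipping them barely moves the output. A real model would need a calibration set of token sequences and a distance defined on the model's actual outputs, and the paper's own metric and search procedure may differ from this sketch.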

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: Pruning