Change Is the Only Constant: Dynamic LLM Slicing based on Layer Redundancy
Razvan-Gabriel Dumitru, Paul-Ioan Clotan, Vikas Yadav, Darius Peteleaza, Mihai Surdeanu

TL;DR
This paper presents a dynamic layer-specific pruning method for Large Language Models that uses a new Layer Redundancy score to improve efficiency and performance over traditional static slicing techniques.
Contribution
It introduces a novel dynamic slicing approach based on Layer Redundancy scores, advancing model compression for LLMs beyond existing static methods like SliceGPT.
Findings
Performance improved by up to 5% over baseline.
Perplexity decreased by up to 7%.
Method maintained or enhanced model accuracy.
Abstract
This paper introduces a novel model compression approach through dynamic layer-specific pruning in Large Language Models (LLMs), enhancing the traditional methodology established by SliceGPT. By transitioning from constant to dynamic slicing, our method leverages the newly proposed Layer Redundancy (LR) score, which assesses how much each layer changes its input by measuring the cosine similarity between the layer's input and its output. We use this score to prune parts of individual layers based on their redundancy, in such a way that the average pruned percentage across all layers is a fixed value. We conducted extensive experiments using models like Llama3-8B and Mistral-7B on multiple datasets, evaluating different slicing bases and percentages to determine optimal configurations that balance efficiency and performance. Our findings show that our dynamic slicing approach not only…
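The idea described in the abstract can be illustrated with a minimal sketch. It assumes that the LR score of a layer is the mean cosine similarity between the layer's input and output token vectors, and that per-layer slicing fractions are made proportional to LR and rescaled so their mean equals the target. The proportional rule, the function names `layer_redundancy` and `allocate_slicing`, and the example scores are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def layer_redundancy(x_in, x_out, eps=1e-8):
    """Mean cosine similarity between a layer's input and output.

    x_in, x_out: arrays of shape (tokens, hidden_dim).
    A value near 1 means the layer barely changes its input direction,
    i.e. the layer is more redundant.
    """
    dot = np.sum(x_in * x_out, axis=-1)
    norms = np.linalg.norm(x_in, axis=-1) * np.linalg.norm(x_out, axis=-1)
    return float(np.mean(dot / (norms + eps)))


def allocate_slicing(lr_scores, target_avg):
    """Hypothetical allocation rule (an assumption, not from the paper):
    slice each layer in proportion to its LR score, rescaled so that the
    mean slicing fraction over all layers equals target_avg.
    """
    scores = np.clip(np.asarray(lr_scores, dtype=float), 0.0, None)
    if scores.sum() == 0.0:
        # No redundancy signal: fall back to constant slicing.
        return np.full_like(scores, target_avg)
    fractions = scores / scores.mean() * target_avg
    if np.any(fractions > 1.0):
        raise ValueError("target_avg too high for this score distribution")
    return fractions


# Example: three layers with made-up LR scores and a 30% average budget.
fractions = allocate_slicing([0.9, 0.6, 0.3], target_avg=0.3)
```

In this sketch the most redundant layer receives the largest slicing fraction, while the average over layers stays at the requested budget. Constant slicing corresponds to the special case where every layer has the same score.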
Taxonomy
Topics: Natural Language Processing Techniques · Network Packet Processing and Optimization
Methods: Pruning
