Streamlining Redundant Layers to Compress Large Language Models
Xiaodong Chen, Yuxuan Hu, Jing Zhang, Yanling Wang, Cuiping Li, Hong Chen

TL;DR
This paper presents LLM-Streamline, a layer pruning and replacement method for large language models that compresses a model while largely preserving its performance. It also introduces a new stability metric for evaluating compressed models.
Contribution
It introduces a novel layer pruning and replacement framework for LLMs, along with a stability metric to better evaluate compression impacts.
Findings
Outperforms previous pruning methods in accuracy and efficiency
Effectively reduces model size with minimal performance loss
Introduces a new stability metric for model evaluation
Abstract
This paper introduces LLM-Streamline, a pioneering work on layer pruning for large language models (LLMs). It is based on the observation that different layers have varying impacts on hidden states, enabling the identification of less important layers to be pruned. LLM-Streamline comprises two parts: layer pruning, which removes consecutive layers with the lowest importance based on target sparsity, and layer replacement, a novel module that trains a lightweight network to replace the pruned layers to mitigate performance loss. Additionally, a new metric called stability is proposed to address the limitations of the widely used accuracy metric in evaluating model compression. Experiments show that LLM-Streamline outperforms both previous and concurrent state-of-the-art pruning methods in terms of both performance and training efficiency. Our code is available at…
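The layer pruning step described in the abstract can be illustrated with a minimal sketch. It assumes that a block of consecutive layers is "less important" when the hidden states entering and leaving it have high cosine similarity (the reviews below mention cosine similarity as the redundancy measure). The function names, the toy data generator, and the `n_prune` parameter (the number of layers implied by the target sparsity) are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def cosine_similarity(a, b):
    """Mean token-wise cosine similarity between two (tokens, dim) arrays."""
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-12
    return float(np.mean(num / den))


def select_block_to_prune(hidden_states, n_prune):
    """Pick the start index of the most redundant block of n_prune layers.

    hidden_states[i] is the input to layer i and hidden_states[-1] is the
    output of the last layer, so there are len(hidden_states) - 1 layers.
    Assumption: a block is redundant when its input and output hidden
    states are highly similar.
    """
    num_layers = len(hidden_states) - 1
    if not 0 < n_prune <= num_layers:
        raise ValueError("n_prune must be between 1 and the number of layers")
    best_start, best_score = 0, -np.inf
    for start in range(num_layers - n_prune + 1):
        score = cosine_similarity(hidden_states[start],
                                  hidden_states[start + n_prune])
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


def make_toy_hidden_states(scales, tokens=8, dim=16, seed=0):
    """Hypothetical toy data: layer i adds Gaussian noise scaled by scales[i]."""
    rng = np.random.default_rng(seed)
    states = [rng.standard_normal((tokens, dim))]
    for scale in scales:
        states.append(states[-1] + scale * rng.standard_normal((tokens, dim)))
    return states


# Layers 2 and 3 barely change the hidden state, so they should be selected.
toy_states = make_toy_hidden_states([1.0, 1.0, 0.01, 0.01, 1.0, 1.0])
start, score = select_block_to_prune(toy_states, n_prune=2)
print(f"prune layers {start}..{start + 1}, similarity {score:.4f}")
```

In a real model the hidden states would be collected by running calibration text through the network, and the selected block would then be replaced by a lightweight network trained to map the block's input hidden states to its output hidden states. That training step is not shown here.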
Peer Reviews
Decision · ICLR 2025 Spotlight
This submission has two major innovations: (1) layer replacement and (2) a new metric named stability. The first is a very good contribution that mitigates the performance loss incurred by pruning alone.
I didn't find any weakness of this submission.
1. Originality: This paper combines layer pruning with lightweight network replacement in a novel approach for compressing LLMs. This method effectively maintains model performance even after significant pruning. 2. Significance: The proposed stability metric enhances LLM compression evaluation by addressing limitations in standard accuracy metrics, providing a potentially more reliable measure of retained model performance. 3. Technical Quality: The experiments are comprehensive, covering vario
1. Metric Justification: While cosine similarity is chosen as the primary metric for layer redundancy, additional justification for this choice over other metrics (e.g., perplexity, Euclidean distance) would be beneficial. 2. Comparison with Other Methods: While the authors mention alternative approaches, such as LoRA, they do not provide a detailed comparison. A more thorough discussion of how LLM-Streamline performs relative to other popular fine-tuning and compression techniques would give re
* This paper uses a lightweight network and training based on hidden states before and after compression to compensate for the loss caused by pruning, reducing the need for computing resources and leading to better precision recovery. * The continuous layer pruning used in this paper reduces the complexity of the compressed model and is easier to accelerate on hardware than unstructured pruning and other methods. * The paper provides an in-depth analysis of the limitations within traditional a
* The sparsity levels explored in the paper do not exceed 25%, leaving higher sparsity scenarios untested. In contrast, methods like LLMPruner evaluate and compare performance at a 50% sparsity level. This omission raises concerns about whether the proposed contiguous layer pruning approach would remain effective at higher sparsity levels. * The models selected for evaluation are all based on the LLaMA architecture, which limits the assessment of the proposed method’s generalizability. Testing
Taxonomy
Topics: Topic Modeling
Methods: Pruning
