Reassessing Layer Pruning in LLMs: New Insights and Methods
Yao Lu, Hao Cheng, Yujie Fang, Zeyu Wang, Jiaheng Wei, Dongwei Xu, Qi Xuan, Xiaoniu Yang, Zhaowei Zhu

TL;DR
This paper benchmarks layer pruning in large language models and finds that a simple recipe, pruning the last 25% of layers and then fine-tuning, can outperform several popular LLMs of similar size. It also offers practical guidelines and releases the resulting pruned models.
Contribution
It provides a comprehensive benchmark of layer pruning strategies in LLMs, demonstrating the effectiveness of a simple pruning approach over more complex methods.
Findings
Pruning the last 25% of layers followed by fine-tuning yields strong performance.
The simple pruning method outperforms several popular LLMs of similar size.
The study offers practical guidelines for effective layer pruning in LLMs.
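As a rough illustration of the recipe summarized above (drop the last 25% of layers, then fine-tune), the sketch below computes which block indices would be kept, dropped, and unfrozen. It is a minimal planning helper, not the authors' code: the names `PrunePlan` and `plan_reverse_order_pruning` are hypothetical, rounding the dropped-layer count down is an assumption, and the choice to unfreeze the LM head plus the last few surviving blocks follows the description in the peer reviews rather than anything confirmed by the abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PrunePlan:
    kept_layers: List[int]       # indices of transformer blocks that remain
    dropped_layers: List[int]    # indices removed from the end of the stack
    trainable_layers: List[int]  # surviving blocks left unfrozen for fine-tuning
    train_lm_head: bool          # whether the output head is fine-tuned as well


def plan_reverse_order_pruning(num_layers: int,
                               prune_ratio: float = 0.25,
                               trainable_last_k: int = 3) -> PrunePlan:
    """Drop the last `prune_ratio` fraction of blocks and unfreeze the last k survivors."""
    if num_layers <= 0:
        raise ValueError("num_layers must be positive")
    if not 0.0 <= prune_ratio < 1.0:
        raise ValueError("prune_ratio must be in [0, 1)")

    n_drop = int(num_layers * prune_ratio)  # assumption: round down
    n_keep = num_layers - n_drop

    kept = list(range(n_keep))
    dropped = list(range(n_keep, num_layers))

    # Clamp k so we never try to unfreeze more blocks than survive.
    k = max(0, min(trainable_last_k, n_keep))
    trainable = kept[n_keep - k:]

    return PrunePlan(kept, dropped, trainable, train_lm_head=True)


plan = plan_reverse_order_pruning(num_layers=32)
print(plan.dropped_layers)    # [24, 25, 26, 27, 28, 29, 30, 31]
print(plan.trainable_layers)  # [21, 22, 23]
```

In a real framework, the plan would be applied by truncating the model's block list to `kept_layers` and freezing every parameter outside `trainable_layers` and the LM head. Those steps depend on the model class and are left out here.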
Abstract
Although large language models (LLMs) have achieved remarkable success across various domains, their considerable scale necessitates substantial computational resources, posing significant challenges for deployment in resource-constrained environments. Layer pruning, as a simple yet effective compression method, removes layers of a model directly, reducing computational overhead. However, what are the best practices for layer pruning in LLMs? Are sophisticated layer selection metrics truly effective? Does the LoRA (Low-Rank Adaptation) family, widely regarded as a leading method for pruned model fine-tuning, truly meet expectations when applied to post-pruning fine-tuning? To answer these questions, we dedicate thousands of GPU hours to benchmarking layer pruning in LLMs and gaining insights across multiple dimensions. Our results demonstrate that a simple approach, i.e., pruning the…
Peer Reviews
Decision·ICLR 2026 Poster
Comprehensive and reproducible experimental design. Honest ablations revealing when complexity adds no value. Simple, clearly-defined recipe that practitioners can reproduce in hours. Really primarily illustrates a weakness in all the other papers on layer pruning: they ought to have used final layer pruning as the obvious control experiment and have failed to do so. Providing this missing baseline is probably important within the narrow domain of layer pruning. Experimentally verifies a fac
Scope: confined to layer pruning; ignores dominant GPU-friendly methods (structured width pruning, 2:4 sparsity, quantization). Novelty: theoretical component re-derives known results; empirical finding is mainly that others’ metrics fail. Practical relevance: minimal for most users in practice. For people training from scratch, incremental deepening is probably preferable. For people trying to squeeze a large model into a slightly smaller GPU, quantization and GPU-friendly sparsity are probab
- Clear recipe: prune layers in reverse order and fine-tune only the LM head alongside the last 1–3 layers.
- Reasonable empirical backing: tested on several LLaMA-3 and Qwen-style models at several pruning ratios and on several standard benchmarks, and it still works at 70B scale.
- Practical impact: simple post-pruning fine-tuning outperforms the common "prune + LoRA" setup.
- Plausible architectural explanation: the Pre-LN gradient-flow analysis motivates why late layers are safer to drop.
- They don’t evaluate on generation or reasoning datasets (e.g. GSM8K), so the conclusions are validated only on specific LM-harness-style multiple-choice tasks.
- Prior work shows that layer importance depends on the nature of the task. Without generation tasks, the paper assumes task-invariance of the "prune-from-the-top" rule. Later layers tend to be more critical for perplexity, so pruning them first might hurt exactly the tasks they didn’t test.
- As a result, the current recipe is a strong
1. Comprehensive experimental design covering diverse pruning metrics, fine-tuning methods, and models.
2. The proposed "backward pruning + partial layer fine-tuning" strategy is simple yet effective.
3. Theoretical analysis using gradient flow provides a rationale for the method's efficacy.
4. Achieves significant performance gains across multiple models, outperforming other methods.
1. Inconsistent calibration datasets and data volumes were used for different pruning metrics, which could affect experimental fairness.
2. The performance of the pre-pruned models should be included in the results tables.
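The Pre-LN gradient-flow argument that the reviews mention can be illustrated with the standard residual identity below. This is a textbook sketch and not the paper's own derivation. The notation is assumed here: $x_l$ is the input to block $l$, $F_l$ is the block's attention or MLP transformation, blocks are indexed $0, \dots, L-1$, and $\mathcal{J}$ is the training loss.

```latex
% Pre-LN residual block (assumed notation)
x_{l+1} = x_l + F_l\big(\mathrm{LN}(x_l)\big)

% Unrolling from block l through the last block L-1
x_L = x_l + \sum_{k=l}^{L-1} F_k\big(\mathrm{LN}(x_k)\big)

% Chain rule: the identity term gives every block a direct gradient path
\frac{\partial \mathcal{J}}{\partial x_l}
  = \frac{\partial \mathcal{J}}{\partial x_L}
    \left( I + \sum_{k=l}^{L-1}
      \frac{\partial F_k\big(\mathrm{LN}(x_k)\big)}{\partial x_l} \right)
```

Under this view, deleting block $m$ removes one additive term and changes the inputs of only the blocks after $m$. Removing the final blocks therefore leaves every earlier activation unchanged, which is one way to read the claim that late layers are safer to drop.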
Taxonomy
Topics: Dispute Resolution and Class Actions · Scheduling and Optimization Algorithms
Methods: Pruning
