E$^3$-Pruner: Towards Efficient, Economical, and Effective Layer Pruning for Large Language Models
Tao Yuan, Haoli Bai, Yinfei Pan, Xuyang Cao, Tianyu Zhang, Lu Hou, Ting Hu, Xianzhi Yu

TL;DR
E$^3$-Pruner introduces a novel layer pruning framework for large language models that balances performance, training cost, and inference efficiency through differentiable mask optimization and entropy-aware knowledge distillation.
Contribution
It presents a task-effective, training-economical, and inference-efficient layer pruning method with innovative mask search and knowledge distillation strategies.
Findings
Achieves 96% accuracy, only a 0.8% performance drop, after pruning 25% of the layers.
Outperforms state-of-the-art methods on diverse benchmarks.
Provides a 1.33× inference speedup with minimal training data usage.
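As a rough sanity check on the last finding, the 1.33× figure matches what one would expect if inference latency scaled linearly with the number of transformer layers and 25% of them were removed. This is only a back-of-the-envelope estimate: the symbol $L$ (the original layer count) and the linear-latency assumption are introduced here for illustration and do not come from the paper, and the estimate ignores the embedding and output layers.

```latex
\text{ideal speedup} \;=\; \frac{L}{L - 0.25\,L} \;=\; \frac{1}{0.75} \;\approx\; 1.33
```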
Abstract
With the increasing size of large language models, layer pruning has gained increased attention as a hardware-friendly approach for model compression. However, existing layer pruning methods struggle to simultaneously address key practical deployment challenges, including performance degradation, high training costs, and limited acceleration. To overcome these limitations, we propose E$^3$-Pruner, a task-Effective, training-Economical, and inference-Efficient layer pruning framework. E$^3$-Pruner introduces two key innovations: (1) a differentiable mask optimization method using a Gumbel-TopK sampler, enabling efficient and precise pruning mask search; and (2) an entropy-aware adaptive knowledge distillation strategy that enhances task performance. Extensive experiments over diverse model architectures and benchmarks demonstrate the superiority of our method…
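To make the first innovation more concrete, here is a minimal sketch of the generic Gumbel-top-k trick applied to layer selection: each layer gets a learnable score, independent Gumbel noise is added, and the k highest perturbed scores are kept. The function name `gumbel_topk_mask`, the uniform initial scores, and the 32-layer example are assumptions for illustration, not the authors' implementation. The sketch covers only the forward sampling step; the abstract does not say how gradients are passed through the mask (for example, a relaxation or a straight-through estimator), so that part is left out.

```python
import math
import random


def gumbel_topk_mask(logits, k, rng):
    """Sample a binary keep-mask with exactly k ones using Gumbel-top-k.

    logits: per-layer scores (hypothetical learnable parameters).
    k:      number of layers to keep.
    rng:    a random.Random instance, passed in for reproducibility.
    """
    perturbed = []
    for logit in logits:
        # Clamp the uniform sample away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        perturbed.append(logit + gumbel_noise)

    # Indices of the k largest perturbed scores are the layers that are kept.
    order = sorted(range(len(logits)), key=lambda i: perturbed[i], reverse=True)
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(len(logits))]


# Example: keep 24 of 32 layers, i.e. prune 25% of them.
rng = random.Random(0)
layer_logits = [0.0] * 32  # uniform scores, so every layer is equally likely
mask = gumbel_topk_mask(layer_logits, k=24, rng=rng)
print(sum(mask), len(mask))  # 24 32
```

Because exactly k entries are set to one on every draw, the pruning ratio is fixed by construction, and higher-scoring layers are more likely to survive as their scores separate from the rest.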
Peer Reviews
Decision · Submitted to ICLR 2026
1. The problem statement and challenge are clear and significant. Section 2.2 provides a detailed analysis of why existing training-free, differentiable, and NAS-based approaches each fall short (e.g., accuracy drop, high token budgets, irregular speedups), setting up a precise target for improvement. 2. The paper is fairly well written. The narrative is well-structured and easy to follow: motivation, formulation, differentiable mask search, and recovery via adaptive KD. Figures and algorithms (Fig.
1. The behavioral consistency metric is weakly defined. The paper measures consistency as average accuracy on a small mixed set rather than teacher–student agreement (e.g., output match rate, output distributions, or log-prob correlations). This undermines the claim that KD better preserves behavior. 2. Storage/IO for offline KD is unquantified. The method relies on offline Top-K logits and asserts “minor storage,” using Top-10 in all configs, but provides no concrete footprint or bandwidth num
1. This paper combines a differentiable Gumbel-TopK sampler for efficient and accurate pruning mask search with an entropy-aware adaptive knowledge distillation strategy for enhanced knowledge transfer at reduced computational cost. 2. This paper runs extensive experiments across diverse LLMs of different sizes and architectures and evaluates on multiple benchmarks, demonstrating the generalization and practicality of the proposed E$^3$-Pruner framework.
1. **Limited novelty:** Gumbel-TopK sampling for pruning has been explored in prior works [1-2] for model compression, and the progressive layer pruning strategy is a common approach (e.g., SLEB [3]) with no significant innovation here. 2. **Incomplete baselines:** Heuristic layer pruning methods like SLEB [3] and Shortened LLaMA [4] should be added as baselines to enable more comprehensive comparison and better highlight the proposed method’s advantages. 3. **Unfair comparison design:** The pap
1. This paper proposes a differentiable mask learning framework (Gumbel-TopK) for layer pruning, enabling efficient gradient-based layer selection. 2. The method achieves good pruned-model performance, outperforming prior pruning methods. 3. The paper further introduces entropy-aware adaptive knowledge distillation, effectively preserving key reasoning tokens. 4. The experiments demonstrate consistent and superior results across multiple LLMs with minimal accuracy loss.
1. The paper provides only a limited theoretical explanation of why the Gumbel-TopK mask search is able to identify the optimal layers. 2. The paper does not clarify whether the performance gain comes from the layer pruning method or from the adaptive knowledge distillation. It would be better to compare the zero-shot performance of the pruned model without fine-tuning, or to apply adaptive KD to the baseline pruning methods, in order to evaluate their relative effects.
Taxonomy
Topics: Advanced Neural Network Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
