High-Fidelity Pruning for Large Language Models

Yijun Zhu; Jianxin Wang; Chengchao Shen

arXiv:2603.08083·cs.CL·March 10, 2026

Open Access 3 Reviews

TL;DR

This paper introduces a novel entropy-based importance criterion for pruning large language models, improving their efficiency while maintaining high predictive fidelity without additional teacher models.

Contribution

It proposes a simple, effective entropy-based importance measure for Taylor pruning that preserves model performance without extra computational overhead.

Findings

01

Outperforms existing pruning methods on LLaMA and Qwen models

02

Maintains high zero-shot performance after pruning

03

Reduces computational cost compared to self-distillation approaches

Abstract

Large Language Models (LLMs) have demonstrated exceptional performance across a wide range of tasks, yet their significant computational and memory requirements present major challenges for deployment. A common pruning approach uses a Taylor expansion of the loss function to estimate neuron importance. However, because it relies on the one-hot cross-entropy loss, it narrowly assesses importance based only on the probability assigned to the single predicted next token, thereby ignoring the other potential predictions of the original model. An intuitive solution is to employ a self-distillation criterion for importance evaluation. However, this approach introduces significant computational overhead by requiring a separate teacher model for supervision. To this end, we propose a simple but effective criterion, information entropy of the model's output distribution, to…
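The abstract's core idea, scoring neurons with a first-order Taylor term taken on the entropy of the output distribution rather than on a one-hot cross-entropy loss, can be illustrated on a toy network. The sketch below is not the paper's implementation: the two-layer ReLU network, the activation-level score |h_k · ∂H/∂h_k|, and all names (`neuron_importance`, `w_in`, `w_out`) are assumptions chosen for illustration. It relies only on the standard identity ∂H/∂z_j = -p_j (log p_j + H) for H = -Σ p_j log p_j with p = softmax(z).

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy H = -sum p log p (natural log).
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_grad_wrt_logits(probs):
    # Gradient of H with respect to the logits: dH/dz_j = -p_j * (log p_j + H).
    h = entropy(probs)
    return [-p * (math.log(p) + h) if p > 0.0 else 0.0 for p in probs]

def neuron_importance(x, w_in, w_out):
    # Hypothetical toy model: hidden = relu(w_in @ x), logits = w_out @ hidden.
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    g_logits = entropy_grad_wrt_logits(softmax(logits))
    scores = []
    for k, h_k in enumerate(hidden):
        # Chain rule: dH/dh_k = sum_j dH/dz_j * w_out[j][k].
        g_h = sum(g_logits[j] * w_out[j][k] for j in range(len(w_out)))
        # First-order Taylor estimate of the entropy change if neuron k is removed.
        scores.append(abs(h_k * g_h))
    return scores

# Made-up weights; hidden neuron 1 has zero input weights and is therefore inactive.
x = [1.0, -0.5]
w_in = [[0.8, 0.1], [0.0, 0.0], [0.3, -0.9]]
w_out = [[1.2, 0.5, -0.4], [-0.7, 0.3, 0.2], [0.1, -0.9, 0.6]]
scores = neuron_importance(x, w_in, w_out)
```

No label and no teacher model appear anywhere in the computation: the score depends only on the model's own output distribution, which is the property the abstract emphasizes. In this toy setup the inactive neuron receives a score of zero and would be pruned first.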

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 5

Strengths

The proposed idea is simple, straightforward, and aligns well with intuition. From the engineering perspective, it is also very easy to implement. Extensive experiments were conducted on multiple LLM models, across diverse benchmarks, comparing the proposed method with several previous methods. The results demonstrate significant improvements in both performance and efficiency of the proposed method.

Weaknesses

**Explanation on the Choice of Baselines**: As discussed in the Related Work section, LLM pruning is a highly focused research area with a large body of work, which can be categorized into different approaches. However, in the experimental section, only a few methods such as LLM-pruner, LoRAPrune, and LoRAP are compared, and the rationale behind choosing these baselines is not explained. It remains unclear whether the state-of-the-art methods from each category are all covered by the baselines u

Reviewer 02 · Rating 4 · Confidence 5

Strengths

The method avoids computational overhead of teacher models and resolves gradient initialization issues in self-distillation approaches, showing 3x speedup over SDMPrune with 31% less memory usage. Demonstrates improvements across multiple model families (LLaMA, Qwen) and sparsity levels, with some configurations even exceeding dense model performance after fine-tuning. The approach is straightforward to implement, requiring only standard forward-backward passes without custom kernels or auxili

Weaknesses

The paper fundamentally lacks theoretical justification for why entropy-based importance should preserve model performance. This is not a minor omission; in my view, it is a central issue that undermines the contribution's scientific rigor.

**Limited Evaluation Scope**:
- Exclusively focuses on zero-shot QA/classification tasks
- No evaluation on reasoning, long-form generation, or conversational capabilities
- Largest model tested is only 7B parameters
- Limited architectural diversity beyond

Reviewer 03 · Rating 4 · Confidence 2

Strengths

- Provides a label-free, holistic signal for neuron importance estimation.
- HFPrune consistently outperforms strong baselines (LLM-Pruner, LoRAPrune, SDMPrune) on the LLaMA and Qwen families.
- Comprehensive ablation studies validate the entropy criterion's role in preserving output distributions.
- The algorithmic description is clear and reproducible. Implementation details are systematically reported.

Weaknesses

- The pruning ratio $\rho_{mlp}$ is fixed across all MLP layers even though entropy may vary per layer; this could limit the functionality of HFPrune.
- Lacks a comparative discussion or empirical correlation analysis between entropy-based and Fisher-based importance scores.
- Training-time FLOPs for fine-tuning (post-pruning recovery) are omitted.
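The correlation analysis requested in the second point could be run as a rank correlation between two sets of per-neuron scores. The sketch below is one possible way to do it, not something reported in the paper: the choice of Spearman's coefficient, the helper names, and the score values (which are made up) are all assumptions. The closed-form formula used here is valid only when neither list contains ties.

```python
def ranks(values):
    # Rank of each element (0 = smallest); assumes all values are distinct.
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order):
        r[idx] = rank
    return r

def spearman(a, b):
    # Spearman rank correlation for tie-free data:
    # rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), d = rank difference per item.
    n = len(a)
    ra, rb = ranks(a), ranks(b)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Hypothetical per-neuron scores, invented purely for illustration.
entropy_scores = [0.42, 0.10, 0.35, 0.05]
fisher_scores = [0.30, 0.12, 0.40, 0.02]
rho = spearman(entropy_scores, fisher_scores)
```

A coefficient near 1 would suggest the two criteria rank neurons almost identically, while a low value would indicate that the entropy criterion selects a genuinely different set of neurons to prune.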

Taxonomy

Topics: Topic Modeling · Advanced Neural Network Applications · Adversarial Robustness in Machine Learning