Towards Efficient Automatic Self-Pruning of Large Language Models
Weizhong Huang, Yuxin Zhang, Xiawu Zheng, Fei Chao, Rongrong Ji

TL;DR
This paper introduces Self-Pruner, an autonomous framework that uses LLMs to optimize layer-wise pruning rates for large language models, significantly reducing size and computational cost with minimal accuracy loss.
Contribution
The paper presents a novel self-pruning method where LLMs autonomously perform evolutionary search to determine optimal pruning configurations without retraining.
Findings
Pruned LLaMA-2-70B to 49B with only 0.80% accuracy drop.
Achieved 1.39x speedup on NVIDIA A100 GPU.
Further pruned to 35B with 3.80% accuracy decrease and 1.70x speedup.
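As a quick check on the scale of these results, the reported parameter counts can be turned into overall pruning fractions. This is a minimal arithmetic sketch using only the sizes listed above; the variable and function names are illustrative, not from the paper.

```python
# Parameter counts in billions, taken from the Findings above.
original_b = 70.0
pruned_sizes_b = [49.0, 35.0]


def pruned_fraction(original: float, remaining: float) -> float:
    """Return the fraction of parameters removed by pruning."""
    return (original - remaining) / original


fractions = [pruned_fraction(original_b, size) for size in pruned_sizes_b]

for size, frac in zip(pruned_sizes_b, fractions):
    print(f"{original_b:.0f}B -> {size:.0f}B removes {frac:.0%} of parameters")
```

The two configurations correspond to removing 30% and 50% of the parameters, respectively.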
Abstract
Despite exceptional capabilities, Large Language Models (LLMs) still face deployment challenges due to their enormous size. Post-training structured pruning is a promising solution: it prunes LLMs without retraining, which reduces computational overhead, and it is hardware-friendly to deploy. However, the training-free nature of post-training structured pruning leads to significant performance degradation. We argue that the key to mitigating this issue lies in accurately determining the pruning rate for each layer. Meanwhile, we find that LLMs may have prior knowledge about their own redundancy. Based on this insight, we introduce an end-to-end automatic self-pruning framework for LLMs, which efficiently searches layer-wise pruning rates. Specifically, Self-Pruner leverages LLMs to autonomously execute the entire evolutionary search process to…
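The abstract describes an evolutionary search over layer-wise pruning rates. The following is a minimal sketch of that kind of search, not the authors' implementation. It rests on several assumptions: random crossover and Gaussian mutation stand in for the LLM-driven operators the paper describes, `toy_loss` is a made-up proxy rather than an evaluation of a pruned model, and every name and constant here (`project`, `evolve`, `sensitivity`, `TARGET_RATE`, the bounds) is hypothetical.

```python
import random

TARGET_RATE = 0.3   # assumed average pruning rate over all layers
LO, HI = 0.0, 0.9   # assumed per-layer bounds on the pruning rate


def clamp(x):
    return min(HI, max(LO, x))


def project(rates, target=TARGET_RATE):
    """Add one common offset to all rates (then clamp) so their mean equals target."""
    def mean_after(shift):
        return sum(clamp(r + shift) for r in rates) / len(rates)

    a, b = -2.0, 2.0          # offset range wide enough for rates near [0, 1]
    for _ in range(60):       # bisection: mean_after is monotone in the offset
        mid = (a + b) / 2
        if mean_after(mid) < target:
            a = mid
        else:
            b = mid
    return [clamp(r + b) for r in rates]


def toy_loss(rates, sensitivity):
    """Made-up proxy: pruning sensitive layers harder costs more. Lower is better."""
    return sum(s * r * r for s, r in zip(sensitivity, rates))


def evolve(sensitivity, generations=30, pop_size=20, seed=0):
    rng = random.Random(seed)
    n = len(sensitivity)
    # Random initial population; the paper instead has an LLM propose candidates.
    population = [project([rng.uniform(LO, HI) for _ in range(n)])
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda r: toy_loss(r, sensitivity))
        parents = population[: pop_size // 2]          # keep the better half
        children = []
        while len(parents) + len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(p1, p2)]  # uniform crossover
            child = [r + rng.gauss(0.0, 0.05) for r in child]   # Gaussian mutation
            children.append(project(child))            # restore the target mean
        population = parents + children
    return min(population, key=lambda r: toy_loss(r, sensitivity))


sensitivity = [8.0, 4.0, 2.0, 1.0, 1.0, 2.0, 4.0, 8.0]  # hypothetical values
best = evolve(sensitivity)
print([round(r, 3) for r in best])
```

Projecting every candidate back onto the target mean keeps the overall model size fixed, so the search only trades pruning between layers. In a real setting the proxy loss would be replaced by an actual evaluation of the pruned model.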
Peer Reviews
Decision·Submitted to ICLR 2025
- The method appears computationally efficient and manages to avoid the need for retraining, a positive step in terms of practical utility.
- The empirical results demonstrate measurable performance improvements in inference speed and memory reduction, which are beneficial for deployment.
- Use of LLM in Evolutionary Algorithms: The idea of involving LLMs in optimization processes is novel in application, although its implementation here lacks rigor.
- The use of an LLM for population initialization and mutation/crossover is not sufficiently novel, as it represents a straightforward adaptation rather than a novel technique. There is no detailed exploration of why an LLM is more suited to this than simpler initialization methods.
- The evolutionary process is standard, with no significant adaptations tailored for LLMs. There’s also minimal effort to explain why this process would be more effective or yield better performance gains than other
- This paper leverages LLMs to autonomously guide the pruning process using evolutionary algorithms.
- The experimental results show that Self-Pruner achieves considerable inference speedups (up to 1.7×) with minimal accuracy loss, outperforming state-of-the-art pruning methods like LLM-Pruner and Wanda-sp.
- This paper is easy to read and understand.
1. Unfair Comparison with OWL: The paper does not provide a fair comparison with OWL, a foundational work in sparsity distribution for large models. OWL should be included as a baseline in Table 1 and Table 2, not merely as part of a minor ablation study, given its significance in structured pruning research.
2. Lack of Comparison with Key Related Work: The paper omits comparisons with several recent structured pruning studies, all of which were published before ICLR's submission deadline. Thes
1. Self-Pruner reduces human effort by using LLMs to perform mutation and crossover operations.
1. Since evolutionary algorithms have been previously applied to CNNs and transformers for pruning, it would be beneficial for the authors to elaborate on the specific novelty and technical contributions of their approach in the context of pruning LLMs. How does this approach differ from existing methods? Additionally, it would be useful to highlight any unique aspects of using LLMs to execute the mutation/crossover in the evolutionary search process and how this might be innovative in the context o
Taxonomy
Topics: Speech and dialogue systems · Modular Robots and Swarm Intelligence · Natural Language Processing Techniques
Methods: Pruning
