Adaptive Pruning for Large Language Models with Structural Importance Awareness
Haotian Zheng, Jinke Ren, Yushan Sun, Ruichen Zhang, Wenbo Zhang, Zhen Li, Dusit Niyato, Shuguang Cui, Yatong Han

TL;DR
This paper introduces a structurally-aware adaptive pruning method for large language models that reduces computational costs while maintaining high performance, enabling more efficient deployment on resource-limited devices.
Contribution
The paper proposes a novel importance fusion metric and group fine-tuning strategy for adaptive pruning of LLMs, improving efficiency without sacrificing accuracy.
Findings
Achieves over 2% accuracy improvement on multiple LLMs.
Reduces token generation time by 5%.
Outperforms several state-of-the-art pruning methods.
Abstract
Recent advances in large language models (LLMs) have significantly improved language understanding and generation capabilities. However, LLMs are difficult to deploy on resource-constrained edge devices because of their high computational and storage demands. To address this issue, we propose a novel LLM pruning method, namely structurally-aware adaptive pruning (SAAP), to significantly reduce computational and memory costs while maintaining model performance. We first define an adaptive importance fusion metric to evaluate the importance of all coupled structures in LLMs by considering their homoscedastic uncertainty. Then, we rank the importance of all modules to determine the specific layers that should be pruned to meet particular performance requirements. Furthermore, we develop a new group fine-tuning strategy to improve the inference efficiency of LLMs.…
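The abstract does not give the exact form of the importance fusion metric, so the following is only a minimal sketch of the general idea: several per-structure importance scores are fused with uncertainty-style weights, and the lowest-ranked structures are selected for pruning. The metric names (`magnitude`, `gradient`), the inverse-variance weighting used here as a stand-in for homoscedastic-uncertainty weighting, and both function names are assumptions made for illustration, not the paper's method.

```python
from statistics import pvariance


def fuse_importance(metrics):
    """Fuse several per-structure importance scores into one score each.

    `metrics` maps a (hypothetical) metric name to a list of scores, one per
    coupled structure. Each metric is min-max normalized, then weighted by the
    inverse of its variance. This weighting is an assumed stand-in for an
    uncertainty-aware fusion, not the formula from the paper.
    """
    n = len(next(iter(metrics.values())))
    normalized, weights = {}, {}
    for name, scores in metrics.items():
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0  # avoid division by zero for constant metrics
        norm = [(s - lo) / span for s in scores]
        normalized[name] = norm
        weights[name] = 1.0 / (pvariance(norm) + 1e-8)
    total = sum(weights.values())
    return [
        sum(weights[m] / total * normalized[m][i] for m in metrics)
        for i in range(n)
    ]


def select_for_pruning(fused, ratio):
    """Return the indices of the least important structures.

    `ratio` is the fraction of structures to prune (rounded down).
    """
    k = int(len(fused) * ratio)
    order = sorted(range(len(fused)), key=lambda i: fused[i])
    return sorted(order[:k])


# Toy scores for four structures under two hypothetical metrics.
scores = {
    "magnitude": [0.1, 0.9, 0.5, 0.3],
    "gradient": [0.2, 0.8, 0.4, 0.1],
}
fused = fuse_importance(scores)
to_prune = select_for_pruning(fused, ratio=0.5)
print(to_prune)
```

In this toy example the structures at indices 0 and 3 score lowest under both metrics, so they are the ones selected at a 50% pruning ratio. A real pipeline would compute the scores from model weights and calibration data and would follow pruning with fine-tuning.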
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Pruning
