Adaptive Pruning for Large Language Models with Structural Importance Awareness

Haotian Zheng; Jinke Ren; Yushan Sun; Ruichen Zhang; Wenbo Zhang; Zhen Li; Dusit Niyato; Shuguang Cui; Yatong Han

arXiv:2412.15127·cs.CL·December 20, 2024

Open Access

TL;DR

This paper introduces a structurally-aware adaptive pruning method for large language models that reduces computational costs while maintaining high performance, enabling more efficient deployment on resource-limited devices.

Contribution

The paper proposes a novel importance fusion metric and group fine-tuning strategy for adaptive pruning of LLMs, improving efficiency without sacrificing accuracy.

Findings

1. Achieves over 2% accuracy improvement on multiple LLMs.
2. Reduces token generation time by 5%.
3. Outperforms several state-of-the-art pruning methods.

Abstract

Recent advances in large language models (LLMs) have significantly improved language understanding and generation capabilities. However, LLMs are difficult to deploy on resource-constrained edge devices because of their high computational and storage demands. To address this issue, we propose a novel LLM pruning method, namely structurally-aware adaptive pruning (SAAP), that significantly reduces computational and memory costs while maintaining model performance. We first define an adaptive importance fusion metric that evaluates the importance of all coupled structures in LLMs by considering their homoscedastic uncertainty. Then, we rank the importance of all modules to determine the specific layers that should be pruned to meet particular performance requirements. Furthermore, we develop a new group fine-tuning strategy to improve the inference efficiency of LLMs.…
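The fuse-then-rank idea in the abstract can be illustrated with a minimal sketch. This is not the paper's formula, which the abstract does not state. It assumes that each importance metric i comes with a log-variance s_i (hypothetical inputs here, typically learned in uncertainty-weighting schemes), that metrics are weighted in proportion to exp(-s_i), and that the lowest-scoring structures are the pruning candidates. The function names `fuse_importance` and `select_to_prune` are illustrative only.

```python
import numpy as np


def fuse_importance(metrics, log_vars):
    """Fuse several per-structure importance metrics into one score.

    metrics:  shape (num_metrics, num_structures), raw importance values.
    log_vars: shape (num_metrics,), assumed log-variance of each metric;
              a larger value means a less trusted metric (hypothetical input).
    """
    m = np.asarray(metrics, dtype=float)
    s = np.asarray(log_vars, dtype=float)

    # Min-max normalise each metric so the scales are comparable.
    lo = m.min(axis=1, keepdims=True)
    span = m.max(axis=1, keepdims=True) - lo
    norm = (m - lo) / np.where(span == 0, 1.0, span)

    # Assumed weighting: proportional to exp(-log_var), normalised to sum to 1.
    weights = np.exp(-s)
    weights = weights / weights.sum()

    return weights @ norm, weights


def select_to_prune(fused, ratio):
    """Return indices of the least important structures for a pruning ratio."""
    k = int(len(fused) * ratio)
    return np.argsort(fused, kind="stable")[:k]


if __name__ == "__main__":
    # Two made-up metrics over four coupled structures.
    metrics = [[1.0, 2.0, 3.0, 4.0],
               [4.0, 3.0, 2.0, 1.0]]
    fused, weights = fuse_importance(metrics, log_vars=[0.0, 2.0])
    print("weights:", weights)
    print("fused scores:", fused)
    print("prune candidates:", select_to_prune(fused, ratio=0.5))
```

In this toy setup the first metric has the smaller assumed variance, so it dominates the fused score and the structures it ranks lowest are selected first. A real pipeline would compute the metrics from model weights and gradients and would prune whole coupled groups rather than isolated indices.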

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Pruning