Lightweight and Post-Training Structured Pruning for On-Device Large Language Models
Zihuai Xu, Yang Xu, Hongli Xu, Yunming Liao, Zhiwei Yao, Zuan Xie

TL;DR
This paper presents COMP, a lightweight post-training structured pruning method for large language models that reduces resource demands on devices without fine-tuning, using hybrid pruning and a new importance metric.
Contribution
The paper introduces COMP, a novel pruning approach combining coarse and fine-grained pruning with mask tuning, suitable for on-device LLM deployment without fine-tuning.
Findings
Improves performance by 6.13% over LLM-Pruner on LLaMA-2-7B at a 20% pruning ratio
Reduces memory overhead by 80% relative to LLM-Pruner
Outperforms LLM-Pruner in efficiency and effectiveness
Abstract
Considering the hardware-friendly characteristics and broad applicability, structured pruning has emerged as an efficient solution to reduce the resource demands of large language models (LLMs) on resource-constrained devices. Traditional structured pruning methods often need fine-tuning to recover performance loss, which incurs high memory overhead and substantial data requirements, rendering them unsuitable for on-device applications. Additionally, post-training structured pruning techniques typically necessitate specific activation functions or architectural modifications, thereby limiting their scope of applications. Herein, we introduce COMP, a lightweight post-training structured pruning method that employs a hybrid-granularity pruning strategy. COMP initially prunes selected model layers based on their importance at a coarse granularity, followed by fine-grained neuron pruning…
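The two-stage idea in the abstract (drop whole layers first, then prune neurons inside the layers that remain) can be illustrated with a toy sketch. This is not the COMP implementation: the paper's importance metric and its mask tuning step are not described in the text above, so plain weight magnitude is used here as a stand-in, and the names `neuron_score`, `layer_score`, and `prune_hybrid` are hypothetical. The "model" is just a list of layers, each a list of per-neuron weight rows.

```python
# Toy sketch of hybrid-granularity structured pruning (NOT the COMP implementation).
# ASSUMPTION: importance is plain weight magnitude, a stand-in for the paper's metric.

def neuron_score(neuron):
    """Hypothetical neuron importance: L1 norm of the neuron's weights."""
    return sum(abs(w) for w in neuron)

def layer_score(layer):
    """Hypothetical layer importance: mean neuron score in the layer."""
    return sum(neuron_score(n) for n in layer) / len(layer)

def prune_hybrid(layers, n_layers_to_drop, neuron_keep_ratio):
    # Coarse step: drop the lowest-scoring layers, preserving the original order.
    ranked = sorted(range(len(layers)), key=lambda i: layer_score(layers[i]))
    dropped = set(ranked[:n_layers_to_drop])
    kept_layers = [layer for i, layer in enumerate(layers) if i not in dropped]

    # Fine step: within each surviving layer, keep only the top-scoring neurons.
    pruned = []
    for layer in kept_layers:
        n_keep = max(1, int(len(layer) * neuron_keep_ratio))
        top = sorted(range(len(layer)),
                     key=lambda j: neuron_score(layer[j]),
                     reverse=True)[:n_keep]
        pruned.append([layer[j] for j in sorted(top)])
    return pruned

# Three toy layers with four 2-weight neurons each.
toy_model = [
    [[0.9, -0.8], [0.1, 0.0], [0.5, 0.4], [0.0, 0.05]],
    [[0.01, 0.02], [0.0, 0.01], [0.02, 0.0], [0.01, 0.01]],
    [[1.0, 1.0], [0.2, -0.1], [0.0, 0.0], [0.7, 0.6]],
]
pruned_model = prune_hybrid(toy_model, n_layers_to_drop=1, neuron_keep_ratio=0.5)
```

In this toy run the middle layer has the smallest weights and is removed in the coarse step, and each remaining layer keeps its two largest-magnitude neurons. A real transformer would also need the dimensions of adjacent weight matrices kept consistent after neuron removal, which the sketch ignores.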
Taxonomy
Topics: Modular Robots and Swarm Intelligence · Robotics and Sensor-Based Localization
Methods: Pruning
