PAT: Pruning-Aware Tuning for Large Language Models
Yijiang Liu, Huanrui Yang, Youxin Chen, Rongyu Zhang, Miao Wang, Yuan Du, Li Du

TL;DR
This paper introduces Pruning-Aware Tuning (PAT), a method that combines structural pruning with fine-tuning of large language models to reduce redundancy while maintaining high performance, leading to faster and more efficient models.
Contribution
The paper proposes a novel PAT paradigm with Hybrid Sparsification Modules and Identity Loss to effectively integrate pruning into fine-tuning, improving efficiency without sacrificing accuracy.
Findings
Achieves 1.33× speedup with 25% pruning on Llama2-7b.
Outperforms LoRA fine-tuning by up to 1.26% in accuracy.
Demonstrates effectiveness across large language models.
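As a rough sanity check on the first finding, the snippet below computes the speedup one would expect under a deliberately simple assumption: that runtime scales linearly with the fraction of weights kept. That assumption and the helper name `ideal_speedup` are illustrative only and do not come from the paper, which reports a measured speedup.

```python
def ideal_speedup(pruning_ratio: float) -> float:
    """Idealized speedup, ASSUMING runtime is proportional to the weights kept.

    This is a back-of-the-envelope model, not the paper's measurement method.
    """
    if not 0.0 <= pruning_ratio < 1.0:
        raise ValueError("pruning_ratio must be in [0, 1)")
    return 1.0 / (1.0 - pruning_ratio)


# Removing 25% of the weights leaves 75%, so the idealized ratio is 1 / 0.75.
print(round(ideal_speedup(0.25), 2))  # 1.33
```

Under this simplification, 25% pruning lines up with the reported 1.33× figure. Real speedups also depend on hardware, memory bandwidth, and which dimensions are removed.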
Abstract
Large language models (LLMs) excel in language tasks, especially with supervised fine-tuning after pre-training. However, their substantial memory and computational requirements hinder practical applications. Structural pruning, which removes less significant weight dimensions, is one solution. Yet traditional post-hoc pruning often leads to significant performance loss, with limited recovery from further fine-tuning due to reduced capacity. Since fine-tuning refines the general and chaotic knowledge in pre-trained models, we aim to incorporate structural pruning into fine-tuning, and propose the Pruning-Aware Tuning (PAT) paradigm to eliminate model redundancy while preserving model performance to the maximum extent. Specifically, we insert the innovative Hybrid Sparsification Modules (HSMs) between the Attention and FFN components to accordingly sparsify the…
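To make "structural pruning" concrete, here is a minimal sketch of the traditional post-hoc variant that the abstract contrasts PAT with: whole hidden channels are dropped from a pair of linear layers based on weight magnitude. It is not the paper's HSM mechanism; the function name `prune_hidden_channels`, the L2-norm scoring rule, and the toy weights are all assumptions made for illustration.

```python
import math


def prune_hidden_channels(w_in, w_out, ratio):
    """Drop the lowest-norm hidden channels from two stacked linear layers.

    w_in:  hidden x d_in matrix  (each row produces one hidden channel)
    w_out: d_out x hidden matrix (each column consumes one hidden channel)
    ratio: fraction of hidden channels to remove

    Illustrative magnitude-based pruning, NOT the PAT method itself.
    """
    hidden = len(w_in)
    n_keep = hidden - int(hidden * ratio)
    # Score each hidden channel by the L2 norm of its incoming weights.
    scores = [math.sqrt(sum(v * v for v in row)) for row in w_in]
    ranked = sorted(range(hidden), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:n_keep])  # restore the original channel order
    # Remove the same channels on both sides so the shapes stay compatible.
    pruned_in = [w_in[i] for i in keep]
    pruned_out = [[row[i] for i in keep] for row in w_out]
    return pruned_in, pruned_out, keep


# Toy weights: 4 hidden channels, 2 inputs, 1 output (hypothetical values).
w_in = [[1.0, 0.0], [0.1, 0.1], [0.0, 2.0], [3.0, 1.0]]
w_out = [[1.0, 2.0, 3.0, 4.0]]
p_in, p_out, kept = prune_hidden_channels(w_in, w_out, 0.25)
print(kept)   # [0, 2, 3]
print(p_out)  # [[1.0, 3.0, 4.0]]
```

Because entire rows and columns are removed, the resulting matrices are smaller and dense, which is why structural pruning can speed up inference without sparse kernels.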
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Softmax · Attention Is All You Need · Pruning
