PAT: Pruning-Aware Tuning for Large Language Models

Yijiang Liu; Huanrui Yang; Youxin Chen; Rongyu Zhang; Miao Wang; Yuan Du; Li Du

arXiv:2408.14721·cs.LG·January 28, 2025

TL;DR

This paper introduces Pruning-Aware Tuning (PAT), a method that combines structural pruning with fine-tuning of large language models to reduce redundancy while maintaining high performance, leading to faster and more efficient models.

Contribution

The paper proposes a novel PAT paradigm with Hybrid Sparsification Modules and Identity Loss to effectively integrate pruning into fine-tuning, improving efficiency without sacrificing accuracy.

Findings

01  Achieves 1.33× speedup with 25% pruning on Llama2-7b.

02  Outperforms LoRA fine-tuning by up to 1.26% in accuracy.

03  Demonstrates effectiveness across large language models.

Abstract

Large language models (LLMs) excel in language tasks, especially with supervised fine-tuning after pre-training. However, their substantial memory and computational requirements hinder practical applications. Structural pruning, which reduces less significant weight dimensions, is one solution. Yet traditional post-hoc pruning often leads to significant performance loss, with limited recovery from further fine-tuning due to reduced capacity. Since fine-tuning refines the general and chaotic knowledge in pre-trained models, we aim to incorporate structural pruning into fine-tuning, and propose the Pruning-Aware Tuning (PAT) paradigm to eliminate model redundancy while preserving model performance to the maximum extent. Specifically, we insert the innovative Hybrid Sparsification Modules (HSMs) between the Attention and FFN components to accordingly sparsify the…
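To make "structural pruning" concrete, the following is a minimal sketch of post-hoc, magnitude-based channel pruning on a toy two-layer feed-forward block. It is not the PAT method and not the Hybrid Sparsification Module, whose details are truncated above. The function name `prune_hidden_channels`, the L2-norm scoring rule, and the toy weights are all assumptions made for illustration.

```python
import math

def prune_hidden_channels(w_in, w_out, ratio):
    """Drop the `ratio` fraction of hidden channels with the smallest L2 norm.

    Hypothetical helper for illustration only (generic magnitude pruning).
    w_in:  list of rows, shape (hidden, d_model), the first linear layer
    w_out: list of rows, shape (d_model, hidden), the second linear layer
    """
    hidden = len(w_in)
    n_keep = hidden - int(hidden * ratio)
    # Score each hidden channel by the L2 norm of its incoming weights.
    scores = [math.sqrt(sum(v * v for v in row)) for row in w_in]
    # Keep the highest-scoring channels, preserving their original order.
    ranked = sorted(range(hidden), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:n_keep])
    # Remove whole rows of w_in and the matching columns of w_out,
    # so both matrices shrink and stay shape-compatible.
    pruned_in = [w_in[i] for i in keep]
    pruned_out = [[row[i] for i in keep] for row in w_out]
    return pruned_in, pruned_out, keep

# Toy weights: 4 hidden channels, model width 2 (assumed values).
w_in = [[0.9, -0.8], [0.01, 0.02], [0.5, 0.4], [-0.7, 0.6]]
w_out = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
pruned_in, pruned_out, keep = prune_hidden_channels(w_in, w_out, 0.25)
```

With a 25% ratio, one of the four hidden channels is removed, and both weight matrices become physically smaller. This is what distinguishes structural pruning from unstructured sparsity, which only zeroes individual weights and leaves the matrix shapes unchanged.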

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Softmax · Attention Is All You Need · Pruning