APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference

Bowen Zhao; Hannaneh Hajishirzi; Qingqing Cao

arXiv:2401.12200 · cs.CL · June 5, 2024 · 2 cites

TL;DR

APT is a method that adaptively prunes and tunes large language models during training, significantly reducing training time and memory while maintaining high task performance.

Contribution

The paper introduces a dynamic approach that combines parameter-efficient tuning with structured pruning during fine-tuning, improving both the training efficiency and the inference efficiency of large language models.

Findings

1. Retains up to 98% of task performance with 40% of parameters remaining in RoBERTa and T5.
2. Retains 86.4% of performance with 70% of parameters remaining in LLaMA.
3. Speeds up fine-tuning by up to 8x and reduces the training memory footprint by up to 70%.

Abstract

Fine-tuning and inference with large language models (LMs) are generally known to be expensive. Parameter-efficient fine-tuning over pretrained LMs reduces training memory by updating a small number of LM parameters but does not improve inference efficiency. Structured pruning improves LM inference efficiency by removing consistent parameter blocks, yet often increases training memory and time. To improve both training and inference efficiency, we introduce APT, which adaptively prunes and tunes the parameters of LMs. At the early stage of fine-tuning, APT dynamically adds salient tuning parameters for fast and accurate convergence while discarding unimportant parameters for efficiency. Compared to baselines, our experiments show that APT maintains up to 98% of task performance when pruning RoBERTa and T5 models with 40% of parameters left, while keeping 86.4% of LLaMA models' performance with 70%…
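
The two ideas the abstract combines, discarding unimportant parameter blocks and adding tuning parameters, can be illustrated with a toy sketch. This is not the authors' implementation. The salience score (sum of |weight × gradient| per output row), the helper names `row_salience`, `prune_rows`, and `adapter_params`, and the low-rank adapter shape are all assumptions chosen for illustration, and the numbers are made up.

```python
from typing import List

Matrix = List[List[float]]


def row_salience(weights: Matrix, grads: Matrix) -> List[float]:
    """Score each output row by sum(|w * g|).

    Assumption: a first-order salience proxy, not necessarily the
    scoring function used in the paper.
    """
    return [
        sum(abs(w * g) for w, g in zip(w_row, g_row))
        for w_row, g_row in zip(weights, grads)
    ]


def prune_rows(weights: Matrix, grads: Matrix, keep_ratio: float) -> List[int]:
    """Return sorted indices of the rows kept by structured pruning."""
    scores = row_salience(weights, grads)
    n_keep = max(1, round(len(weights) * keep_ratio))
    ranked = sorted(range(len(weights)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


def adapter_params(n_in: int, n_out: int, rank: int) -> int:
    """Trainable parameters of a hypothetical low-rank adapter B @ A,
    with A of shape (rank, n_in) and B of shape (n_out, rank)."""
    return rank * (n_in + n_out)


# Toy layer: 5 output rows, 3 inputs (illustrative values only).
weights = [
    [0.9, -0.8, 0.7],
    [0.01, 0.02, -0.01],
    [0.5, 0.4, -0.6],
    [0.03, -0.02, 0.01],
    [-0.7, 0.6, 0.9],
]
grads = [[0.1, 0.1, 0.1] for _ in weights]

# Structured pruning: keep whole rows, here 40% of them.
kept = prune_rows(weights, grads, keep_ratio=0.4)
pruned = [weights[i] for i in kept]

# Growing the adapter rank adds tuning parameters on the smaller layer.
params_rank1 = adapter_params(n_in=3, n_out=len(kept), rank=1)
params_rank2 = adapter_params(n_in=3, n_out=len(kept), rank=2)

print("kept rows:", kept)
print("adapter params at rank 1 and 2:", params_rank1, params_rank2)
```

Removing whole rows rather than individual weights is what makes the pruning structured: the layer actually shrinks, so the inference savings do not depend on sparse kernels.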

Code & Models

Repositories

roim1998/apt (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Gated Linear Unit · Multi-Head Attention · Attention Is All You Need · Linear Warmup With Linear Decay · WordPiece · Adam · Weight Decay · BERT · Residual Connection · Dropout