APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference
Bowen Zhao, Hannaneh Hajishirzi, Qingqing Cao

TL;DR
APT adaptively prunes and tunes pretrained language models during fine-tuning, reducing training time and memory while preserving most of the task performance.
Contribution
APT introduces a dynamic approach that combines parameter-efficient tuning with structured pruning, improving both the training efficiency and the inference efficiency of pretrained language models.
Findings
Maintains up to 98% of task performance when RoBERTa and T5 are pruned to 40% of their parameters.
Keeps 86.4% of LLaMA models' performance with 70% of parameters remaining.
Speeds up fine-tuning by up to 8x and reduces the training memory footprint by up to 70%.
Abstract
Fine-tuning and inference with large Language Models (LM) are generally known to be expensive. Parameter-efficient fine-tuning over pretrained LMs reduces training memory by updating a small number of LM parameters but does not improve inference efficiency. Structured pruning improves LM inference efficiency by removing consistent parameter blocks, yet often increases training memory and time. To improve both training and inference efficiency, we introduce APT that adaptively prunes and tunes parameters for the LMs. At the early stage of fine-tuning, APT dynamically adds salient tuning parameters for fast and accurate convergence while discarding unimportant parameters for efficiency. Compared to baselines, our experiments show that APT maintains up to 98% task performance when pruning RoBERTa and T5 models with 40% parameters left while keeping 86.4% LLaMA models' performance with 70%…
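The two moves the abstract describes, discarding unimportant parameter blocks and adding tuning parameters where they matter most, can be illustrated with a toy sketch. This is not the authors' implementation: the function names `prune_blocks` and `grow_adapter_ranks`, the salience scores, and the per-layer adapter "rank" budget are all hypothetical and chosen only to make the idea concrete.

```python
from typing import Dict, List


def prune_blocks(salience: List[float], keep_ratio: float) -> List[bool]:
    """Return a keep-mask that retains the highest-salience blocks.

    `salience` holds one (hypothetical) importance score per structured
    block, e.g. an attention head or a feed-forward neuron group.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, round(len(salience) * keep_ratio))
    # Block indices ordered from most to least salient.
    ranked = sorted(range(len(salience)), key=lambda i: salience[i], reverse=True)
    kept = set(ranked[:n_keep])
    return [i in kept for i in range(len(salience))]


def grow_adapter_ranks(
    ranks: Dict[str, int],
    layer_salience: Dict[str, float],
    top_k: int,
    step: int,
) -> Dict[str, int]:
    """Add `step` tuning ranks to the `top_k` most salient layers.

    Assumption: tuning capacity is modelled as an integer rank per layer,
    as in low-rank adapters; the source abstract does not specify this.
    """
    top = set(sorted(layer_salience, key=layer_salience.get, reverse=True)[:top_k])
    return {name: r + (step if name in top else 0) for name, r in ranks.items()}


# Toy usage: keep 40% of five blocks, then grow the most salient layer's adapter.
mask = prune_blocks([0.9, 0.1, 0.5, 0.05, 0.7], keep_ratio=0.4)
new_ranks = grow_adapter_ranks(
    {"q": 8, "v": 8, "ffn": 8},
    {"q": 0.2, "v": 0.9, "ffn": 0.5},
    top_k=1,
    step=4,
)
```

In a real training loop both steps would presumably be repeated during the early stage of fine-tuning, with salience re-estimated from the model rather than supplied by hand.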
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Gated Linear Unit · Multi-Head Attention · Attention Is All You Need · Linear Warmup With Linear Decay · WordPiece · Adam · Weight Decay · BERT · Residual Connection · Dropout
