ComPEFT: Compression for Communicating Parameter Efficient Updates via Sparsification and Quantization
Prateek Yadav, Leshem Choshen, Colin Raffel, Mohit Bansal

TL;DR
ComPEFT compresses PEFT modules using sparsification and ternary quantization, significantly reducing their size while maintaining or improving performance, which makes expert models cheaper to communicate and deploy.
Contribution
The paper presents ComPEFT, a novel compression technique for PEFT models that does not require retraining and achieves high compression ratios with preserved or enhanced performance.
Findings
Achieves 8x-50x compression across various models.
Outperforms QLoRA with 26x smaller size on LLaMA.
Maintains few-shot generalization and improves with model scale.
Abstract
Parameter-efficient fine-tuning (PEFT) techniques make it possible to efficiently adapt a language model to create "expert" models that specialize to new tasks or domains. Recent techniques in model merging and compositional generalization leverage these expert models by dynamically composing modules to improve zero/few-shot generalization. Despite the efficiency of PEFT methods, the size of expert models can make it onerous to retrieve expert models per query over high-latency networks like the Internet or serve multiple experts on a single GPU. To address these issues, we present ComPEFT, a novel method for compressing fine-tuning residuals (task vectors) of PEFT based models. ComPEFT employs sparsification and ternary quantization to reduce the size of the PEFT module without performing any additional retraining while preserving or enhancing model performance. In extensive evaluation…
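The abstract describes compressing the fine-tuning residual (task vector) by sparsifying it and quantizing the surviving entries to ternary values, with no retraining. A minimal NumPy sketch of that idea follows. The abstract does not give the selection rule or the scaling rule, so the choices here are assumptions made for illustration: keeping the largest-magnitude entries, using a single shared magnitude equal to `alpha` times the task vector's standard deviation, and the names `compress_task_vector`, `decompress`, `density`, and `alpha`. It should not be read as the authors' implementation.

```python
import numpy as np

def compress_task_vector(finetuned, base, density=0.1, alpha=1.0):
    """Sparsify and ternarize a fine-tuning residual (task vector).

    Assumptions (not stated in the abstract): top-k magnitude pruning
    and one shared scale of alpha * std(task vector).
    """
    tau = finetuned - base                        # task vector
    flat = tau.ravel()
    k = max(1, int(round(density * flat.size)))   # number of entries to keep
    # Indices of the k largest-magnitude entries (order within the k is arbitrary).
    keep = np.argpartition(np.abs(flat), -k)[-k:]
    signs = np.zeros(flat.size, dtype=np.int8)    # ternary values: -1, 0, +1
    signs[keep] = np.sign(flat[keep]).astype(np.int8)
    scale = alpha * float(tau.std())              # shared magnitude (assumed rule)
    return signs.reshape(tau.shape), scale

def decompress(base, signs, scale):
    """Rebuild approximate fine-tuned weights from the compressed update."""
    return base + scale * signs
```

Under these assumptions, only the sign pattern and one scalar need to be stored or transmitted per module, which is where the size reduction would come from. In practice `density` and `alpha` would be chosen on held-out data.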
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
Methods: Gated Linear Unit · Multi-Head Attention · Attention Is All You Need · Linear Layer · Attention Dropout · Residual Connection · Inverse Square Root Schedule · Byte Pair Encoding · Layer Normalization
