Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models

Bowen Ping; Shuo Wang; Hanqing Wang; Xu Han; Yuzhuang Xu; Yukun Yan; Yun Chen; Baobao Chang; Zhiyuan Liu; Maosong Sun

arXiv:2406.08903 · cs.CL · November 27, 2024

TL;DR

Delta-CoMe introduces a training-free delta compression technique using mixed-precision quantization for large language models, maintaining performance while reducing costs and enabling efficient deployment across various models.

Contribution

The paper proposes a novel mixed-precision delta quantization method that preserves model performance better than existing low-rank and low-bit approaches for fine-tuned LLMs.

Findings

01. Outperforms low-rank and low-bit baselines in experiments

02. Maintains performance comparable to the uncompressed fine-tuned models

03. Compatible with multiple backbone LLMs, including Llama-2, Llama-3, and Mistral

Abstract

Fine-tuning is a crucial process for adapting large language models (LLMs) to diverse applications. In certain scenarios, such as multi-tenant serving, deploying multiple LLMs becomes necessary to meet complex demands. Recent studies suggest decomposing a fine-tuned LLM into a base model and corresponding delta weights, which are then compressed using low-rank or low-bit approaches to reduce costs. In this work, we observe that existing low-rank and low-bit compression methods can significantly harm the model performance for task-specific fine-tuned LLMs (e.g., WizardMath for math problems). Motivated by the long-tail distribution of singular values in the delta weights, we propose a delta quantization approach using mixed-precision. This method employs higher-bit representation for singular vectors corresponding to larger singular values. We evaluate our approach on various fine-tuned…
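
The idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`quantize_uniform`, `compress_delta`), the simulated min-max quantizer, the synthetic weights, and the bit allocations are all assumptions chosen for illustration. The sketch takes the delta between a fine-tuned and a base weight matrix, decomposes it with SVD, and quantizes the singular vectors in groups, giving more bits to the groups with larger singular values.

```python
import numpy as np


def quantize_uniform(x, bits):
    """Simulated min-max uniform quantization: round to 2**bits levels, then dequantize."""
    lo, hi = x.min(), x.max()
    if hi == lo:
        return x.copy()
    scale = (hi - lo) / (2 ** bits - 1)
    return np.round((x - lo) / scale) * scale + lo


def compress_delta(w_base, w_finetuned, bit_plan):
    """Return an approximation of (w_finetuned - w_base).

    bit_plan is a list of (num_singular_vectors, bits) pairs, applied in order
    of decreasing singular value. Vectors not covered by the plan are dropped.
    This layout is a hypothetical choice made for the sketch.
    """
    delta = w_finetuned - w_base
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    u_q = np.zeros_like(u)
    vt_q = np.zeros_like(vt)
    start = 0
    for count, bits in bit_plan:
        end = min(start + count, s.size)
        if end <= start:
            break
        u_q[:, start:end] = quantize_uniform(u[:, start:end], bits)
        vt_q[start:end, :] = quantize_uniform(vt[start:end, :], bits)
        start = end
    # Singular values are kept in full precision in this sketch.
    return (u_q * s) @ vt_q


def relative_error(approx, exact):
    """Relative Frobenius-norm error."""
    return np.linalg.norm(approx - exact) / np.linalg.norm(exact)


# Synthetic example: a delta whose singular values decay quickly (long tail).
rng = np.random.default_rng(0)
dim = 64
u0, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
v0, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
singular_values = 0.7 ** np.arange(dim)
true_delta = (u0 * singular_values) @ v0.T
w_base = rng.standard_normal((dim, dim))
w_finetuned = w_base + true_delta

# Two plans with the same budget of 96 (vectors x bits) per factor.
mixed_plan = [(4, 8), (8, 4), (16, 2)]
uniform_plan = [(32, 3)]

err_mixed = relative_error(compress_delta(w_base, w_finetuned, mixed_plan), true_delta)
err_uniform = relative_error(compress_delta(w_base, w_finetuned, uniform_plan), true_delta)
print(f"mixed-precision error: {err_mixed:.4f}")
print(f"uniform 3-bit error:   {err_uniform:.4f}")
```

Both plans spend the same number of bits on the singular vectors, so the comparison isolates the effect of where the bits go. On this synthetic long-tailed delta, concentrating precision on the leading singular vectors gives a lower reconstruction error. Results on real fine-tuned models depend on the actual singular-value distribution and the quantizer used.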

Code & Models

Repositories

thunlp/delta-come
PyTorch · Official

Videos

Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: Balanced Selection