One QuantLLM for ALL: Fine-tuning Quantized LLMs Once for Efficient Deployments

Ke Yi; Yuhui Xu; Heng Chang; Chen Tang; Yuan Meng; Tong Zhang; Jia Li

arXiv:2405.20202 · cs.AI · May 31, 2024

TL;DR

This paper introduces a once-for-all (OFA) framework for quantized large language models, in which a single training run yields subnets suited to diverse deployment scenarios. It does so by decoupling shared weights, using low-rank adapters, and balancing how training resources are allocated among subnets.

Contribution

It extends the OFA framework to large language models by decoupling shared weights, integrating low-rank adapters, and proposing a non-parametric scheduler for balanced training of quantized subnets.
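
The idea of decoupling shared weights and attaching low-rank adapters can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `DecoupledLoRALinear`, the helper `fake_quantize`, the symmetric per-tensor quantizer, the bit-widths, and the adapter rank are all assumptions chosen for illustration. The sketch assumes that "decoupling" means each quantization configuration gets its own adapter, so that training one subnet does not overwrite parameters used by another.

```python
import numpy as np


def fake_quantize(w, bits):
    """Symmetric uniform per-tensor quantize-then-dequantize (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for 4 bits, 1 for 2 bits
    scale = np.abs(w).max() / qmax      # assumes w is not all zeros
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale


class DecoupledLoRALinear:
    """Hypothetical linear layer: frozen quantized weights plus one
    low-rank adapter per bit-width, so subnets share no trainable state."""

    def __init__(self, w, bit_widths=(2, 3, 4), rank=4, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = w.shape
        # Frozen, quantized copy of the base weight for each bit-width.
        self.w_q = {b: fake_quantize(w, b) for b in bit_widths}
        # Separate adapter (A, B) per bit-width; B starts at zero so the
        # adapter initially contributes nothing (the usual LoRA init).
        self.lora = {
            b: (rng.normal(0.0, 0.01, size=(rank, in_dim)),
                np.zeros((out_dim, rank)))
            for b in bit_widths
        }

    def forward(self, x, bits):
        a, b_mat = self.lora[bits]
        # Quantized base path plus the low-rank correction for this subnet.
        return x @ self.w_q[bits].T + (x @ a.T) @ b_mat.T


rng = np.random.default_rng(1)
w = rng.normal(size=(8, 16))
x = rng.normal(size=(5, 16))
layer = DecoupledLoRALinear(w)
y4 = layer.forward(x, bits=4)   # output of the 4-bit subnet, shape (5, 8)
```

In this sketch, selecting a subnet at deployment time amounts to picking a bit-width and its matching adapter; how the paper's scheduler chooses which configuration to train at each step is not shown here.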

Findings

01 Maintains high performance across multiple scenarios.

02 Reduces deployment time significantly.

03 Allocates training resources effectively among subnets.

Abstract

Large Language Models (LLMs) have advanced rapidly but face significant memory demands. While quantization has shown promise for LLMs, current methods typically require lengthy training to alleviate the performance degradation from quantization loss. However, deploying LLMs across diverse scenarios with different resource constraints, e.g., servers and personal computers, requires repeated training per application, which amplifies the lengthy training problem. Given that, it is advantageous to train a once-for-all (OFA) supernet capable of yielding diverse optimal subnets for downstream applications through one-shot training. Nonetheless, the scale of current language models impedes efficiency and amplifies interference from weight sharing between subnets. We make an initial attempt to extend the once-for-all framework to large language models. Specifically, we decouple shared weights…

Taxonomy

Topics: Neural Networks and Reservoir Computing · Machine Learning and ELM · Semiconductor materials and devices