One QuantLLM for ALL: Fine-tuning Quantized LLMs Once for Efficient Deployments
Ke Yi, Yuhui Xu, Heng Chang, Chen Tang, Yuan Meng, Tong Zhang, Jia Li

TL;DR
This paper introduces a once-for-all (OFA) framework for quantized large language models, in which a single training run yields subnets suited to diverse deployment scenarios. It does so by decoupling shared weights, using low-rank adapters, and balancing how training resources are allocated among subnets.
Contribution
It extends the OFA framework to large language models by decoupling shared weights, integrating low-rank adapters, and proposing a non-parametric scheduler for balanced training of quantized subnets.
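To make the idea of decoupled weights with low-rank adapters concrete, here is a minimal sketch in pure Python. It is not the authors' implementation. It assumes symmetric uniform quantization, one bit-width per layer, and one independent low-rank adapter pair per bit-width on top of a frozen base weight. The names `fake_quantize`, `matvec`, and `DecoupledQuantLinear` are hypothetical and introduced only for illustration.

```python
import random


def fake_quantize(weights, bits):
    """Quantize then dequantize a 2-D list with symmetric uniform quantization.

    Assumption: one scale per matrix; the paper's actual quantizer may differ.
    """
    max_abs = max(abs(w) for row in weights for w in row)
    if max_abs == 0:
        return [row[:] for row in weights]
    levels = 2 ** (bits - 1) - 1  # e.g. 7 positive levels for 4-bit
    scale = max_abs / levels
    return [[round(w / scale) * scale for w in row] for row in weights]


def matvec(matrix, vector):
    """Multiply a 2-D list by a 1-D list."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class DecoupledQuantLinear:
    """Frozen base weight shared by all subnets, one adapter pair per bit-width.

    Hypothetical illustration: because each bit-width owns its adapter,
    updating one subnet's adapter cannot change another subnet's output.
    """

    def __init__(self, weights, bit_widths, rank, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weights), len(weights[0])
        self.weights = weights  # frozen, never updated in this sketch
        self.adapters = {}
        for bits in bit_widths:
            # Down-projection A starts small and random; up-projection B
            # starts at zero, so each adapter initially contributes nothing.
            a = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
            b = [[0.0] * rank for _ in range(out_dim)]
            self.adapters[bits] = (a, b)

    def forward(self, x, bits):
        """Compute Q_bits(W) x + B_bits (A_bits x) for the chosen subnet."""
        base = matvec(fake_quantize(self.weights, bits), x)
        a, b = self.adapters[bits]
        delta = matvec(b, matvec(a, x))
        return [p + q for p, q in zip(base, delta)]


if __name__ == "__main__":
    layer = DecoupledQuantLinear([[0.5, -1.0], [0.25, 0.75]], bit_widths=[2, 4], rank=1)
    print(layer.forward([1.0, 2.0], bits=4))
    print(layer.forward([1.0, 2.0], bits=2))
```

In this sketch, choosing a subnet means choosing `bits` at call time. The base weights stay shared, and only the selected adapter pair would receive gradient updates during training.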
Findings
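Maintains high performance across multiple deployment scenarios.
Significantly reduces deployment time by replacing repeated per-scenario training with a single training run.
Allocates training resources effectively among quantized subnets.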
Abstract
Large Language Models (LLMs) have advanced rapidly but face significant memory demands. While quantization has shown promise for LLMs, current methods typically require lengthy training to alleviate the performance degradation from quantization loss. However, deploying LLMs across diverse scenarios with different resource constraints, e.g., servers and personal computers, requires repeated training per application, which amplifies the lengthy training problem. Given that, it is advantageous to train a once-for-all (OFA) supernet capable of yielding diverse optimal subnets for downstream applications through one-shot training. Nonetheless, the scale of current language models impedes efficiency and amplifies interference from weight sharing between subnets. We make an initial attempt to extend the once-for-all framework to large language models. Specifically, we decouple shared weights…
Taxonomy
Topics: Neural Networks and Reservoir Computing · Machine Learning and ELM · Semiconductor materials and devices
