QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources
Zhikai Li, Xiaoxuan Liu, Banghua Zhu, Zhen Dong, Qingyi Gu, Kurt Keutzer

TL;DR
QFT introduces a quantization framework that enables full-parameter fine-tuning of large language models using significantly less memory, making it accessible on affordable hardware without sacrificing performance.
Contribution
It proposes a novel quantization method for all training states, allowing full-parameter fine-tuning of LLMs on standard GPUs at reduced cost.
Findings
Reduces training memory to 21% of standard methods
Enables fine-tuning of LLaMA-7B on a single GPU
Achieves performance comparable to full-precision training
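As a rough illustration of where a figure of this size can come from, the sketch below counts bytes per parameter for two setups. Both setups are assumptions made here for illustration, not the paper's stated baseline: FP32 training with Adam (weights, gradients, and two optimizer moments) versus INT8 storage with a momentum-only optimizer such as Lion, which the reviews below mention (weights, gradients, and one momentum). The 7B parameter count is nominal, and activations, quantization scales, and other overheads are ignored.

```python
def training_state_bytes(num_params, bytes_per_value, states_per_param):
    """Bytes needed to hold `states_per_param` tensors of `num_params` values each."""
    return num_params * bytes_per_value * states_per_param


NUM_PARAMS = 7_000_000_000  # nominal size of a "7B" model (assumption)

# Assumed baseline: FP32 weights + gradients + Adam first and second moments.
fp32_adam = training_state_bytes(NUM_PARAMS, bytes_per_value=4, states_per_param=4)

# Assumed quantized setup: INT8 weights + gradients + a single momentum tensor.
int8_lion = training_state_bytes(NUM_PARAMS, bytes_per_value=1, states_per_param=3)

ratio = int8_lion / fp32_adam

# Dropping one of four equally sized state tensors (Adam -> momentum-only).
lion_vs_adam_saving = 1 - 3 / 4

print(f"FP32 + Adam: {fp32_adam / 1e9:.0f} GB")
print(f"INT8 + Lion: {int8_lion / 1e9:.0f} GB")
print(f"ratio: {ratio:.4f}, optimizer switch alone saves {lion_vs_adam_saving:.0%}")
```

This idealized count gives a ratio of 3/16, a little below the reported 21%; the excerpt does not say what accounts for the difference. The 25% saving from switching optimizers matches the figure one reviewer cites for model states.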
Abstract
Large Language Models (LLMs) have showcased remarkable impacts across a wide spectrum of natural language processing tasks. Fine-tuning these pretrained models on downstream datasets provides further significant performance gains; however, this process typically requires a large number of expensive, high-end GPUs. Although there have been efforts focused on parameter-efficient fine-tuning, they cannot fully unlock the powerful potential of full-parameter fine-tuning. In this paper, we propose QFT, a Quantized Full-parameter Tuning framework for LLMs that quantizes and stores all training states, including weights, gradients, and optimizer states, in INT8 format to reduce training memory, thereby enabling full-parameter fine-tuning on existing GPUs at an affordable cost. To ensure training performance, we make two key efforts: i) for quantized gradients and optimizer states, we…
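The idea of keeping a training state in INT8 and restoring it to floating point for computation can be sketched with a simple round trip. This is a minimal illustration using symmetric per-tensor uniform quantization, which is an assumption made here; the abstract is truncated before it describes QFT's actual quantizers, and the function names `quantize_int8` and `dequantize_int8` are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: floats -> (integer codes in [-127, 127], scale)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate float values from the stored codes and scale."""
    return [c * scale for c in codes]


# Toy stand-in for a slice of an optimizer state (values are made up).
momentum = [0.02, -0.5, 0.131, 0.0, 0.3]

codes, scale = quantize_int8(momentum)       # what would be stored (1 byte per value)
restored = dequantize_int8(codes, scale)     # what would be used for computation
max_error = max(abs(a - b) for a, b in zip(momentum, restored))

print(codes)
print(f"scale={scale:.6f}, max round-trip error={max_error:.6f}")
```

With this scheme the round-trip error of any value is bounded by half the scale, so a single large outlier widens the step size for every other value in the tensor. That sensitivity is one reason distribution-aware quantizers, as the reviews below describe for QFT, can be preferable to a single uniform scale.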
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
The paper tackles an important and timely problem of how to achieve both high efficiency and performance in model finetuning. The approach of quantized storage and full-precision computation is intuitive and promising. The paper is very well written, organized, and easy to understand.
The main concern is that the draft does not demonstrate the generality of QFT. I am fairly convinced of the effectiveness of QFT for LLaMA-2 finetuning, but not much beyond that. The reason is that the analysis (3.2) and evaluation only studied LLaMA-2 7B & 13B. Some specific questions/suggestions on this are the following: 1. It seems that QFT should be composable with Adam and bitsandbytes, and so I feel that Tables 1 & 2 would be greatly improved by adding QFT-Adam, QFT-Adam-mixed, and QFT-bitsnbyte
This paper contributes to democratizing full-parameter finetuning of LLMs. This is an important problem. This paper showed the effectiveness of quantization with the Lion optimizer in reducing the memory requirements of finetuning LLMs. The LLM quantized by the proposed methods shows comparable performance on various benchmarks compared to the baseline, which uses full-precision finetuning. The proposed gradient flow and parameter update algorithm is compatible with existing deep learning framewor
Overall, the paper is well written and easy to follow. However, I do not see enough innovation in this paper. Indeed, democratizing LLM fine-tuning is an important problem, and reducing memory usage is one important step. The problem is that the paper seems to quantize the optimizer states without solving many technical challenges. The authors seem to rely on the Lion optimizer without much justification. I understand that it only stores the momentum, but there are other optimizers that st
+ The proposed framework leverages the Lion optimizer's capabilities, resulting in a notable 25% reduction in memory usage for model states.
+ The strategy of quantizing various model states and parameters according to their distributions represents an interesting method for addressing memory demands.
+ Overall writing is well-structured and easily understandable.
- Evaluation assesses the effectiveness of the proposed framework only in terms of memory and performance efficiency.
- To further validate the proposed framework's effectiveness, the inclusion of additional models in addition to the Vicuna models would be advantageous.
1) Paper is reasonably well written.
2) Achieving memory reduction for fine-tuning LLaMA-2 7B/13B models with comparable accuracy with baseline.
1). Limited novelty. The weight quantization method in the paper is directly borrowed from the reference (SqueezeLLM). The authors over-claimed about their contribution by stating “uncover an intriguing pattern” of weight distribution. However, this pattern has already been thoroughly explained in detail in the SqueezeLLM paper and several related publications on LLM quantization. Utilizing the Lion optimizer to save the memory for variance hardly constitutes a novel contribution. 2). Limited
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Evolved Sign Momentum · Focus
