QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models
Yuhui Xu, Lingxi Xie, Xiaotao Gu, Xin Chen, Heng Chang, Hengheng Zhang, Zhengsu Chen, Xiaopeng Zhang, Qi Tian

TL;DR
QA-LoRA introduces a quantization-aware low-rank adaptation method that enables efficient fine-tuning of large language models with reduced computational resources while maintaining accuracy.
Contribution
It presents a novel algorithm combining quantization and low-rank adaptation, improving efficiency and ease of implementation for large language model fine-tuning.
Findings
Effective in reducing memory and time during fine-tuning
Maintains accuracy after quantization and adaptation
Applicable to LLaMA and LLaMA2 models across various tasks
Abstract
Recent years have witnessed a rapid development of large language models (LLMs). Despite their strong ability in many language-understanding tasks, the heavy computational burden largely restricts the application of LLMs, especially when one needs to deploy them onto edge devices. In this paper, we propose a quantization-aware low-rank adaptation (QA-LoRA) algorithm. The motivation lies in the imbalanced degrees of freedom of quantization and adaptation, and the solution is to use group-wise operators which increase the degree of freedom of quantization while decreasing that of adaptation. QA-LoRA is easily implemented with a few lines of code, and it equips the original LoRA with two-fold abilities: (i) during fine-tuning, the LLM's weights are quantized (e.g., into INT4) to reduce time and memory usage; (ii) after fine-tuning, the LLM and auxiliary weights are naturally integrated…
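The group-wise idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' code: the function names, the min-max quantizer, and the toy shapes are assumptions chosen for illustration. The sketch quantizes each weight column in `num_groups` row groups (more degrees of freedom for quantization) and gives the adapter matrix `A` only one row per group, applied to a group-summed input (fewer degrees of freedom for adaptation). Under these assumptions the adapter update is constant within each group, so it can be folded into the per-group zero points while the integer weights stay untouched.

```python
import numpy as np

def quantize_groupwise(W, num_groups, bits=4):
    """Min-max quantize each column of W (d_in x d_out) in num_groups row groups.
    The min-max scheme is an assumption of this sketch."""
    d_in, d_out = W.shape
    g = d_in // num_groups                      # rows per group
    Wg = W.reshape(num_groups, g, d_out)
    beta = Wg.min(axis=1)                       # zero points, shape (L, d_out)
    alpha = (Wg.max(axis=1) - beta) / (2**bits - 1)
    alpha = np.where(alpha == 0, 1.0, alpha)    # guard against constant groups
    W_int = np.round((Wg - beta[:, None, :]) / alpha[:, None, :]).astype(np.uint8)
    return W_int, alpha, beta

def dequantize(W_int, alpha, beta):
    """Rebuild a float (d_in x d_out) matrix from integers, scales, zero points."""
    L, g, d_out = W_int.shape
    return (alpha[:, None, :] * W_int + beta[:, None, :]).reshape(L * g, d_out)

def forward_with_adapter(x, W_int, alpha, beta, A, B, s):
    """Quantized base layer plus a low-rank adapter fed with group-summed inputs."""
    L = W_int.shape[0]
    x_pooled = x.reshape(x.shape[0], L, -1).sum(axis=2)   # (batch, L)
    return x @ dequantize(W_int, alpha, beta) + s * (x_pooled @ A @ B)

def merge_into_zero_points(beta, A, B, s):
    """Fold the adapter into the zero points; W_int and alpha are unchanged."""
    return beta + s * (A @ B)                   # (L, d_out), same shape as beta

# Toy demo with hypothetical sizes: d_in=8, d_out=3, L=2 groups, rank 2.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 3))
x = rng.normal(size=(5, 8))
A = rng.normal(size=(2, 2))                     # L x rank (not d_in x rank)
B = rng.normal(size=(2, 3))                     # rank x d_out
W_int, alpha, beta = quantize_groupwise(W, num_groups=2)
y_adapter = forward_with_adapter(x, W_int, alpha, beta, A, B, s=0.5)
y_merged = x @ dequantize(W_int, alpha, merge_into_zero_points(beta, A, B, s=0.5))
print("max |adapter - merged| =", np.abs(y_adapter - y_merged).max())
```

The two outputs agree up to floating-point error because summing the inputs within a group and multiplying by one adapter row is the same as adding that row's value to every weight in the group, which is exactly what a shift of the group's zero point does.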
Peer Reviews
Decision·ICLR 2024 poster
**Addresses a Significant Issue** - QLoRA's potential is realized through its ability to quantize LoRA weights, effectively resolving the disparities observed between fine-tuning and inference in QLoRA. **Streamlined Implementation** - The authors highlight the method's simplicity, emphasizing that it necessitates a mere two lines of code modification to yield impressive enhancements. **Thorough Assessment** - The evaluation is meticulous, with the authors examining a spectrum of competitive m
Reasoning behind the method - It is not clear why all the c_ij as defined in the paper should be equal, which is the main motivation for the group-wise quantisation. I would be willing to raise my scores given a better explanation of the method (see the questions).
- This work solves a limitation of previous parameter-efficient tuning of LLMs by eliminating the need for a separate post-training quantization which drops model accuracy - QA-LoRA further enhances memory efficiency of SOTA while preserving accuracy - The experiments are convincing as they cover a wide range of scenarios
QA-LoRA introduces a hyper-parameter (L: group size). This requires additional optimization, and it is unclear whether it can be selected without tuning.
* The paper organization, presentation, and references are good. * The proposed method has enough novelty.
* Parameter offset in experiments: The proposed method incorporates group-wise/sub-channel quantization, which adds parameters for the scales. Also, the proposed QA-LoRA reduces the size of the low-rank matrices. However, these parameter offsets are not reflected in the results, which could mislead readers. It would be more informative to add the actual (or estimated) model size in MB/GB for each of the models. * In the ablation study, only group size is examin
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
