L4Q: Parameter Efficient Quantization-Aware Fine-Tuning on Large Language Models
Hyesung Jeon, Yulhwa Kim, Jae-joon Kim

TL;DR
L4Q introduces a memory-efficient quantization-aware training method combined with LoRA for large language models, achieving high accuracy in low-bit quantization while reducing training and inference costs.
Contribution
The paper presents L4Q, a novel integration of QAT and LoRA with a memory-optimized layer design, enabling fully quantized LLMs with minimal accuracy loss and comparable training costs to PEFT methods.
Findings
L4Q outperforms decoupled schemes in accuracy at 4-bit and 3-bit quantization.
L4Q maintains high accuracy with reduced memory overhead during training.
Experiments on LLaMA and Mistral demonstrate effectiveness in language tasks and few-shot learning.
Abstract
Due to the high memory and computational costs associated with large language models (LLMs), model compression techniques such as quantization, which reduces inference costs, and parameter-efficient fine-tuning (PEFT) methods like Low-Rank Adaptation (LoRA), which reduce training costs, have gained significant popularity. This trend has spurred active research into quantization-aware PEFT techniques, aimed at maintaining model accuracy while minimizing memory overhead during both inference and training. Previous quantization-aware PEFT methods typically apply post-training quantization (PTQ) to pre-trained LLMs, followed by PEFT to recover the accuracy loss. However, this approach is limited in how much of the accuracy loss it can recover. In this paper, we propose L4Q, a method that integrates Quantization-Aware Training (QAT) with LoRA. By employing a memory-optimized layer design, L4Q…
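The distinction the abstract draws, between quantizing first and fine-tuning afterward versus placing the LoRA update inside the quantizer, can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: the helper names (`fake_quantize`, `decoupled_weight`, `merged_weight`, `on_grid`), the fixed scale, the symmetric round-and-clip quantizer, and the example values are all assumptions made for illustration, and training is omitted entirely.

```python
# Toy illustration only: where the LoRA update sits relative to round/clip.
# All names and values here are hypothetical, not taken from the L4Q code.

def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def fake_quantize(w, bits, scale):
    """Round to a signed integer grid, clip to the bit range, then rescale."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return [[scale * max(qmin, min(qmax, round(x / scale))) for x in row] for row in w]

def decoupled_weight(w0, lora_b, lora_a, bits, scale):
    """Quantize the base weight first, then add a full-precision LoRA update."""
    return add(fake_quantize(w0, bits, scale), matmul(lora_b, lora_a))

def merged_weight(w0, lora_b, lora_a, bits, scale):
    """Add the LoRA update to the base weight, then quantize the sum."""
    return fake_quantize(add(w0, matmul(lora_b, lora_a)), bits, scale)

def on_grid(w, scale, tol=1e-9):
    """True if every entry is (numerically) an integer multiple of scale."""
    return all(abs(x / scale - round(x / scale)) < tol for row in w for x in row)

# Assumed example values: a 2x2 base weight and a rank-1 LoRA pair.
w0 = [[0.30, -0.52], [0.12, 0.74]]
lora_b = [[0.1], [-0.2]]
lora_a = [[0.3, 0.5]]
bits, scale = 4, 0.1

w_decoupled = decoupled_weight(w0, lora_b, lora_a, bits, scale)
w_merged = merged_weight(w0, lora_b, lora_a, bits, scale)
print("decoupled on grid:", on_grid(w_decoupled, scale))
print("merged on grid:   ", on_grid(w_merged, scale))
```

In this toy setting the merged weight lands on the quantization grid, whereas the decoupled weight generally does not, so a decoupled scheme would have to keep a higher-precision LoRA branch or re-quantize after merging.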
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
- It is crucial to improve the efficiency of quantization-aware training on LLMs, as it can yield more accurate low-precision models.
- The paper is well-written and easy to understand.
- The discussion of QAT-LoRA conceptually overlaps with some discussions in QA-LoRA, e.g., the precision mismatch between the LoRA module and the dense layer. Moreover, simply inserting the LoRA module into the quantizer seems too straightforward, and it raises further undiscussed issues. Firstly, how are the LoRA modules initialized? Since the LoRA modules are inside the quantizer, can we use the quantization error to initialize A and B, as in LoftQ? Secondly, why the training time costs are not increased c
* Doing memory-efficient QAT and considering inference efficiency for PEFT is very relevant and timely. Only limited literature addresses the combination of both.
* While the method is simple, it seems to be effective based on the authors' evaluation.
* It is good that they show the actual inference speedup; this is not always the case in the quantization literature.
* The paper is well written and easy to follow.
* The main weakness of this work is the limited technical novelty, as L4Q can be seen as a straightforward integration of LoRA and QAT (the LoRA adapter is moved inside the round/clip of the quantization). That being said, I do acknowledge this can come with several engineering challenges to make it actually memory efficient in practice.
* For the results, some baselines and literature comparisons are missing, e.g. traditional QAT (LSQ, LSQ+), other LLM-based QAT work (e.g. LLM-QAT [1]) and more recent PT
1. Combining QAT with LoRA is a great approach, as LoRA can be integrated with the quantized model without increasing inference time due to the retained LoRA branches.
2. The authors provide specific implementation details and code to ensure the reproducibility of the method.
I believe the main weakness of this paper is that the experiments are insufficient to validate the effectiveness of the L4Q method, including but not limited to the following:
1. The data in this paper does not match that of the original study. For example, LLaMA-1-7B with 4-bit quantization and a group size of 128 achieves 38.4% accuracy on the MMLU (5-shot) benchmark using QA-LoRA (Table 6 in QA-LoRA, an improvement of 3.8% over the fp16 LLaMA-1-7B baseline), but in this paper, Q
1. Efficiently restoring the performance of quantized models is crucial.
2. The paper is well-written and easy to follow.
3. The proposed method is OK, and the experimental results are strong.
1. The method is incremental. I am not confident that the contributions are sufficient.
2. The proposed method seems to sacrifice training efficiency (training memory) for better performance.
3. High similarity to the existing method QLLM [1], or at least a special case of it.
[1] QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling
Methods: Adapter
