TL;DR
This paper introduces Q-BLoRA and QA-BLoRA, novel methods for fine-tuning and deploying quantized large language models that improve accuracy and efficiency by balancing adapter complexity and trainability.
Contribution
It proposes balanced low-rank adaptation techniques that enhance fine-tuning and low-precision deployment of quantized LLMs, addressing performance degradation issues.
Findings
Q-BLoRA achieves state-of-the-art accuracy in fine-tuning quantized LLMs.
QA-BLoRA enables effective low-precision inference models.
Both methods outperform existing baselines in various scenarios.
Abstract
Large Language Models (LLMs) have demonstrated impressive performance across various domains. However, the enormous number of model parameters makes fine-tuning challenging, significantly limiting their application and deployment. Existing solutions combine parameter quantization with Low-Rank Adaptation (LoRA), reducing memory usage but causing performance degradation. Additionally, converting fine-tuned models to low-precision representations further degrades performance. In this paper, we identify an imbalance in fine-tuning quantized LLMs with LoRA: overly complex adapter inputs and outputs versus low effective trainability of the adapter, leading to underfitting during fine-tuning. Thus, we propose Quantized LLMs fine-tuning with Balanced Low-Rank Adaptation (Q-BLoRA), which simplifies the adapter's inputs and outputs while increasing the adapter's rank to alleviate underfitting…
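The baseline setup the abstract refers to, a frozen quantized weight matrix plus a trainable low-rank adapter, can be sketched as follows. This is a minimal illustration of generic quantization combined with LoRA, not the Q-BLoRA or QA-BLoRA method itself, whose details are not given in the text above. The function names, the per-matrix uniform quantizer, and the `alpha / r` scaling are assumptions chosen for illustration.

```python
import random

def quantize_uniform(weights, bits=4):
    """Round-to-nearest uniform quantization with one scale per matrix (an assumed, simple scheme)."""
    levels = 2 ** bits - 1
    flat = [w for row in weights for w in row]
    lo, hi = min(flat), max(flat)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 if all weights are equal
    codes = [[round((w - lo) / scale) for w in row] for row in weights]
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Map integer codes back to approximate floating-point weights."""
    return [[c * scale + lo for c in row] for row in codes]

def matvec(m, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def lora_forward(codes, scale, lo, A, B, x, alpha=1.0):
    """y = dequant(W_q) x + (alpha / r) * B (A x); only A and B would be trained."""
    r = len(A)
    base = matvec(dequantize(codes, scale, lo), x)
    delta = matvec(B, matvec(A, x))
    return [b + (alpha / r) * d for b, d in zip(base, delta)]

def lora_param_count(d_in, d_out, r):
    """Trainable parameters in the adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

random.seed(0)
d_in, d_out, r = 8, 6, 2
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
codes, scale, lo = quantize_uniform(W, bits=4)
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # zero init: the adapter starts as a no-op
x = [1.0] * d_in
y = lora_forward(codes, scale, lo, A, B, x)
```

In this sketch the adapter's trainable parameter count grows linearly with the rank `r`, which is the quantity the abstract says Q-BLoRA increases to alleviate underfitting. How Q-BLoRA simplifies the adapter's inputs and outputs is not shown here.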
Taxonomy
Methods: Adapter · LLaMA · ALIGN
