TL;DR
FALQON is a framework that accelerates LoRA fine-tuning of large language models by directly merging adapters into an FP8-quantized backbone, reducing overhead and achieving significant speedups without sacrificing accuracy.
Contribution
FALQON introduces a novel method to eliminate quantization overhead in LoRA fine-tuning by merging adapters into FP8-quantized models, enabling faster training.
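The difference between a separate adapter path and a merged adapter can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not FALQON's implementation: `fake_quantize` is a hypothetical stand-in that simulates low-precision rounding (NumPy has no FP8 type), the matrix sizes are arbitrary, and the sketch does not model how the paper updates the merged weights during training.

```python
import numpy as np

def fake_quantize(x, mantissa_bits=3):
    """Hypothetical stand-in for FP8 quantization (assumption, not a real kernel):
    per-tensor scaling, then rounding the mantissa to a few bits.
    The exponent range is not clamped, unlike real FP8 formats."""
    scale = np.max(np.abs(x)) + 1e-12
    mant, exp = np.frexp(x / scale)          # x/scale = mant * 2**exp
    step = 2.0 ** (mantissa_bits + 1)
    mant = np.round(mant * step) / step      # keep only a few mantissa bits
    return np.ldexp(mant, exp) * scale

def separate_forward(X, W, A, B):
    """Quantized backbone matmul plus two extra small adapter matmuls,
    each with its own quantization step."""
    backbone = fake_quantize(X) @ fake_quantize(W).T
    down = fake_quantize(X) @ fake_quantize(A).T
    up = fake_quantize(down) @ fake_quantize(B).T
    return backbone + up

def merged_forward(X, W, A, B):
    """Adapter folded into the weight first, so only one quantized matmul runs."""
    W_merged = fake_quantize(W + B @ A)
    return fake_quantize(X) @ W_merged.T

rng = np.random.default_rng(0)
n, d, r = 8, 64, 4                            # illustrative sizes only
X = rng.standard_normal((n, d))
W = rng.standard_normal((d, d))
A = rng.standard_normal((r, d)) * 0.1         # LoRA down-projection
B = rng.standard_normal((d, r)) * 0.1         # LoRA up-projection

reference = X @ (W + B @ A).T                 # full-precision result
y_sep = separate_forward(X, W, A, B)
y_mrg = merged_forward(X, W, A, B)

def rel_err(y):
    return np.linalg.norm(y - reference) / np.linalg.norm(reference)

print("separate path relative error:", rel_err(y_sep))
print("merged path relative error:  ", rel_err(y_mrg))
```

Both paths approximate the same full-precision output; the merged path simply performs one quantized matmul instead of three.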
Findings
Achieves approximately 3× training speedup over existing quantized LoRA methods.
Maintains accuracy comparable to existing methods.
Enables an end-to-end FP8 workflow without post-training quantization.
Abstract
Low-bit floating-point (FP) formats, such as FP8, provide significant acceleration and memory savings in model training thanks to native hardware support on modern GPUs and NPUs. However, our analysis shows that FP8 quantization offers speedup primarily for large-dimensional matrix multiplications, while inherent quantization overheads diminish the speedup when it is applied to low-rank adaptation (LoRA), which uses small-dimensional matrices for efficient fine-tuning of large language models (LLMs). To address this limitation, we propose FALQON, a novel framework that eliminates the quantization overhead from separate LoRA computational paths by directly merging LoRA adapters into an FP8-quantized backbone during fine-tuning. Furthermore, we reformulate the forward and backward computations for merged adapters to significantly reduce quantization overhead, and introduce a row-wise proxy update mechanism…
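A back-of-envelope calculation suggests why small LoRA matrices benefit less from FP8 than the backbone does. The sketch below uses arithmetic intensity (matmul FLOPs per matrix element read or written) as a rough proxy: elementwise work such as quantization grows with the element count, while the part FP8 accelerates grows with the FLOPs. The metric, the function name, and the sizes (4096 tokens, hidden size 4096, rank 16) are assumptions chosen for illustration, not figures from the paper.

```python
def arithmetic_intensity(n, d_in, d_out):
    """Matmul FLOPs for (n x d_in) @ (d_in x d_out) divided by the number of
    elements touched (both inputs plus the output)."""
    matmul_flops = 2 * n * d_in * d_out
    elements = n * d_in + d_in * d_out + n * d_out
    return matmul_flops / elements

n, d, r = 4096, 4096, 16                      # illustrative sizes (assumed)

backbone = arithmetic_intensity(n, d, d)      # X @ W^T
lora_down = arithmetic_intensity(n, d, r)     # X @ A^T
lora_up = arithmetic_intensity(n, r, d)       # (X A^T) @ B^T

print(f"backbone : {backbone:8.1f} FLOPs per element")
print(f"LoRA down: {lora_down:8.1f} FLOPs per element")
print(f"LoRA up  : {lora_up:8.1f} FLOPs per element")
```

Under these assumptions the backbone matmul does far more arithmetic per element than either LoRA matmul, so a fixed per-element quantization cost takes up a much larger share of the adapter path.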
