LoTA-QAF: Lossless Ternary Adaptation for Quantization-Aware Fine-Tuning
Junyu Chen, Junzhuo Li, Zhen Peng, Wenjie Wang, Yuxiang Ren, Long Shi, Xuming Hu

TL;DR
LoTA-QAF introduces a lossless ternary adaptation method for quantization-aware fine-tuning of large language models, enabling efficient merging of adaptation weights into quantized models and improving performance on downstream tasks.
Contribution
The paper presents a novel lossless ternary adaptation technique that allows all quantized weights to be adjusted and merged without accuracy loss during fine-tuning.
Findings
Effectively recovers the performance of quantized models on the MMLU benchmark.
Yields larger accuracy gains than 16-bit LoRA.
Validated across multiple LLM families.
Abstract
Quantization and fine-tuning are crucial for deploying large language models (LLMs) on resource-constrained edge devices. However, fine-tuning quantized models presents three significant challenges. First, the data types of the low-precision quantized weights (e.g., 4-bit) and the high-precision adaptation weights (e.g., 16-bit) do not match, which limits the computational efficiency advantage that quantized weights offer during inference. Second, merging these high-precision adaptation weights into the low-precision quantized weights can degrade accuracy, because the adaptation weights often have to be approximated or truncated. Third, as far as we know, no existing method supports lossless merging of adaptation weights while adjusting all quantized weights. To address these challenges, we introduce lossless ternary adaptation for…
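The merging problem described in the abstract can be illustrated with a small numerical sketch. This is not the authors' implementation: the scale, zero point, matrix size, and random deltas below are all hypothetical, and the sketch does not show how LoTA-QAF derives its ternary adjustment. It only shows why a floating-point delta has to be re-rounded onto a 4-bit grid (a lossy step), whereas an integer delta with entries in {-1, 0, 1} lands exactly on the grid.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4-bit quantization: integer codes in [0, 15], one scale, one zero point.
scale, zero = 0.1, 8
q = rng.integers(1, 15, size=(4, 4))  # codes in [1, 14], so a +/-1 shift never clips


def dequant(codes):
    """Map integer codes back to real-valued weights."""
    return scale * (codes - zero)


# Case 1: high-precision adapter. The delta is a float that falls between grid points,
# so merging it means rounding back to the nearest 4-bit code.
delta_fp = rng.normal(0.0, 0.03, size=q.shape)
target_fp = dequant(q) + delta_fp  # weights the unmerged model effectively uses
merged_fp = np.clip(np.rint(target_fp / scale) + zero, 0, 15).astype(int)
merge_error_fp = np.abs(dequant(merged_fp) - target_fp).max()

# Case 2: ternary adjustment. Each entry shifts a code by -1, 0, or +1 grid steps,
# so the merged codes are still valid 4-bit integers and no rounding is needed.
delta_t = rng.integers(-1, 2, size=q.shape)  # values in {-1, 0, 1}
target_t = dequant(q) + scale * delta_t
merged_t = q + delta_t
merge_error_t = np.abs(dequant(merged_t) - target_t).max()

print("max merge error, float delta:  ", merge_error_fp)
print("max merge error, ternary delta:", merge_error_t)
```

In this toy setting the float delta leaves a visible rounding error after merging, while the ternary delta merges with zero error up to floating-point rounding. The codes are drawn from [1, 14] on purpose: a code already at 0 or 15 could not absorb a shift without clipping, which is an edge case this sketch does not address.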
Taxonomy
Topics: Medical Imaging Techniques and Applications · CCD and CMOS Imaging Sensors · Advanced Memory and Neural Computing
