ApiQ: Finetuning of 2-Bit Quantized Large Language Model
Baohao Liao, Christian Herold, Shahram Khadivi, Christof Monz

TL;DR
ApiQ is a quantization framework that preserves pretrained knowledge when finetuning low-bit quantized large language models, improving performance across diverse tasks and bit-widths.
Contribution
The paper presents ApiQ, a new quantization method that initializes LoRA components and quantizes weights simultaneously to reduce information loss during low-bit LLM finetuning.
Findings
ApiQ minimizes activation error during quantization.
ApiQ achieves superior finetuning results across various bit-widths.
ApiQ maintains activation precision while reducing error propagation.
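
The difference between weight error and activation error can be made concrete with a toy example. The sketch below is not ApiQ's procedure. It assumes a plain round-to-nearest quantizer and a rank-1 correction fitted to the weight residual by power iteration, both chosen only for illustration. All names (`quantize_rtn`, `rank1_correction`, `X`, `W`, `Q`, `A`, `B`) are hypothetical. It measures how far the quantized layer `Q`, and the corrected layer `Q + A·B`, deviate from the full-precision layer `W`, both in weight space and on the layer's outputs for inputs `X`.

```python
import random

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def frob(a):
    # Frobenius norm of a matrix stored as a list of rows
    return sum(x * x for row in a for x in row) ** 0.5

def quantize_rtn(w, bits):
    # Assumption: uniform round-to-nearest over the whole matrix (a simple stand-in)
    flat = [x for row in w for x in row]
    lo, hi = min(flat), max(flat)
    scale = (hi - lo) / (2 ** bits - 1)
    if scale == 0.0:
        return [row[:] for row in w]
    return [[lo + round((x - lo) / scale) * scale for x in row] for row in w]

def rank1_correction(r, iters=50):
    # Assumption: rank-1 fit of the residual r by power iteration.
    # Returns a (m x 1) and b (1 x n) so that a @ b approximates r.
    rt = [list(col) for col in zip(*r)]
    v = [[1.0] for _ in rt]
    u = [[0.0] for _ in r]
    for _ in range(iters):
        ru = matmul(r, v)
        norm = frob(ru)
        if norm == 0.0:
            break
        u = [[x[0] / norm] for x in ru]
        v = matmul(rt, u)
    return u, [[x[0] for x in v]]

random.seed(0)
X = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]  # 16 inputs, 8 features
W = [[random.gauss(0, 1) for _ in range(4)] for _ in range(8)]   # full-precision weights

Q = quantize_rtn(W, bits=2)            # 2-bit weights: at most 4 distinct values
A, B = rank1_correction(sub(W, Q))     # low-rank term fitted to the weight residual
W_hat = add(Q, matmul(A, B))           # quantized weights plus low-rank correction

weight_err_q = frob(sub(W, Q))
weight_err_lora = frob(sub(W, W_hat))
act_err_q = frob(sub(matmul(X, W), matmul(X, Q)))
act_err_lora = frob(sub(matmul(X, W), matmul(X, W_hat)))

print("weight error, Q only:      ", weight_err_q)
print("weight error, Q + A*B:     ", weight_err_lora)
print("activation error, Q only:  ", act_err_q)
print("activation error, Q + A*B: ", act_err_lora)
```

In this toy setup the rank-1 term is fitted to the weight residual, so it cannot increase the weight error, but nothing guarantees that the activation error shrinks with it. An activation-aware objective, as the findings describe, would instead choose the quantized weights and the low-rank term to reduce the error on the layer outputs directly.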
Abstract
Memory-efficient finetuning of large language models (LLMs) has recently attracted huge attention with the increasing size of LLMs, primarily due to the constraints posed by GPU memory limitations and the effectiveness of these methods compared to full finetuning. Despite the advancements, current strategies for memory-efficient finetuning, such as QLoRA, exhibit inconsistent performance across diverse bit-width quantizations and multifaceted tasks. This inconsistency largely stems from the detrimental impact of the quantization process on preserved knowledge, leading to catastrophic forgetting and undermining the utilization of pretrained models for finetuning purposes. In this work, we introduce a novel quantization framework, ApiQ, designed to restore the lost information from quantization by concurrently initializing the LoRA components and quantizing the weights of LLMs. This…
Taxonomy
Topics: Neural Networks and Applications · Topic Modeling · Speech Recognition and Synthesis
