Unifying Block-wise PTQ and Distillation-based QAT for Progressive Quantization toward 2-bit Instruction-Tuned LLMs
Jung Hyun Lee, Seungjae Shin, Vinnam Kim, Jaeseong You, An Chen

TL;DR
This paper introduces UPQ, a unified progressive quantization framework that combines block-wise PTQ and distillation-based QAT to effectively quantize instruction-tuned LLMs to 2-bit, achieving state-of-the-art performance without proprietary data.
Contribution
The paper presents UPQ, the first framework to successfully quantize instruction-tuned LLMs to 2-bit using a combination of PTQ and distillation-based QAT.
Findings
Achieves state-of-the-art results on MMLU and IFEval benchmarks.
Quantizes open-source instruction-tuned LLMs to 2-bit without proprietary data.
Reduces quantization error through block-wise PTQ before applying distillation-based QAT.
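To make the FP16→INT4→INT2 staging concrete, here is a minimal sketch in plain Python. It uses simple per-tensor round-to-nearest quantization, which is an assumption made for illustration only: it is not the paper's block-wise PTQ or its distillation-based QAT, and the helper names (`quantize_rtn`, `mse`) are hypothetical. It only shows how the number of representable levels shrinks from 16 to 4 and how the reconstruction error grows at 2-bit, which is the gap the later training stage is meant to close.

```python
import random


def quantize_rtn(weights, bits):
    """Asymmetric uniform round-to-nearest quantization (illustrative only).

    Maps each weight onto one of 2**bits evenly spaced levels between the
    minimum and maximum weight, then returns the dequantized values.
    """
    levels = 2 ** bits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    dequantized = []
    for w in weights:
        q = round((w - w_min) / scale)   # integer code
        q = max(0, min(levels, q))       # clamp to the valid code range
        dequantized.append(q * scale + w_min)
    return dequantized


def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
# Stand-in for one FP16 weight tensor (assumed Gaussian for the demo).
fp_weights = [random.gauss(0.0, 1.0) for _ in range(1024)]

# Stage 1: full precision -> INT4 grid.
int4_weights = quantize_rtn(fp_weights, bits=4)
# Stage 2: INT4 grid -> INT2 grid.
int2_weights = quantize_rtn(int4_weights, bits=2)

err_int4 = mse(fp_weights, int4_weights)
err_int2 = mse(fp_weights, int2_weights)
print(f"INT4 levels used: {len(set(int4_weights))}, MSE: {err_int4:.4f}")
print(f"INT2 levels used: {len(set(int2_weights))}, MSE: {err_int2:.4f}")
```

In this toy setting the INT2 error is much larger than the INT4 error, which is consistent with the summary's point that the INT4 stage is where quantization error is kept small before further training toward INT2.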
Abstract
As the rapid scaling of large language models (LLMs) poses significant challenges for deployment on resource-constrained devices, there is growing interest in extremely low-bit quantization, such as 2-bit. Although prior works have shown that 2-bit large models are Pareto-optimal over their 4-bit smaller counterparts in both accuracy and latency, these advancements have been limited to pre-trained LLMs and have not yet been extended to instruction-tuned models. To bridge this gap, we propose Unified Progressive Quantization (UPQ), a novel progressive quantization framework (FP16→INT4→INT2) that unifies block-wise post-training quantization (PTQ) with distillation-based quantization-aware training (Distill-QAT) for INT2 instruction-tuned LLM quantization. UPQ first quantizes FP16 instruction-tuned models to INT4 using block-wise PTQ to significantly reduce the…
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Data Compression Techniques · Speech Recognition and Synthesis
