Unifying Block-wise PTQ and Distillation-based QAT for Progressive Quantization toward 2-bit Instruction-Tuned LLMs

Jung Hyun Lee; Seungjae Shin; Vinnam Kim; Jaeseong You; An Chen

arXiv:2506.09104·cs.LG·June 12, 2025

Open Access

TL;DR

This paper introduces UPQ, a unified progressive quantization framework that combines block-wise PTQ and distillation-based QAT to effectively quantize instruction-tuned LLMs to 2-bit, achieving state-of-the-art performance without proprietary data.

Contribution

The paper presents UPQ, the first framework to successfully quantize instruction-tuned LLMs to 2-bit using a combination of PTQ and distillation-based QAT.

Findings

01. Achieves state-of-the-art results on MMLU and IFEval benchmarks.

02. Quantizes open-source instruction-tuned LLMs to 2-bit without proprietary data.

03. Reduces quantization error through block-wise PTQ before applying distillation-based QAT.

Abstract

As the rapid scaling of large language models (LLMs) poses significant challenges for deployment on resource-constrained devices, there is growing interest in extremely low-bit quantization, such as 2-bit. Although prior works have shown that 2-bit large models are Pareto-optimal over their 4-bit smaller counterparts in both accuracy and latency, these advancements have been limited to pre-trained LLMs and have not yet been extended to instruction-tuned models. To bridge this gap, we propose Unified Progressive Quantization (UPQ), a novel progressive quantization framework (FP16 $\to$ INT4 $\to$ INT2) that unifies block-wise post-training quantization (PTQ) with distillation-based quantization-aware training (Distill-QAT) for INT2 instruction-tuned LLM quantization. UPQ first quantizes FP16 instruction-tuned models to INT4 using block-wise PTQ to significantly reduce the…
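
To give a feel for the bit-widths named in the abstract, here is a minimal sketch that applies plain round-to-nearest uniform quantization to synthetic weights at 4 and 2 bits and compares the reconstruction error. It is not the paper's block-wise PTQ or Distill-QAT procedure. The helper names (`quantize_rtn`, `mse`), the Gaussian weight distribution, and the per-tensor min/max grid are all assumptions made for illustration.

```python
import random


def quantize_rtn(weights, bits):
    """Asymmetric uniform round-to-nearest quantization (illustrative only).

    Maps each weight to one of 2**bits evenly spaced levels between the
    minimum and maximum weight, then returns the dequantized values.
    """
    levels = 2 ** bits - 1                      # steps between min and max
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels or 1.0     # fall back to 1.0 for constant input
    out = []
    for w in weights:
        q = round((w - w_min) / scale)          # integer code
        q = max(0, min(levels, q))              # clamp to [0, levels]
        out.append(q * scale + w_min)           # dequantize
    return out


def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
# Hypothetical weights: small zero-mean Gaussian values, not taken from any model.
weights = [random.gauss(0.0, 0.02) for _ in range(1024)]

w_int4 = quantize_rtn(weights, bits=4)   # 16 representable levels
w_int2 = quantize_rtn(weights, bits=2)   # 4 representable levels

err_int4 = mse(weights, w_int4)
err_int2 = mse(weights, w_int2)
print(f"INT4 MSE: {err_int4:.3e}")
print(f"INT2 MSE: {err_int2:.3e}")
```

With this min/max grid, the four INT2 levels are a subset of the sixteen INT4 levels, so the INT2 error can never be smaller than the INT4 error. The sketch only illustrates how much coarser the 2-bit grid is; how UPQ handles that gap is described in the paper itself.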

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Data Compression Techniques · Speech Recognition and Synthesis