PoTPTQ: A Two-step Power-of-Two Post-training for LLMs
Xinyu Wang, Vahid Partovi Nia, Peng Lu, Jerry Huang, Xiao-Wen Chang, Boxing Chen, Yufei Cui

TL;DR
This paper introduces PoTPTQ, a two-step post-training quantization method for LLMs that improves accuracy at low bit-widths and accelerates inference, making deployment more resource-efficient, especially on GPUs.
Contribution
The paper presents a novel two-step PoT quantization framework that enhances low-precision accuracy and dequantization speed for large language models.
Findings
Surpasses state-of-the-art methods in accuracy at 2- and 3-bit quantization.
Achieves a 3.67x dequantization speedup on NVIDIA V100 and 1.63x on NVIDIA RTX 4090.
Enables faster inference with minimal calibration data.
Abstract
Large Language Models (LLMs) have demonstrated remarkable performance across various natural language processing (NLP) tasks. However, their deployment is challenging due to the substantial computational resources required. Power-of-two (PoT) quantization is a general tool to counteract this difficulty. Although previous PoT quantization schemes can be dequantized efficiently on CPUs using fixed-point addition, they have proven less effective on GPUs. The reason is the entanglement of the sign bit and the sequential bit manipulations needed for dequantization. We propose a novel PoT quantization framework for LLM weights that (i) outperforms the state of the art in accuracy in extremely low-precision number formats, and (ii) enables faster inference through more efficient dequantization. To maintain the accuracy of the quantized model, we introduce a two-step post-training algorithm: (i) initialize the…
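To make the idea of power-of-two quantization concrete, here is a minimal sketch of a generic PoT scheme. It is not the paper's two-step algorithm or its GPU dequantization kernel. It assumes a single per-tensor scale, one sign bit, and nearest-exponent rounding in the log domain. The function names `pot_quantize` and `pot_dequantize` are hypothetical.

```python
import math

def pot_quantize(weights, bits=3):
    """Map each weight to (sign, e) so that w ~ (+/-) scale * 2**(-e).

    Illustrative assumption: one bit stores the sign and the remaining
    bits store a non-negative exponent e in [0, 2**(bits-1) - 1].
    """
    levels = 2 ** (bits - 1)
    scale = max(abs(w) for w in weights) or 1.0  # per-tensor scale (assumed)
    codes = []
    for w in weights:
        sign = 1 if w < 0 else 0
        mag = abs(w) / scale
        if mag == 0.0:
            # This toy format has no exact zero; use the smallest magnitude.
            e = levels - 1
        else:
            # Round in the log domain, then clamp to the representable range.
            e = min(max(round(-math.log2(mag)), 0), levels - 1)
        codes.append((sign, e))
    return codes, scale

def pot_dequantize(codes, scale):
    """Rebuild floats; ldexp(scale, -e) computes scale * 2**(-e)."""
    return [(-1.0 if s else 1.0) * math.ldexp(scale, -e) for s, e in codes]

codes, scale = pot_quantize([1.0, -0.5, 0.3, 0.0], bits=3)
print(codes)                         # [(0, 0), (1, 1), (0, 2), (0, 3)]
print(pot_dequantize(codes, scale))  # [1.0, -0.5, 0.25, 0.125]
```

Because every magnitude is the scale times a power of two, dequantization reduces to adjusting an exponent rather than performing a general multiplication, which is the property that makes PoT formats attractive for fast dequantization.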
Taxonomy
Topics: Natural Language Processing Techniques · Library Science and Information Systems
