VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

Yifei Liu; Jicheng Wen; Yang Wang; Shengyu Ye; Li Lyna Zhang; Ting Cao; Cheng Li; Mao Yang

arXiv:2409.17066 · cs.AI · October 23, 2024

TL;DR

This paper introduces VPTQ, a vector post-training quantization method that uses second-order optimization to quantize LLMs to extremely low bit-widths, achieving significant compression and accuracy improvements with minimal additional computation.

Contribution

VPTQ is the first to apply second-order optimization to vector quantization for ultra low-bit LLM quantization, improving accuracy and efficiency over existing methods.

Findings

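01. Reduces quantized-model perplexity by 0.01-0.34 on LLaMA-2 relative to prior state-of-the-art methods at 2-bit

02. Reduces perplexity by 4.41-7.34 on LLaMA-3 under the same comparison

03. Increases inference throughput by 1.6-1.8 times compared with prior state-of-the-art methods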

Abstract

Scaling model size significantly challenges the deployment and inference of Large Language Models (LLMs). Due to the redundancy in LLM weights, recent research has focused on pushing weight-only quantization to extremely low-bit (even down to 2 bits). It reduces memory requirements, optimizes storage costs, and decreases memory bandwidth needs during inference. However, due to numerical representation limitations, traditional scalar-based weight quantization struggles to achieve such extreme low-bit. Recent research on Vector Quantization (VQ) for LLMs has demonstrated the potential for extremely low-bit model quantization by compressing vectors into indices using lookup tables. In this paper, we introduce Vector Post-Training Quantization (VPTQ) for extremely low-bit quantization of LLMs. We use Second-Order Optimization to formulate the LLM VQ problem and guide our quantization…
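To make "compressing vectors into indices using lookup tables" concrete, here is a minimal sketch of generic vector quantization. It is not the VPTQ algorithm: it uses plain k-means (Lloyd's algorithm) with no second-order information, and every function name and parameter (`quantize`, `vec_len`, `k`, and so on) is a hypothetical choice made for illustration.

```python
import math
import random


def nearest(v, centroids):
    """Index of the centroid with the smallest squared distance to v."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
    )


def kmeans_codebook(vectors, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns k centroids (the lookup table)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        # Assignment step: nearest centroid for every vector.
        assign = [nearest(v, centroids) for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def quantize(weights, vec_len, k):
    """Split a flat weight list into vectors and replace each by an index."""
    vectors = [weights[i:i + vec_len] for i in range(0, len(weights), vec_len)]
    codebook = kmeans_codebook(vectors, k)
    indices = [nearest(v, codebook) for v in vectors]
    return codebook, indices


def dequantize(codebook, indices):
    """Rebuild approximate weights by looking each index up in the codebook."""
    return [x for i in indices for x in codebook[i]]


def index_bits_per_weight(vec_len, k):
    """Index cost only; the codebook itself adds a fixed storage overhead."""
    return math.log2(k) / vec_len


# Usage: 512 synthetic weights, vectors of length 8, a 16-entry codebook.
rng = random.Random(1)
weights = [rng.gauss(0.0, 1.0) for _ in range(8 * 64)]
codebook, indices = quantize(weights, vec_len=8, k=16)
restored = dequantize(codebook, indices)
print(index_bits_per_weight(8, 16))  # 0.5
```

The arithmetic in `index_bits_per_weight` shows why vector quantization can go below what scalar quantization allows: one index of log2(k) bits is shared by all `vec_len` weights in a vector, so the per-weight index cost can drop under 1 bit, with accuracy then depending on how well the codebook fits the weight vectors.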

Code & Models

Repositories

microsoft/vptq (PyTorch, official)

Videos

VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques