VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models
Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, Mao Yang

TL;DR
This paper introduces VPTQ, a novel vector post-training quantization method using second-order optimization for extremely low-bit LLMs, achieving significant compression and accuracy improvements with minimal additional computation.
Contribution
VPTQ is the first to apply second-order optimization to vector quantization for ultra low-bit LLM quantization, improving accuracy and efficiency over existing methods.
Findings
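Reduces quantized-model perplexity by 0.01-0.34 on LLaMA-2 relative to the prior state of the art at 2 bits
Reduces quantized-model perplexity by 4.41-7.34 on LLaMA-3 relative to the prior state of the art at 2 bits
Increases inference throughput by 1.6-1.8 times compared to the prior state of the art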
Abstract
Scaling model size significantly challenges the deployment and inference of Large Language Models (LLMs). Due to the redundancy in LLM weights, recent research has focused on pushing weight-only quantization to extremely low-bit (even down to 2 bits). It reduces memory requirements, optimizes storage costs, and decreases memory bandwidth needs during inference. However, due to numerical representation limitations, traditional scalar-based weight quantization struggles to achieve such extreme low-bit. Recent research on Vector Quantization (VQ) for LLMs has demonstrated the potential for extremely low-bit model quantization by compressing vectors into indices using lookup tables. In this paper, we introduce Vector Post-Training Quantization (VPTQ) for extremely low-bit quantization of LLMs. We use Second-Order Optimization to formulate the LLM VQ problem and guide our quantization…
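The lookup-table idea mentioned in the abstract can be illustrated with a minimal sketch. It uses plain k-means (Lloyd's algorithm) to build the codebook, which is an assumption made for illustration only; it is not VPTQ's second-order optimization, and the function names (`fit_codebook`, `quantize_vectors`, `dequantize`) are hypothetical. Weights are split into short vectors, each vector is stored as the index of its nearest codebook entry, and dequantization is a table lookup.

```python
import math
import random


def split_vectors(weights, vector_len):
    """Split a flat list of weights into consecutive vectors of length vector_len."""
    assert len(weights) % vector_len == 0, "weight count must divide evenly"
    return [weights[i:i + vector_len] for i in range(0, len(weights), vector_len)]


def quantize_vectors(weights, vector_len, codebook):
    """Replace each weight vector by the index of its nearest codebook entry."""
    indices = []
    for vec in split_vectors(weights, vector_len):
        dists = [sum((a - b) ** 2 for a, b in zip(vec, c)) for c in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def dequantize(indices, codebook):
    """Rebuild approximate weights by looking each index up in the codebook."""
    restored = []
    for idx in indices:
        restored.extend(codebook[idx])
    return restored


def fit_codebook(weights, vector_len, k, iters=10, seed=0):
    """Plain k-means codebook (illustrative stand-in, not the VPTQ algorithm)."""
    rng = random.Random(seed)
    vectors = split_vectors(weights, vector_len)
    codebook = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        assign = quantize_vectors(weights, vector_len, codebook)
        for j in range(k):
            members = [v for v, a in zip(vectors, assign) if a == j]
            if members:  # keep the old centroid if no vector was assigned to it
                codebook[j] = [sum(col) / len(members) for col in zip(*members)]
    return codebook


# Toy example: 64 random weights, vectors of length 4, 4 codebook entries.
rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(64)]
vector_len, k = 4, 4
codebook = fit_codebook(weights, vector_len, k)
indices = quantize_vectors(weights, vector_len, codebook)
restored = dequantize(indices, codebook)

# Each index costs log2(k) bits and covers vector_len weights.
bits_per_weight = math.log2(k) / vector_len
print(f"{len(indices)} indices, {bits_per_weight} index bits per weight")
```

Because one index covers several weights, the per-weight cost can drop below what a scalar quantizer can represent; the codebook itself adds storage that this sketch does not count.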
Code & Models
- 🤗 VPTQ-community/Meta-Llama-3.1-70B-Instruct-v16-k65536-32768-woft · model · 1 dl
- 🤗 VPTQ-community/Meta-Llama-3.1-8B-Instruct-v8-k65536-65536-woft · model · 33 dl
- 🤗 VPTQ-community/Meta-Llama-3.1-8B-Instruct-v8-k65536-4096-woft · model · 7 dl
- 🤗 VPTQ-community/Meta-Llama-3.1-8B-Instruct-v8-k65536-256-woft · model · 27 dl · ♡ 1
- 🤗 VPTQ-community/Qwen2.5-72B-Instruct-v16-k65536-65536-woft · model · 17 dl · ♡ 4
- 🤗 VPTQ-community/Meta-Llama-3.1-70B-Instruct-v16-k65536-65536-woft · model · 2 dl
- 🤗 VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-256-woft · model · 29 dl · ♡ 1
- 🤗 VPTQ-community/Qwen2.5-7B-Instruct-v8-k65536-256-woft · model · 3 dl
- 🤗 VPTQ-community/Qwen2.5-72B-Instruct-v16-k65536-32768-woft · model · ♡ 3
- 🤗 VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k32768-0-woft · model · 6 dl · ♡ 1
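The model names above appear to encode the quantization configuration. The sketch below assumes, without confirmation from this page, that `vN` is the vector length, the first number after `k` is the main codebook size, and the second is the residual codebook size (0 meaning no residual codebook). Under that assumption, the index cost per weight is roughly (log2(main) + log2(residual)) / vector length. The helper name `estimate_index_bits` is hypothetical, and codebook storage overhead is ignored.

```python
import math
import re

# Assumed naming pattern: ...-v<vector_len>-k<codebook>-<residual_codebook>-woft
NAME_PATTERN = re.compile(r"-v(\d+)-k(\d+)-(\d+)-woft$")


def estimate_index_bits(model_name):
    """Estimate index bits per weight from a model name (assumed field meanings)."""
    match = NAME_PATTERN.search(model_name)
    if match is None:
        raise ValueError(f"unrecognized model name: {model_name}")
    vector_len, main_k, residual_k = (int(g) for g in match.groups())
    bits = math.log2(main_k)
    if residual_k > 0:  # 0 is read as "no residual codebook"
        bits += math.log2(residual_k)
    return bits / vector_len


for name in [
    "VPTQ-community/Meta-Llama-3.1-8B-Instruct-v8-k65536-256-woft",
    "VPTQ-community/Qwen2.5-72B-Instruct-v16-k65536-65536-woft",
    "VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k32768-0-woft",
]:
    print(name, estimate_index_bits(name))
```

If the assumed reading of the name fields is wrong, the estimates are wrong as well; treat them as a rough guide to the relative compression of the listed checkpoints.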
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
