PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression
Vladimir Malinovskii, Denis Mazur, Ivan Ilin, Denis Kuznedelev, Konstantin Burlachenko, Kai Yi, Dan Alistarh, Peter Richtarik

TL;DR
This paper introduces PV-Tuning, a novel fine-tuning framework for extreme quantization of large language models, surpassing prior methods in accuracy and efficiency, especially at 1-2 bits per parameter.
Contribution
PV-Tuning is a representation-agnostic fine-tuning framework that generalizes and improves on existing fine-tuning strategies for compressed weights. It comes with convergence guarantees in restricted cases and achieves Pareto-optimal quantization of Llama 2 models at 2 bits per parameter.
Findings
- PV-Tuning outperforms prior quantization techniques on Llama and Mistral models.
- It achieves Pareto-optimal 2-bit quantization for Llama 2 models.
- It demonstrates the limitations of straight-through estimators in extreme LLM compression.
Abstract
There has been significant interest in "extreme" compression of large language models (LLMs), i.e., to 1-2 bits per parameter, which allows such models to be executed efficiently on resource-constrained devices. Existing work focused on improved one-shot quantization techniques and weight representations; yet, purely post-training approaches are reaching diminishing returns in terms of the accuracy-vs-bit-width trade-off. State-of-the-art quantization methods such as QuIP# and AQLM include fine-tuning (part of) the compressed parameters over a limited amount of calibration data; however, such fine-tuning techniques over compressed weights often make exclusive use of straight-through estimators (STE), whose performance is not well-understood in this setting. In this work, we question the use of STE for extreme LLM compression, showing that it can be sub-optimal, and perform a systematic…
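To make the term concrete, here is a toy sketch of a straight-through estimator (STE) update. It is an illustration only, not the paper's method or code: the three-value grid, the quadratic loss, and all names (`GRID`, `quantize`, `loss_grad`, `ste_step`) are assumptions made for this example. The forward pass uses the quantized weights, while the gradient is applied to the latent full-precision weights as if quantization were the identity map.

```python
# Toy straight-through estimator (STE) on a made-up 3-value grid.
# Every name and value below is hypothetical and chosen for illustration.

GRID = [-1.0, 0.0, 1.0]  # tiny quantization grid (assumption)


def quantize(weights):
    """Round each latent weight to the nearest grid value."""
    return [min(GRID, key=lambda g: abs(g - w)) for w in weights]


def loss_grad(w_q, target):
    """Gradient of 0.5 * sum((w_q - target)^2) with respect to w_q."""
    return [q - t for q, t in zip(w_q, target)]


def ste_step(latent, target, lr):
    """One STE step.

    The loss gradient is evaluated at the quantized weights, then applied
    directly to the latent weights, ignoring the fact that quantize()
    has zero gradient almost everywhere.
    """
    w_q = quantize(latent)
    grad = loss_grad(w_q, target)
    return [w - lr * g for w, g in zip(latent, grad)]


latent = [0.2, -0.7, 0.9]
target = [0.8, -0.9, 0.1]
for _ in range(20):
    latent = ste_step(latent, target, lr=0.1)

quantized = quantize(latent)
print("latent weights:   ", latent)
print("quantized weights:", quantized)
```

In this toy setting the latent weights can hover near a rounding boundary, so the quantized value may flip between neighbouring grid points from step to step. That instability is one commonly cited weakness of STE-style updates; whether it matches the specific failure modes analysed in the paper is not something this page states.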
Code & Models
- 🤗 ISTA-DASLab/Llama-2-7b-AQLM-PV-2Bit-1x16-hf (model, 10 downloads)
- 🤗 ISTA-DASLab/Llama-2-13b-AQLM-PV-2Bit-1x16-hf (model, 2 downloads)
- 🤗 ISTA-DASLab/Llama-2-70b-AQLM-PV-2Bit-1x16-hf (model)
- 🤗 ISTA-DASLab/Mistral-7B-v0.1-AQLM-PV-2Bit-1x16-hf (model, 5 downloads)
- 🤗 ISTA-DASLab/Phi-3-mini-4k-instruct-AQLM-PV-2Bit-1x16-hf (model, 5 downloads, 2 likes)
- 🤗 ISTA-DASLab/Llama-2-7b-AQLM-PV-2Bit-2x8-hf (model, 7 downloads)
- 🤗 ISTA-DASLab/Meta-Llama-3-8B-AQLM-PV-2Bit-1x16 (model, 49 downloads, 4 likes)
- 🤗 ISTA-DASLab/Meta-Llama-3-70B-AQLM-PV-2Bit-1x16 (model, 3 downloads, 4 likes)
- 🤗 TechxGenus/deepseek-coder-7b-instruct-v1.5-PV (model)
- 🤗 justheuristic/test-1bit (model, 1 download)
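The `1x16` and `2x8` suffixes in the model names above appear to follow the AQLM naming pattern of (number of codebooks) x (bits per code index). The short calculation below shows how both configurations would come out at 2 bits per weight. It assumes that each group of 8 weights shares one set of code indices, and it ignores the storage cost of the codebooks themselves; neither the group size nor the overhead is stated on this page, and `bits_per_weight` is a hypothetical helper.

```python
def bits_per_weight(num_codebooks, bits_per_code, group_size):
    """Index bits stored per weight, ignoring codebook and scale overhead.

    Each group of `group_size` weights is encoded by `num_codebooks`
    indices of `bits_per_code` bits each.
    """
    return num_codebooks * bits_per_code / group_size


# group_size=8 is an assumption, not a value given in the source.
for name, (codebooks, code_bits) in {"1x16": (1, 16), "2x8": (2, 8)}.items():
    print(name, bits_per_weight(codebooks, code_bits, group_size=8))
```

Under these assumptions the two layouts store the same number of index bits per weight and differ only in how those bits are split across codebooks.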
Taxonomy
Topics: Large language model quantization and compression
Methods: LLaMA
