MatGPTQ: Accurate and Efficient Post-Training Matryoshka Quantization
Maximilian Kleinegger, Elvir Crnčević, Dan Alistarh

TL;DR
MatGPTQ introduces a novel post-training quantization method that efficiently produces a multi-precision, sliceable model from a single calibration set, enabling flexible deployment across various memory and latency constraints.
Contribution
It presents the first practical, open-source PTQ pipeline for Matryoshka quantization, optimizing a single model for multiple precisions with a novel multi-precision objective and efficient kernels.
Findings
Preserves high-bit accuracy across models.
Significantly improves low-bit performance.
Enables practical multi-precision deployment from one checkpoint.
Abstract
Matryoshka Quantization (MatQuant) is a recent quantization approach showing that a single integer-quantized model can be served across multiple precisions by slicing the most significant bits (MSBs) at inference time. This enables a single checkpoint to cover a wide range of memory and latency budgets, but makes quantization much more challenging. In particular, the original MatQuant relies on expensive quantization-aware training (QAT) variants rather than fast one-shot post-training quantization (PTQ), and lacks open-source and kernel support. We address all of these limitations by introducing Post-Training Matryoshka Quantization (MatGPTQ), a new PTQ pipeline that produces a single parent model jointly optimized for multiple target precisions in one shot, based on a small calibration set. MatGPTQ casts Matryoshka quantization as a multi-precision objective with bit-slicing and…
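To make the MSB-slicing idea in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names `slice_msb` and `dequantize`, the unsigned 8-bit parent format, the use of plain truncation (a right shift) rather than rounding, and the shared scale and zero point are all assumptions chosen for illustration.

```python
def slice_msb(code: int, parent_bits: int, target_bits: int) -> int:
    """Keep the `target_bits` most significant bits of an unsigned parent code.

    Assumption: slicing is plain truncation (right shift). A real pipeline
    may round or clamp instead.
    """
    if not 0 < target_bits <= parent_bits:
        raise ValueError("target_bits must be in 1..parent_bits")
    if not 0 <= code < (1 << parent_bits):
        raise ValueError("code does not fit in parent_bits")
    return code >> (parent_bits - target_bits)


def dequantize(sliced: int, parent_bits: int, target_bits: int,
               scale: float, zero_point: int) -> float:
    """Map a sliced code back to a real value.

    Assumption: the sliced code is shifted back onto the parent grid so the
    parent's scale and zero point can be reused unchanged.
    """
    shift = parent_bits - target_bits
    return scale * ((sliced << shift) - zero_point)


# Hypothetical 8-bit parent codes for a handful of weights.
parent_codes = [0, 37, 128, 200, 255]

# The same stored codes served at 4-bit and 2-bit precision.
int4_codes = [slice_msb(c, 8, 4) for c in parent_codes]
int2_codes = [slice_msb(c, 8, 2) for c in parent_codes]

print(int4_codes)  # [0, 2, 8, 12, 15]
print(int2_codes)  # [0, 0, 2, 3, 3]
```

Under these assumptions, one stored 8-bit tensor yields every lower precision on demand. The low-bit slices drop information (200 and 255 both map to 3 at 2 bits), which is why the parent model has to be optimized for all target precisions jointly rather than for 8 bits alone.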
Code & Models
- 🤗 ISTA-DASLab/Llama-3.1-8B-Instruct-MatGPTQ (model, 1 download)
- 🤗 ISTA-DASLab/Llama-3.1-8B-MatGPTQ (model, 3 downloads)
- 🤗 ISTA-DASLab/Qwen3-8B-Base-MatGPTQ (model, 1 download, 1 like)
- 🤗 ISTA-DASLab/Qwen3-8B-MatGPTQ (model, 4 downloads)
- 🤗 ISTA-DASLab/Qwen3-14B-MatGPTQ (model, 4 downloads)
- 🤗 ISTA-DASLab/Phi-3-medium-128k-instruct-MatGPTQ (model, 2 downloads)
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Data Compression Techniques · Parallel Computing and Optimization Techniques
