MatGPTQ: Accurate and Efficient Post-Training Matryoshka Quantization

Maximilian Kleinegger; Elvir Crnčević; Dan Alistarh

arXiv:2602.03537 · cs.LG · February 4, 2026

TL;DR

MatGPTQ introduces a novel post-training quantization method that efficiently produces a multi-precision, sliceable model from a single calibration set, enabling flexible deployment across various memory and latency constraints.

Contribution

The paper presents the first practical, open-source PTQ pipeline for Matryoshka quantization. It optimizes a single model for multiple precisions using a novel multi-precision objective, and it provides efficient kernels.

Findings

1. Preserves high-bit accuracy across models.
2. Significantly improves low-bit performance.
3. Enables practical multi-precision deployment from one checkpoint.

Abstract

Matryoshka Quantization (MatQuant) is a recent quantization approach showing that a single integer-quantized model can be served at multiple precisions by slicing the most significant bits (MSBs) at inference time. This lets a single checkpoint cover a wide range of memory and latency budgets, but it makes quantization much more challenging. In particular, the original MatQuant relies on expensive quantization-aware training (QAT) variants rather than fast one-shot post-training quantization (PTQ), and it lacks open-source and kernel support. We address all of these limitations by introducing Post-Training Matryoshka Quantization (MatGPTQ), a new PTQ pipeline that produces a single parent model jointly optimized for multiple target precisions in one shot, based on a small calibration set. MatGPTQ casts Matryoshka quantization as a multi-precision objective with bit-slicing and…
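
The MSB slicing idea mentioned in the abstract can be illustrated with a small NumPy sketch. This is not the paper's implementation or its kernels: the function names `slice_msb` and `dequantize` are hypothetical, the codes are assumed to be unsigned 8-bit integers with a single per-tensor scale and zero point, and slicing is done by plain truncation (a right shift), which may differ from the rounding used in MatQuant or MatGPTQ.

```python
import numpy as np


def slice_msb(codes: np.ndarray, parent_bits: int, target_bits: int) -> np.ndarray:
    """Keep only the top `target_bits` bits of unsigned `parent_bits`-bit codes.

    Hypothetical helper: plain truncation via right shift, no rounding.
    """
    if not 0 < target_bits <= parent_bits:
        raise ValueError("target_bits must be in the range 1..parent_bits")
    shift = parent_bits - target_bits
    return codes >> shift


def dequantize(sliced: np.ndarray, parent_bits: int, target_bits: int,
               scale: float, zero_point: int) -> np.ndarray:
    """Map sliced codes back to real values on the parent quantization grid.

    Assumes (for illustration only) that the parent model uses
    w = scale * (code - zero_point) with a per-tensor scale and zero point.
    """
    shift = parent_bits - target_bits
    # Shift the sliced code back to the parent's bit positions; the dropped
    # low bits are treated as zero.
    restored = sliced.astype(np.float32) * (1 << shift)
    return scale * (restored - zero_point)


# One stored 8-bit "parent" tensor serves three precisions.
parent = np.array([0, 37, 128, 255], dtype=np.uint8)
int4 = slice_msb(parent, parent_bits=8, target_bits=4)
int2 = slice_msb(parent, parent_bits=8, target_bits=2)
approx_int4 = dequantize(int4, 8, 4, scale=0.1, zero_point=128)
```

In this sketch the 4-bit and 2-bit codes are derived from the same stored tensor, so only one checkpoint has to be kept. The error introduced by dropping the low bits is what a multi-precision objective would need to account for when choosing the parent codes.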

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Data Compression Techniques · Parallel Computing and Optimization Techniques