TL;DR
GSQ is a Gumbel-Softmax-based scalar quantization method that improves accuracy for low-bit (2-3 bits per parameter) LLM deployment, narrowing the gap to more complex vector quantization techniques.
Contribution
The paper presents GSQ, a scalable, high-accuracy scalar quantization approach using Gumbel-Softmax relaxation, applicable to large models and compatible with existing inference kernels.
Findings
At 2-3 bits per parameter, GSQ closes most of the accuracy gap between scalar quantization and vector-quantized methods.
GSQ improves accuracy on Llama models and GGUF checkpoints.
GSQ scales effectively to trillion-scale Mixture-of-Experts models.
Abstract
Quantization has become a standard tool for efficient LLM deployment, especially for local inference, where models are now routinely served at 2-3 bits per parameter. The state of the art is currently split into simple scalar quantization techniques, such as GPTQ or AWQ, which are widely deployed but plateau in accuracy at 3-4 bits per parameter (bpp), and "second-generation" vector- or trellis-quantized methods, such as QTIP, GPTVQ and AQLM, which push the accuracy frontier but are notoriously hard to implement and to scale. In this paper, we ask whether this gap is fundamental, or whether a carefully optimized quantizer can recover most of it. We answer in the affirmative, by introducing GSQ (Gumbel-Softmax Quantization), a post-training scalar quantization method which jointly learns the per-coordinate grid assignments and the per-group scales using a Gumbel-Softmax…
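The abstract is cut off before it describes the method in detail, so the following is only a minimal sketch of the general idea it names: relaxing each weight's discrete grid assignment with Gumbel-Softmax while sharing one scale per group. It is not the authors' implementation. The 2-bit grid, the distance-based logit initialisation (`init_logits`), the temperature, and all function names are assumptions made for illustration. Only the forward pass is shown; in practice the logits and the scale would be optimized by gradient descent in an autodiff framework.

```python
import math
import random

def gumbel_softmax(logits, tau, rng):
    """Return a relaxed one-hot sample over grid points at temperature tau."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    m = max(noisy)  # subtract the max for a numerically stable softmax
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def init_logits(w, scale, grid, sharpness=10.0):
    """Hypothetical initialisation: prefer grid points close to w / scale."""
    return [-sharpness * (w / scale - g) ** 2 for g in grid]

def soft_dequantize(logits_per_weight, scale, grid, tau, rng):
    """Training-time view: each weight is a soft mixture of grid points."""
    out = []
    for logits in logits_per_weight:
        probs = gumbel_softmax(logits, tau, rng)
        out.append(scale * sum(p * g for p, g in zip(probs, grid)))
    return out

def hard_dequantize(logits_per_weight, scale, grid):
    """Deployment-time view: each weight snaps to its most likely grid point."""
    out = []
    for logits in logits_per_weight:
        k = max(range(len(grid)), key=lambda i: logits[i])
        out.append(scale * grid[k])
    return out

rng = random.Random(0)
grid = [-2.0, -1.0, 0.0, 1.0]                # assumed 2-bit signed grid
weights = [0.31, -0.52, 0.05, -0.11]         # one toy group of weights
scale = max(abs(w) for w in weights) / 2.0   # assumed per-group scale init
logits = [init_logits(w, scale, grid) for w in weights]

soft = soft_dequantize(logits, scale, grid, tau=0.5, rng=rng)
hard = hard_dequantize(logits, scale, grid)
```

Because the hard result is an ordinary scalar grid index plus a per-group scale, a model quantized this way keeps the same storage layout as other scalar formats, which is consistent with the compatibility claim in the summary above.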
Code & Models
- 🤗 ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ · model · 247 dl
- 🤗 ISTA-DASLab/Llama-3.1-70B-Instruct-3Bit-GSQ · model · 138 dl
- 🤗 ISTA-DASLab/Kimi-K2.5-2Bit-GSQ · model · 126 dl
- 🤗 ISTA-DASLab/Kimi-K2.6-2Bit-GSQ · model · 103 dl
- 🤗 ISTA-DASLab/Qwen3-4B-GGUF-GSQ · model · 180 dl
- 🤗 ISTA-DASLab/Qwen3-8B-GGUF-GSQ · model · 211 dl · ♡ 1
- 🤗 ISTA-DASLab/Qwen3.6-35B-A3B-2Bit-GSQ · model · 1.8k dl
- 🤗 ISTA-DASLab/Qwen3.5-4B-GGUF-GSQ · model · 352 dl
- 🤗 mgoin/Qwen3.6-35B-A3B-2Bit-GSQ-ct · model · 19 dl
