GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling

Alireza Dadgarnia; Soroush Tabesh; Mahdi Nikdan; Michael Helcig; Eldar Kurtic; Maximilian Kleinegger; Dan Alistarh

arXiv:2604.18556·cs.CL·May 18, 2026

TL;DR

GSQ is a Gumbel-Softmax-based scalar quantization method that improves accuracy for low-bit LLM deployment and narrows the gap to more complex vector quantization techniques.

Contribution

The paper presents GSQ, a scalar quantization approach based on a Gumbel-Softmax relaxation. It is designed to be accurate at low bit widths, to scale to large models, and to remain compatible with existing inference kernels.

Findings

01. GSQ closes most of the accuracy gap to vector-quantized methods at 2-3 bits.

02. GSQ improves accuracy on Llama models and GGUF checkpoints.

03. GSQ scales effectively to trillion-scale Mixture-of-Experts models.

Abstract

Quantization has become a standard tool for efficient LLM deployment, especially for local inference, where models are now routinely served at 2-3 bits per parameter. The state of the art is currently split into simple scalar quantization techniques, such as GPTQ or AWQ, which are widely deployed but plateau in accuracy at 3-4 bits per parameter (bpp), and "second-generation" vector- or trellis-quantized methods, such as QTIP, GPTVQ and AQLM, which push the accuracy frontier but are notoriously hard to implement and to scale. In this paper, we ask whether this gap is fundamental, or whether a carefully optimized scalar quantizer can recover most of it. We answer in the affirmative, by introducing GSQ (Gumbel-Softmax Quantization), a post-training scalar quantization method which jointly learns the per-coordinate grid assignments and the per-group scales using a Gumbel-Softmax…
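To make the idea of a Gumbel-Softmax relaxation over grid assignments concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the 2-bit grid, the single group scale, the distance-based logit initialization, the temperature, and all function names are assumptions chosen for illustration. It shows only the forward pass. In an actual training setup the logits and the scale would be optimized by gradient descent with an autodiff framework.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Draw one relaxed one-hot sample over the grid levels."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    # Numerically stable softmax over the perturbed, tempered logits.
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def init_logits(weight, grid, scale):
    """Hypothetical initialization: prefer grid points close to weight / scale."""
    return [-((weight / scale - q) ** 2) for q in grid]


def soft_quantize(logits_per_weight, grid, scale, tau, rng):
    """Relaxed quantized weights: scale times the expected grid value per sample."""
    out = []
    for logits in logits_per_weight:
        probs = gumbel_softmax(logits, tau, rng)
        out.append(scale * sum(p * q for p, q in zip(probs, grid)))
    return out


def hard_quantize(logits_per_weight, grid, scale):
    """Final discrete weights: pick the highest-logit grid point for each weight."""
    out = []
    for logits in logits_per_weight:
        best = max(range(len(grid)), key=lambda k: logits[k])
        out.append(scale * grid[best])
    return out


# Assumed toy setup: one group of three weights, a 2-bit integer grid, one scale.
grid = [-2, -1, 0, 1]
scale = 0.5
weights = [0.45, -0.8, 0.1]
logits = [init_logits(w, grid, scale) for w in weights]

rng = random.Random(0)
soft = soft_quantize(logits, grid, scale, tau=0.5, rng=rng)
hard = hard_quantize(logits, grid, scale)
print(soft)  # stochastic, but always inside [scale * min(grid), scale * max(grid)]
print(hard)  # [0.5, -1.0, 0.0]
```

The relaxed output is a convex combination of grid points, so it is differentiable with respect to both the logits and the scale. Lowering `tau` pushes each sample toward a one-hot assignment, which is why the discrete `hard_quantize` result can be used once optimization ends.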

Code & Models

Repositories

IST-DASLab/GSQ (GitHub)