Beacon: Post-Training Quantization with Integrated Grid Selection
Shihao Zhang, Rayan Saab

TL;DR
Beacon introduces a straightforward, tuning-free post-training quantization method that automatically determines optimal scaling factors, enabling efficient and competitive model compression without extensive calibration or heuristic tuning.
Contribution
It presents a novel algorithm for per-channel PTQ that eliminates manual tuning by leveraging the geometry of scalar quantization, simplifying the quantization process.
Findings
Achieves competitive accuracy with state-of-the-art methods
Does not require back-propagation or large calibration sets
Simplifies post-training quantization process
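Table 1. Top-1 accuracy (%) of DeiT-B quantized by Beacon without error correction (w/o E.C.), before / after LN tuning. K is the number of loops.

| Bit width (K) | Beacon w/o E.C. (before / after LN) |
|---|---|
| 1.58-bit (K=6) | 67.49 / 72.82 |
| 2-bit (K=6) | 75.01 / 77.19 |
| 3-bit (K=5) | 80.53 / 80.69 |
| 4-bit (K=4) | 81.35 / 81.40 |
| runtime | |

Table 2. Top-1 accuracy drop (%) on DeiT-B.

| Method | 2-bit | 3-bit | 4-bit |
|---|---|---|---|
| GPTQ (before / after LN) | 20.48 / 15.15 | 1.81 / 1.56 | 0.42 / 0.32 |
| COMQ | 4.85 | 1.52 | 0.59 |
| Beacon | 4.55 | 1.05 | 0.34 |

Table 3. Top-1 accuracy (%) of DeiT-III-L with LN tuning for all methods.

| Bit width | w/o E.C. | w/ E.C. | GPTQ | GPTQ* |
|---|---|---|---|---|
| 2-bit | 80.67 | 80.76 | 78.10 | 78.28 |
| 1.58-bit | 77.07 | 77.32 | 46.25 | 72.74 |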
Abstract.
Quantization is a widely used compression technique for reducing the memory and computation costs of large pre-trained models. A key challenge in per-channel post-training quantization (PTQ) is selecting appropriate scaling factors to replace weight values with values from a scaled integer grid. Existing methods typically fix the scale at the outset via heuristic tuning or grid search. We propose Beacon, a simple and effective algorithm that eliminates the need for such manual tuning. Beacon performs per-channel PTQ directly using an unscaled grid and automatically determines the optimal scaling factors by exploiting the geometry of scalar quantization. It does not rely on back-propagation or large calibration sets. Despite its simplicity and tuning-free nature, Beacon achieves competitive performance compared to state-of-the-art methods, making it a practical solution for efficient model deployment.
This work was partially supported by NSF grant DMS-2410717.
Shihao Zhang is with the Department of Mathematics, UC San Diego. Rayan Saab is with the Department of Mathematics and HDSI, UC San Diego.
1. Introduction
Recent deep neural networks, most notably large language models (LLMs), have massive computational and memory demands. Quantization [7, 11, 17] has emerged as a mainstream compression technique [3, 5, 25] for deploying LLMs on resource-constrained devices and in similarly limited settings. By reducing the number of bits, or bit width, used to represent weights or activations, quantization lowers storage requirements, memory bandwidth, and computation costs. Post-training quantization (PTQ) [6, 24, 2, 23] is a particularly attractive approach for its simplicity. It avoids backpropagation and typically adapts pre-trained models in a single pass, or in very few passes, over a small calibration data set. It therefore incurs far less computational overhead than quantization-aware training (QAT) [10, 19, 27], where quantized models are trained directly with gradient-based methods.
In practice, PTQ is most commonly implemented with scalar quantization, where each weight is ultimately represented by a value from a finite scalar alphabet, such as a uniform grid. While modern scalar PTQ algorithms such as GPTQ [6], GPFQ [14, 24], and Qronos [26] incorporate dependencies across weights during the quantization process, the quantization function itself still acts coordinatewise, mapping real values to scalar representatives. Scalar methods have become the industry standard and perform reliably at bit widths of 4 or more. However, pushing them into the ultra-low bit regime ( bits) remains difficult, even with transformation-based enhancements [20, 12, 1, 13]. An alternative is vector quantization (VQ), where groups of weights are jointly mapped to codewords from a vector codebook rather than to a scalar alphabet. VQ offers accuracy gains at ultra-low precision but at the expense of higher complexity and reduced deployment efficiency. In contrast, our method remains within the scalar quantization paradigm, with its minimal inference overhead, while achieving strong accuracy improvements at ultra-low bit widths.
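Notation. We denote the weight matrix of a layer by $W \in \mathbb{R}^{N \times N'}$, where each of the $N'$ columns is an $N$-dimensional channel $w \in \mathbb{R}^N$. Given a vector $v$, we use $v_t$ for its $t$-th entry, $v_{\le t}$ for the subvector $(v_1, \dots, v_t)^\top$, and we define $v_{<t}$ analogously. $\|v\|$ is the Euclidean norm of $v$. Given a matrix $X$, we use $X_t$ to denote its $t$-th column. We use $X_{\le t}$ to denote the submatrix $(X_1, \dots, X_t)$.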
The standard -bit integer grid is , and weight quantization uses its scaled and shifted version where is the scaling factor and is the offset (zero point). In asymmetric quantization, is typically defined per channel on a scaled min–max grid, The associated round-to-nearest (RTN) operator is
[TABLE]
where .
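As a concrete picture of the min-max baseline described above, the following is a minimal sketch of asymmetric round-to-nearest on a per-channel min-max grid. It assumes one common convention (integer codes in {0, ..., 2^b - 1}, an integer zero point, and dequantization as scale times (code minus zero point)), which may differ in detail from the definition used in this paper; the function name `rtn_minmax` is illustrative.

```python
import numpy as np

def rtn_minmax(w, bits):
    """Round-to-nearest onto a per-channel min-max grid (one common convention,
    assumed here for illustration; not necessarily the paper's exact operator)."""
    levels = 2 ** bits - 1                       # largest integer code
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / levels if hi > lo else 1.0
    zero = np.round(-lo / scale)                 # integer zero point
    codes = np.clip(np.round(w / scale) + zero, 0, levels)
    w_hat = scale * (codes - zero)               # dequantized channel
    return w_hat, codes.astype(int), scale, zero

w = np.array([-0.9, -0.2, 0.1, 0.5, 1.5])        # one toy channel
w_hat, codes, scale, zero = rtn_minmax(w, bits=2)
max_err = float(np.max(np.abs(w - w_hat)))
```

In this convention the scale is fixed from the channel's range before any rounding happens, which is exactly the kind of up-front choice that Beacon is designed to avoid.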
This paper considers asymmetric per-channel PTQ. For each channel (column) of , we fix its unscaled integer grid where we follow the standard choice of zero point
[TABLE]
throughout this paper. Each channel is quantized using a scaling factor . Collecting these gives a vector of per-channel scaling factors for .
Related Work. In per-channel PTQ, accurate scaling is critical for preserving model quality, particularly at ultra-low bit widths. Most scalar PTQ rounding methods [24, 6, 2] determine by tuning and once at initialization and keeping them fixed thereafter. A recent exception is the method of Zhang et al. [22], which updates during its iterations but is highly sensitive to the initial choices of and . Other approaches attempt to refine scaling through trial-and-error, via grid search over a finite range [26, 21], or through simple heuristics, such as line search over the mean squared error between full-precision and quantized weights [1] or pre-activations [8].
Contributions. To our knowledge, no backpropagation-free algorithm currently performs per-channel quantization while automatically determining scaling factors; existing methods all require a separate selection step for the scaling factors. In this work, we propose Beacon, an algorithm that carries out per-channel PTQ directly on the unscaled $b$-bit grid and determines the optimal scaling at the end by exploiting the geometry of scalar quantization. Like other state-of-the-art PTQ methods, Beacon requires only a small calibration set and avoids backpropagation.
2. Preliminaries on Asymmetric Per-channel Quantization
Given a calibration matrix $X \in \mathbb{R}^{m \times N}$, a PTQ algorithm typically seeks a quantized weight matrix that minimizes the layer-wise reconstruction error. Specifically, in per-channel quantization, each column of $W$ is associated with its own scaling factor, so that each column of the quantized matrix $Q$ is drawn from an integer grid $\mathcal{A}^N$ and rescaled by a diagonal matrix of scales $C = \mathrm{diag}(c_1, \dots, c_{N'})$. Thus the goal is to solve
$$\min_{C,\; Q \in \mathcal{A}^{N \times N'}} \|XW - XQC\|_F^2. \tag{2}$$
Here we slightly abuse our notation by using the same symbol $\mathcal{A}$ for all columns, when each column has its own zero point given by (1). Because the Frobenius norm decomposes as a sum of squared column errors, the problem separates across channels, which can be handled in parallel. Thus, it suffices to study a single column $w$, leading to the per-channel objective
$$\min_{c \in \mathbb{R},\; q \in \mathcal{A}^N} \|Xw - cXq\|^2. \tag{3}$$
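The separation across channels can be checked numerically. The sketch below assumes the layer-wise objective has the form $\|XW - XQC\|_F^2$ with $C$ diagonal (an assumed reading of the objective above) and uses random stand-in data; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, N, Np = 32, 8, 3                                   # samples, input dim, channels
X = rng.standard_normal((m, N))                       # stand-in calibration matrix
W = rng.standard_normal((N, Np))                      # stand-in weights
Q = rng.integers(-1, 3, size=(N, Np)).astype(float)   # arbitrary integer codes
c = rng.uniform(0.1, 1.0, size=Np)                    # one scale per channel

# Layer-wise error with a diagonal matrix of per-channel scales.
full = np.linalg.norm(X @ W - X @ Q @ np.diag(c), "fro") ** 2

# The same error accumulated one channel (column) at a time.
per_channel = sum(
    np.linalg.norm(X @ W[:, j] - c[j] * (X @ Q[:, j])) ** 2 for j in range(Np)
)
```

Because the two quantities agree, each channel can be quantized independently and in parallel.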
Although this problem is NP-hard in general [9], the subproblem of optimizing $c$ for a fixed $q$ admits a closed-form solution.
Proposition 2.1.
For any $q \in \mathcal{A}^N$ with $Xq \neq 0$, the optimal $c$ associated with the least squares objective in (3) is
$$c = \frac{\langle Xw, Xq \rangle}{\|Xq\|^2}. \tag{4}$$
Proof.
Equation (3) is a least squares problem in the one-dimensional real variable $c$ when $q$ is fixed. Taking the derivative with respect to $c$, the optimality condition is $\langle Xq, Xw - cXq \rangle = 0$, which implies the result. ∎
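The closed-form scale can be sanity-checked against a generic least squares solver. This is a minimal sketch on random stand-in data; the helper name `optimal_scale` and the particular choice of `q` are illustrative only.

```python
import numpy as np

def optimal_scale(X, w, q):
    """Closed-form minimizer over c of ||Xw - c Xq||^2 for a fixed nonzero Xq."""
    Xw, Xq = X @ w, X @ q
    return float(Xw @ Xq) / float(Xq @ Xq)

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 6))
w = rng.standard_normal(6)
q = np.array([1.0, -1.0, 0.0, 2.0, 1.0, -2.0])   # an arbitrary integer vector

c_star = optimal_scale(X, w, q)

# Reference: solve the same one-variable problem with a generic solver.
c_lstsq = float(np.linalg.lstsq((X @ q)[:, None], X @ w, rcond=None)[0][0])

# Optimality condition: the residual is orthogonal to Xq.
residual = X @ w - c_star * (X @ q)
orthogonality = float(residual @ (X @ q))
```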
Corollary 2.2.
The global optimizer to (3) must satisfy the fixed point equation
[TABLE]
3. Beacon: PTQ with Automatic Per-channel Scaling
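By Proposition 2.1, the optimal scaling constant for a chosen $q$ is uniquely given by (4). Substituting this into (3) eliminates $c$ and yields the equivalent objective
$$\min_{q \in \mathcal{A}^N} \left\| Xw - \frac{\langle Xw, Xq \rangle}{\|Xq\|^2}\, Xq \right\|^2.$$
Since $\frac{\langle Xw, Xq \rangle}{\|Xq\|^2}\, Xq = P_{Xq}(Xw)$, where $P_{Xq}$ is the projection onto $\mathrm{span}(Xq)$, the problem reduces to
$$\min_{q \in \mathcal{A}^N} \|Xw - P_{Xq}(Xw)\|^2 = \min_{q \in \mathcal{A}^N} \left( \|Xw\|^2 - \|P_{Xq}(Xw)\|^2 \right).$$
Since $\|Xw\|$ is fixed, the objective is further equivalent to
$$\max_{q \in \mathcal{A}^N} |\cos\angle(Xw, Xq)|,$$
where $\cos\angle(Xw, Xq) = \frac{\langle Xw, Xq \rangle}{\|Xw\|\,\|Xq\|}$. We can drop the absolute value and attempt to solve
$$\max_{q \in \mathcal{A}^N} \cos\angle(Xw, Xq).$$
This formulation reveals the geometry of scalar quantization: aligning the directions of $Xw$ and $Xq$ is all you need.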
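The equivalence between minimizing the optimally scaled residual and maximizing the (absolute) cosine can be verified by brute force on a tiny problem. The sketch below enumerates every nonzero vector on an assumed ternary grid $\{-1, 0, 1\}^4$ with random stand-in data; it is an illustration of the identity $\min_c \|Xw - cXq\|^2 = \|Xw\|^2 (1 - \cos^2\angle(Xw, Xq))$, not part of the algorithm.

```python
import itertools
import numpy as np

def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 4))
w = rng.standard_normal(4)
Xw = X @ w

grid = (-1.0, 0.0, 1.0)                     # assumed toy grid
residuals, cosines = [], []
for q in itertools.product(grid, repeat=4):
    q = np.array(q)
    if not q.any():                         # skip q = 0, where the scale is undefined
        continue
    Xq = X @ q
    c = float(Xw @ Xq) / float(Xq @ Xq)     # optimal scale for this q
    residuals.append(float(np.linalg.norm(Xw - c * Xq) ** 2))
    cosines.append(cosine(Xw, Xq))

residuals = np.array(residuals)
cosines = np.array(cosines)

# Largest deviation from the identity over all candidates.
identity_gap = float(np.max(np.abs(residuals - (Xw @ Xw) * (1 - cosines ** 2))))
```

On a grid that is symmetric about zero, $q$ and $-q$ give the same residual, which is why dropping the absolute value loses nothing in this toy setting.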
Beacon. Inspired by the greedy path-following algorithm of Lybrand and Saab [14], Beacon starts by adopting a greedy approach to sequentially assign each $q_t$ for $t = 1, \dots, N$. Suppose $q_1, \dots, q_{t-1}$ have already been chosen. The next coordinate $q_t$ is initialized by
[TABLE]
This produces an initial vector $q^{(0)}$ with objective value
[TABLE]
We then refine $q^{(0)}$ by cyclically updating each coordinate. At step $t$ of the $k$-th cycle (loop), with all other coordinates fixed, the update rule is that $q_t^{(k)}$ must belong to
[TABLE]
After the $k$-th full sweep of updates, we denote the resulting vector by $q^{(k)}$ and the corresponding objective value by
[TABLE]
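Proposition 3.1.
The sequence of objective values $e_k = \cos\angle(Xw, Xq^{(k)})$, $k \ge 0$, converges in a finite number of iterations.
Proof.
By design of the update procedure, $(e_k)_{k \ge 0}$ is a non-decreasing sequence satisfying $e_k \le 1$. By the monotone convergence theorem, it must converge. Moreover, since $q^{(k)} \in \mathcal{A}^N$, the sequence can take at most $|\mathcal{A}|^N$ distinct values. Let $S = \{k : e_{k+1} > e_k\}$; then $S$ must be a finite subset of the natural numbers. Let $k^*$ be the maximal index for which $e_{k^*+1} > e_{k^*}$; then $e_k = e_{k^*+1}$ for all $k > k^*$. ∎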
Our experiments suggest that the quality of the quantized model improves during the first few loops and then plateaus, with the best results typically reached after 4–6 loops. The final step of Beacon is to use the resulting $q$ to compute the optimal scaling factor via (4).
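The overall procedure (greedy initialization, a few cyclic refinement loops, then the closed-form scale) can be sketched as follows. This is one plausible reading rather than the authors' implementation: it assumes that every coordinate choice maximizes the cosine between $Xw$ and the current $Xq$, it uses a toy ternary grid without a zero point, and it omits column sorting and error correction. The names `beacon_channel` and `history` are hypothetical.

```python
import numpy as np

def cosine(u, v):
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return 0.0 if nu == 0 or nv == 0 else float(u @ v) / (nu * nv)

def beacon_channel(X, w, grid, n_loops=4):
    """Hedged sketch of a Beacon-style update for one channel (assumed rules)."""
    target = X @ w
    N = X.shape[1]
    q = np.zeros(N)
    partial = np.zeros(X.shape[0])           # running value of X @ q

    # Greedy pass: pick each coordinate to maximize the cosine so far.
    for t in range(N):
        scores = [cosine(target, partial + p * X[:, t]) for p in grid]
        q[t] = grid[int(np.argmax(scores))]
        partial = partial + q[t] * X[:, t]
    history = [cosine(target, partial)]

    # Cyclic refinement: revisit one coordinate at a time, others fixed.
    for _ in range(n_loops):
        for t in range(N):
            rest = partial - q[t] * X[:, t]
            scores = [cosine(target, rest + p * X[:, t]) for p in grid]
            q[t] = grid[int(np.argmax(scores))]
            partial = rest + q[t] * X[:, t]
        history.append(cosine(target, partial))

    # Scale is recovered only at the end, from the closed form.
    denom = float(partial @ partial)
    c = float(target @ partial) / denom if denom > 0 else 0.0
    return q, c, history

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 8))
w = rng.standard_normal(8)
grid = [-1.0, 0.0, 1.0]                      # assumed toy grid
q, c, history = beacon_channel(X, w, grid, n_loops=4)
```

Since the current value of each coordinate is always among the candidates, the recorded cosine values in `history` cannot decrease from one loop to the next, which mirrors the monotonicity used in Proposition 3.1.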
We also empirically observe that sorting the columns of $X$ by increasing norm in the initial assignment of $q$ (and by decreasing norm in the refinement updates) improves on the natural index order. We present an intuition behind this observation. Let and . When determining , we are selecting to maximize . If is dominated by , will be largely determined by , leaving little room for optimization. Thus, having the columns of $X$ in increasing norm order aligns better with Beacon.
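Memory Efficient Implementation. We observe that the angle between $Xw$ and $Xq$ is rotation invariant. Let $X = UR$ be the QR decomposition of $X$, where $U$ has orthonormal columns and $R$ is square and upper triangular. Then we have $\cos\angle(Xw, Xq) = \cos\angle(Rw, Rq)$, which reduces the problem from dealing with an often very tall matrix $X$ to a square matrix $R$.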
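The rotation invariance is easy to confirm numerically. The following sketch uses NumPy's reduced QR factorization on random stand-in data and compares the cosine computed with the tall matrix against the one computed with the square triangular factor; the variable names are illustrative.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.default_rng(0)
m, N = 512, 16
X = rng.standard_normal((m, N))              # tall stand-in calibration matrix
w = rng.standard_normal(N)
q = rng.integers(-1, 2, size=N).astype(float)
q[0] = 1.0                                   # make sure q is not the zero vector

# Reduced QR: U is m x N with orthonormal columns, R is N x N upper triangular.
U, R = np.linalg.qr(X)

cos_tall = cosine(X @ w, X @ q)              # uses the m x N matrix
cos_square = cosine(R @ w, R @ q)            # uses only the N x N factor
```

After the one-time factorization, every cosine evaluation costs work proportional to $N$ rather than $m$, which matters when the number of calibration samples is much larger than the channel dimension.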
Handling Error Accumulation. Quantizing weights in earlier layers affects the inputs to subsequent layers. Let $X$ denote the calibration set of $m$ samples (e.g., tokens) from the original pre-trained model, and let $\widetilde{X}$ denote its counterpart from the partially quantized model. To account for the propagation of quantization error [26], one can address the mismatch between $X$ and $\widetilde{X}$ by approximately solving
$$\min_{c \in \mathbb{R},\; q \in \mathcal{A}^N} \|Xw - c\widetilde{X}q\|^2.$$
Beacon can be generalized in a memory-efficient way to handle distinct inputs $X$ and $\widetilde{X}$. Let $\widetilde{X} = UR$ be the QR decomposition of $\widetilde{X}$. Then we seek
$$\min_{c \in \mathbb{R},\; q \in \mathcal{A}^N} \|U^\top Xw - cRq\|^2.$$
Thus, the problem again reduces from approximately solving $\min_{c, q} \|Xw - c\widetilde{X}q\|^2$ to $\min_{c, q} \|Lw - cRq\|^2$ with $L = U^\top X$, so that we work with the square matrices $L$ and $R$ rather than the potentially tall matrices $X$ and $\widetilde{X}$. We refer to this variant as Beacon with error correction. Its memory-efficient implementation is summarized in the algorithm below. In the special case without error correction, there is only one input $X$, and we simply set $\widetilde{X} = X$.
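The reduction to square matrices in the error-corrected setting can be illustrated as follows. The sketch assumes the objective $\|Xw - c\widetilde{X}q\|^2$ with a QR factorization of $\widetilde{X}$ (an assumed reading of the construction above) and checks that the tall and square objectives differ only by a constant that does not depend on $(c, q)$; the names `X_tilde`, `L`, `tall_objective`, and `square_objective` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, N = 256, 8
X = rng.standard_normal((m, N))                      # inputs from the original model
X_tilde = X + 0.05 * rng.standard_normal((m, N))     # stand-in for quantized-model inputs
w = rng.standard_normal(N)

U, R = np.linalg.qr(X_tilde)                         # reduced QR of the perturbed inputs
L = U.T @ X                                          # N x N projected original inputs

def tall_objective(c, q):
    return float(np.linalg.norm(X @ w - c * (X_tilde @ q)) ** 2)

def square_objective(c, q):
    return float(np.linalg.norm(L @ w - c * (R @ q)) ** 2)

# Part of Xw outside the column space of X_tilde: constant in (c, q).
offset = float(np.linalg.norm(X @ w) ** 2 - np.linalg.norm(L @ w) ** 2)

gaps = []
for _ in range(5):
    q = rng.integers(-1, 2, size=N).astype(float)
    c = float(rng.uniform(0.1, 2.0))
    gaps.append(tall_objective(c, q) - square_objective(c, q))
```

Because the gap is the same for every candidate, minimizing the square objective selects the same $(c, q)$ as minimizing the tall one.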
Normalization Tuning. A common practice in PTQ is to add a lightweight training step to tune the unquantized parameters in batch normalization (BN) or layer normalization (LN) layers, helping to compensate for quantization error. We evaluate the effect of normalization tuning in our experiments.
4. Experiments
We first test Beacon on the DeiT-B vision transformer model ([15], 86 million parameters) on the benchmark ImageNet classification task. To that end, we use DeiT-B ( version) from the Hugging Face timm library [18]. We focus on ILSVRC-2012 [4], a 1000-category dataset with 1.28 million training images and 50 thousand validation images. All images in ILSVRC-2012 are preprocessed in the standard manner by resizing each image to and using the normalized center crop. We evaluate top-1 accuracy of the quantized models on the entire validation set. The original accuracy of DeiT-B is . We use batch size to generate the calibration data. We ran our experiments on a single Nvidia A100 GPU with 80G GPU memory.
Table 1 displays the results of quantizing DeiT-B via Beacon without error correction as the number of bits varies. Here, 1.58-bit quantization means that the grid is a scaling and shifting of $\{-1, 0, 1\}$ per channel. We display results before and after a lightweight LN tuning step applied once the whole model has been quantized. $K$ is the number of loops we applied, and it is fixed for each row. We remark that the numerically best $K$ differs slightly across variants (with or without error correction or LN tuning, and across bit widths), but is typically or . The runtime, compared to GPTQ with the same setup and machine, is reported in the last row. The LN tuning step only adds a small extra cost when training for epoch with data batches with batch size .
We compare Beacon to a recent state-of-the-art method for quantizing vision models, namely COMQ [22], and the standard baseline method GPTQ [6] in Table 2, again evaluated on DeiT-B. We implement GPTQ with asymmetric quantization on a standard per-channel min-max grid (, given by (1)) and use the results reported in Zhang et al. [22]. As the original accuracy of DeiT-B reported in Zhang et al. [22] is , which differs slightly from the we observe, here we compare the accuracy drop associated with each algorithm. The results suggest that fine grid tuning is critical to bit quantization, further highlighting the benefit of Beacon as a tuning-free method for setting the scale. Indeed, Beacon achieves the best performance in the challenging 2-bit case. We also note from the open-sourced code that Zhang et al. [22] used the entire training dataset for LN tuning and carefully initialized the scaling parameters to be for 2-bit case. In contrast, Beacon only uses a few batches for LN tuning and, by construction, requires no hyper-parameter tuning.
Table 3 displays the result of quantizing DeiT-III-L ([16], 304 million parameters, original accuracy 84.59%) with LN tuning for all methods. The first two columns are for Beacon with , with and without error correction respectively. GPTQ is still implemented with asymmetric quantization on a per-channel min-max grid.
GPTQ* uses Beacon to generate its scaling factors: Given a channel and its unscaled grid with , we first run Beacon without error correction, giving us a scaling factor for this channel to replace the choice used in GPTQ. We observe that the scaling generated by Beacon significantly improves the 1.58-bit GPTQ quantized model to a usable quality, although Beacon still outperforms it. Table 1 and Table 3 both demonstrate that Beacon yields a usable model even in the extremely challenging 1.58-bit quantization setting.
5. Conclusion
We introduced Beacon, a simple and tuning-free algorithm for post-training quantization. Unlike existing approaches that require heuristic scale selection or iterative search, Beacon directly quantizes weights using an unscaled grid and infers optimal scaling factors per channel after quantization. This streamlines the PTQ process and avoids dependence on hyperparameter tuning or back-propagation. Beacon yields models that retain compatibility with standard hardware and match the performance of more complex state-of-the-art methods, despite its minimal calibration and computational overhead. We believe that this makes it a simple and effective solution for compressing large models in resource-constrained environments.
References
- [1] S. Ashkboos, A. Mohtashami, M. L. Croci, B. Li, P. Cameron, M. Jaggi, D. Alistarh, T. Hoefler, and J. Hensman. QuaRot: Outlier-free 4-bit inference in rotated LLMs. Advances in Neural Information Processing Systems, 37:100213–100240, 2024.
- [2] W. Cheng, W. Zhang, H. Shen, Y. Cai, X. He, L. Kaokao, and Y. Liu. Optimize weight rounding via signed gradient descent for the quantization of LLMs. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 11332–11350, 2024.
- [3] Y. Cheng, D. Wang, P. Zhou, and T. Zhang. Model compression and acceleration for deep neural networks: The principles, progress, and challenges. IEEE Signal Processing Magazine, 35(1):126–136, 2018. doi: 10.1109/MSP.2017.2765695.
- [4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
- [5] L. Deng, G. Li, S. Han, L. Shi, and Y. Xie. Model compression and hardware acceleration for neural networks: A comprehensive survey. Proceedings of the IEEE, 108(4):485–532, 2020.
- [6] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. 2022.
- [7] A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer. A survey of quantization methods for efficient neural network inference. In Low-Power Computer Vision, pages 291–326. Chapman and Hall/CRC, 2022.
- [8] R. Gong, Y. Yong, S. Gu, Y. Huang, C. Lv, Y. Zhang, X. Liu, and D. Tao. LLMC: Benchmarking large language model quantization with a versatile compression toolkit. arXiv preprint arXiv:2405.06001, 2024.
